Editor’s note (2026-04-20 audit): An earlier version of this article marked nemotron-3-super as NOT RUNNABLE and reported a 14-minute cold-start on glm-5.1. Both were wrong. nemotron-3-super:cloud exists as a hosted model alongside the local-weights quants; I confused the two. The 14-minute glm-5.1 timing was a harness error on our end, not model behavior. A first audit pass on 2026-04-20 re-harnessed nemotron for one session and excised the glm-5.1 error. A second pass the same day expanded nemotron to three sessions for parity with the other eight models’ retest protocol, and surfaced two consistency issues that a single session would have hidden — now discussed in §4. The scorecard, axis tables, Tier summary, and per-model cards below reflect the final 3-session aggregates.

If you arrived here from the Anthropic subscription change piece — that’s the “what happened.” This is the “what’s the alternative, and which ones actually work.”

Over a Friday night into Saturday morning (plus two audit passes on publish-day Monday), I ran nine hosted LLMs through a real ops shift and scored them on the dimensions that matter for managed-service work. Not a chat benchmark. Not a reasoning score. An operator trial: does the model read the dashboard, pick the right ticket, call the right tool, hand off cleanly when its context fills, and — critically — not lie about having done work it never did.

Eight of the nine completed the trial. Three are Tier 1 for operator work today, four cluster into “viable with caveats,” and one completed but is too weak on cross-file synthesis to ship. The ninth hung halfway through the first ritual and failed.

This is the full write-up. If you only want the scorecard, it’s a few scrolls down. If you want to reproduce the trial yourself, scroll to the bottom — there’s a summary of the methodology bundle and how to request the full source.

Why this matters to an MSP — or any small business running an operator

Managed service providers live or die on L1 triage. Someone answers the phone or the chat, reads the dashboard, decides what’s urgent, keeps the client informed, and knows when to escalate. That’s not a coding task. It’s not a reasoning-benchmark task. It’s a tool-picking, context-holding, don’t-hallucinate, know-when-to-say-“I’m-not-sure” task — and it runs for hours or days without the operator losing the thread.

Small businesses that deploy AI have effectively the same requirement, just at smaller scale. The “AI assistant” answering emails, booking appointments, or triaging support tickets has to do the same things: call the right tool, hold the context, not make things up. Every failure mode that matters in an MSP operator matters in an SMB customer-service assistant, just with fewer clients in the blast radius.

The industry numbers on this are brutal. RAND put 2025 AI-agent project failure in production at 80–90%. MIT put company-wide AI deployment failure at 95%. The reason cited most often is “execution gap” — models that reason well on isolated prompts but can’t actually move work through a queue without losing track of state or fabricating an all-clear.

So I didn’t test for benchmark scores. I tested for operator reliability. Five dimensions:

  • Tool fidelity — does the model call the right tool with the right arguments, or does it hallucinate tool names and fake “done” messages?
  • Correctness — does its answer match the ground truth in the fixture data?
  • Latency — wall-clock time from prompt to response, in warm state.
  • Context discipline — does it re-read files it already has in context?
  • Handoff integrity — when its context fills and it must pass work to a successor generation of itself, does the successor get a usable brief?

The headline result: zero hallucinations across the eight that completed the trial. That’s a higher floor than I expected, and it’s worth more than any single benchmark score.

Model selection

All nine candidates came from ollama.com filtered by three simultaneous criteria as of April 18:

  • Cloud — hosted inference (not local weights)
  • Tools — declared tool-use capability
  • Thinking — declared reasoning / thinking mode

That filter produced the nine-model list below. All nine carry a :cloud tag on ollama.com and ran on the same harness.

The trial harness was a standard Ollama Cloud invocation wrapping the Claude Code CLI. Each model got a fresh session, was registered as the active operator against a Signal group I use for ops dispatch, and was driven through six operator prompts plus an autonomous-handoff trigger.

The subject matter was a set of synthetic ops fixtures — a TODO list, a service-health JSON, an incidents log — accessible through a purpose-built stdio MCP server with eight synthetic tools (mock_todo_list, mock_service_health, mock_incidents_list, plus action tools like mock_restart_service and mock_rotate_api_key that return plausible confirmations without touching any real systems). Using MCP tools instead of raw file reads isolates tool-picking fidelity as a measurable dimension.
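
For a concrete sense of what the harness was calling, here is a minimal sketch of that kind of stdio MCP server. It is illustrative only: the FastMCP helper from the official MCP Python SDK and the specific return payloads are assumptions on my part; the actual ~250-line server ships in the methodology bundle.

```python
# Minimal sketch of a stdio MCP server exposing two of the synthetic ops
# tools described above. Illustrative only -- the bundle's server is the
# real reference. Assumes the official MCP Python SDK ("mcp" package).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("mock-ops")

@mcp.tool()
def mock_service_health() -> dict:
    """Return a canned service-health snapshot; never touches a real system."""
    return {
        "backup-runner": {"status": "stale", "last_success_hours_ago": 48},
        "mqtt-broker": {"status": "degraded", "p99_latency_ms": 780},
    }

@mcp.tool()
def mock_restart_service(name: str) -> dict:
    """Return a plausible confirmation without restarting anything."""
    return {"service": name, "action": "restart", "result": "simulated-ok"}

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```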

Three retest passes per model across all nine — the original eight on the April 18–19 run and the nemotron-3-super audit set on April 20. 47 JSONL session traces in total. Aggregated metrics below.

The scorecard

| # | Model | Verdict | Median warm latency | Tool fidelity | Handoff preserved | Notable |
|---|-------|---------|---------------------|---------------|-------------------|---------|
| 1 | glm-5.1:cloud | PASS | 24s P0, ~40s warm | 0 hallucinations + emergent cross-ref | yes | Cross-file reasoning leader; emergent T1↔monitoring insight |
| 2 | minimax-m2.7:cloud | STRONG PASS | 10s | clean | yes | Fastest median |
| 3 | gemma4:31b-cloud | PARTIAL | 35s | cross-ref miss | yes | Weakest on synthesis |
| 4 | nemotron-3-super:cloud | PASS | ~7s | clean, 62% productive (3 sessions; 43–80% range) | not tested | High session-to-session variance; zero-emoji replies |
| 5 | qwen3.5:397b-cloud | STRONG PASS | 12s | clean | yes | Live-traffic-first prioritization |
| 6 | kimi-k2.5:cloud | PASS | 19s | clean | yes | Solid mid-tier |
| 7 | gemini-3-flash-preview:cloud | STRONG PASS | 6s | clean | yes | Fastest single prompts (4–5s floor) |
| 8 | deepseek-v3.2:cloud | FAIL | — | hung on multi-file | not reached | Repeatable hang on synthesis prompts |
| 9 | gpt-oss:120b-cloud | PASS | 10s | clean | yes | Reliable, probe-heavy |

Three axes the usual benchmarks don’t show

Warm latency and tool fidelity are the obvious dimensions. Three more emerged from the JSONL traces, and they turn out to matter more than either for anyone running an operator-style deployment.

Axis A — Operational lifetime (turns until a shift ends)

An operator’s shift ends when its context fills. Context grows every turn — user messages, tool results, the model’s own output all accumulate. The useful number isn’t the advertised context window size; it’s how many ops-ritual turns actually fit inside that window before the operator has to hand off.

Using measured avg_input_tokens_growth_per_turn from each model’s traces, and projecting 80% of the advertised context window as the handoff trigger:

| Model | Context window | Avg growth / turn | Operational lifetime (turns) |
|-------|----------------|-------------------|------------------------------|
| gemini-3-flash-preview | 1,000,000 | ~2,300 | ~351 |
| glm-5.1 | 200,000 | ~2,500 | 64 |
| nemotron-3-super | 262,144 | ~3,300 | 62 |
| minimax-m2.7 | 200,000 | ~2,600 | 61 |
| kimi-k2.5 | 262,144 | ~3,400 | 61 |
| qwen3.5:397b | 262,144 | ~4,600 | 45 |
| gpt-oss:120b | 131,072 | ~2,900 | 36 |
| gemma4:31b | 131,072 | ~3,300 | 32 |
| deepseek-v3.2 | 163,840 | N/A (hung) | N/A |
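
The arithmetic behind the lifetime column is deliberately simple: turns until the measured per-turn growth fills 80% of the advertised window. A minimal sketch using the rounded figures from the table (the published numbers come from the unrounded per-trace measurements, so they differ by a few turns):

```python
# Operational-lifetime projection: turns until 80% of the advertised
# context window is consumed at the measured per-turn input growth.
# Rounded figures from the table above, for three representative models.
HANDOFF_THRESHOLD = 0.80

models = {
    # model: (advertised context window, avg input-token growth per turn)
    "gemini-3-flash-preview": (1_000_000, 2_300),
    "glm-5.1": (200_000, 2_500),
    "gpt-oss:120b": (131_072, 2_900),
}

for name, (window, growth) in models.items():
    lifetime = int(window * HANDOFF_THRESHOLD / growth)
    print(f"{name}: ~{lifetime} turns before handoff")
```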

Gemini-3-flash’s 1M context translates to roughly 5x the operational lifetime of the 200k-class competitors. Where the 200k models break a shift after about two hours, gemini runs for the better part of a day without a handoff. For a pager-rotation operator that’s not a latency win — it’s a staffing win. One operator generation can cover an entire overnight watch instead of self-handing-off every couple of hours.

Qwen ties for the second-largest declared context but has the shortest usable lifetime in its class, because its input-growth-per-turn runs about 50% higher than the pack’s. It retains more of every tool result inside its working reasoning. That’s great for depth on a single hard problem; it’s a handicap on a long watch.

Axis B — Tool-use efficiency (productive vs probe)

Every tool call is one of two types:

  • Productive — the synthetic ops tools or the outbound message tool (the work the operator is actually paid to do).
  • Probe — Bash, Read, Grep, Glob (self-exploration, file inspection, environment checks).

Probe calls burn context without advancing the queue.

| Model | Total tool calls | Productive | Probe | Productive ratio |
|-------|------------------|------------|-------|------------------|
| nemotron-3-super | 34 | 21 | 13 | 61.8% (3 sessions)* |
| qwen3.5:397b | 110 | 39 | 70 | 35.5% |
| gemini-3-flash-preview | 125 | 44 | 79 | 35.2% |
| kimi-k2.5 | 112 | 39 | 73 | 34.8% |
| gemma4:31b | 106 | 35 | 71 | 33.0% |
| minimax-m2.7 | 149 | 48 | 97 | 32.2% |
| glm-5.1 | 114 | 35 | 79 | 30.7% |
| gpt-oss:120b | 98 | 17 | 81 | 17.3% |
| deepseek-v3.2 | 15 | 2 | 11 | 13.3% (failed) |
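
As a sketch of how that split falls out of a session trace: classify each tool call as probe (Bash, Read, Grep, Glob) or productive (everything else, i.e. the mock_ ops tools and the outbound send), then take the ratio. The field name used here is illustrative, not the bundle's actual trace schema.

```python
# Productive-vs-probe split over one JSONL session trace. The field name
# "tool_name" is illustrative; the bundle's scoring script reads the
# actual trace schema.
import json
from pathlib import Path

PROBE_TOOLS = {"Bash", "Read", "Grep", "Glob"}  # self-exploration calls

def productive_ratio(trace_path: Path) -> float:
    productive = probe = 0
    for line in trace_path.read_text().splitlines():
        if not line.strip():
            continue
        tool = json.loads(line).get("tool_name")
        if tool is None:
            continue  # not a tool-call event
        if tool in PROBE_TOOLS:
            probe += 1
        else:
            productive += 1  # mock_* ops tools and the outbound message send
    total = productive + probe
    return productive / total if total else 0.0
```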

* Nemotron-3-super was added via the 2026-04-20 audit and ran three full passes for parity with the other eight models’ retest protocol. Per-session productive ratios: 70.0% (gen 1), 42.9% (gen 2), 80.0% (gen 3) — mean 61.8%, range 42.9–80.0% (37-pp spread). That spread is in line with the per-session variance the other models show on this axis; the 61.8% mean, however, is roughly 25 pp above the pack’s 30–35% cluster and worth noting even with the caveat.

The working pack clusters around 30–35%, which is about right for the six-prompt ritual — each prompt legitimately needs two or three probes to load files and check state before the productive call.

The gpt-oss:120b outlier is the interesting one. 83% of its tool calls were probes — meaning it kept re-reading files already in context and shelling out for things it already knew. That’s both a context waster (it pushes its short ~131k window faster) and a latency tax (probes take real time). The model is reliable and it gets to the right answers; it just works harder than it needs to. For a business deployment, that means higher running cost per resolved ticket.

Axis C — Wordiness (characters and emoji per outbound message)

Operators communicate via short outbound messages. The payload of each send is pure operator prose — not reasoning tokens, not tool results, just the reply the human reads. That’s the right place to measure communicative style.

| Model | Avg chars / message | Avg prose / turn | Emoji per 1k chars | Style |
|-------|---------------------|------------------|--------------------|-------|
| glm-5.1 | 770 | 154 | 1.02 | Paragraph-first, dense, freely uses Markdown tables |
| qwen3.5:397b | 539 | 117 | 1.21 | Structured, bullet-heavy, action-ready phrasing |
| minimax-m2.7 | 460 | 108 | 0.58 | Terse, priority-queue format, near-zero emoji |
| kimi-k2.5 | 453 | 100 | 1.60 | Emoji-forward, upbeat tone |
| nemotron-3-super | 401 | 372† | 0.00 | Reply length varies widely (305–548 chars across 3 sessions); long internal reasoning; zero emojis |
| gpt-oss:120b | 385 | 75 | 0.90 | Compact, bullet-heavy, low ceremony |
| gemini-3-flash-preview | 258 | 52 | 0.00 | Telegram-short, all-business, zero emojis |
| gemma4:31b | 208 | 46 | 0.00 | Very compact, fragmentary |
| deepseek-v3.2 | 78 | 0 | 0.00 | (only partial run) |

Nemotron’s combined-prose-per-turn figure is inflated by its internal reasoning text blocks; the 401-character outbound average (range 305–548 across three sessions) is the better indicator of operator-facing tone.
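
For completeness, the two headline Axis C numbers reduce to a few lines once the outbound message text has been pulled out of the traces (that extraction is the scoring script's job; treating the Unicode "Symbol, other" category as emoji is my rough approximation here, not necessarily the bundle's exact method):

```python
# Rough Axis C metrics over a model's outbound messages. Treating the
# Unicode "Symbol, other" category as emoji is an approximation.
import unicodedata

def avg_chars_per_message(messages: list[str]) -> float:
    return sum(len(m) for m in messages) / len(messages) if messages else 0.0

def emoji_per_1k_chars(messages: list[str]) -> float:
    text = "".join(messages)
    emoji = sum(1 for ch in text if unicodedata.category(ch) == "So")
    return 1000 * emoji / len(text) if text else 0.0
```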

Three style clusters emerge:

  • The short-message operator (gemini-3-flash, gemma4, gpt-oss) — under 400-character replies, factual, no ornament. Feels like a pager alert.
  • The briefing operator (minimax, kimi, qwen, nemotron on its long-reply days) — 400–550 characters, structured Markdown, willing to add context. Feels like a stand-up update.
  • The analyst operator (glm-5.1) — 750+ characters, tables, multi-paragraph. Feels like a morning report.

Kimi’s 1.60 emoji-per-1k is the highest in the trial; minimax’s 0.58 is the lowest among the models that use any emoji at all. Gemini, gemma, and nemotron emit zero across every session observed.

This is not a best/worst axis. It’s a fit axis. An internal on-call pager probably wants gemini’s brevity. A client-facing CEO briefing email probably wants glm-5.1’s depth. Match the model’s natural register to the audience, rather than forcing a single voice everywhere.

Per-model notes

1. glm-5.1:cloud

“GLM-5.1 is our next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor.” — ollama.com

Five rerun sessions under the audited harness. Prompt 0 (a five-part environment-discovery query) came back in 24 seconds; warm-state prompts landed in the 25–50 second range. Not snappy like gemini or minimax, but never stalled.

Editor’s note from the 2026-04-20 audit pass: an earlier draft reported a 14-minute P0 “cold-start” on glm-5.1’s first session. That latency turned out to be our own harness issue, not model behavior, and has been excised. All glm-5.1 axis numbers above are now computed from the five rerun sessions only.

The single best finding from glm-5.1: on a cross-file reasoning prompt (combining the incidents log, the TODO list, and the service-health feed), it surfaced an unprompted insight — the API key rotation at the top of the TODO list is the monitoring chain’s linchpin, because the metrics exporter that feeds the dashboards runs under that key. If the key expires mid-investigation, the dashboards go dark exactly when you need them most. Nothing in the prompt suggested the model look for that chain. Multiple runs caught it.

Verdict: Tier 2. Depth-of-reasoning pick; slower warm-state than the Tier 1 speedsters but the only model that reliably surfaced the emergent T1↔monitoring cross-ref.

2. minimax-m2.7:cloud

“MiniMax’s M2-series model for coding, agentic workflows, and professional productivity.” — ollama.com

This one had failed under a different agent harness a few weeks before the trial — it had hallucinated a “message delivered” confirmation without actually calling the send tool. Under the Claude Code harness used in this trial, same model, no issues. Six prompts in about 100 seconds of total wall clock. Zero hallucinations. Clean handoff with model preservation.

The retest result says the earlier incident was the prior harness’s tool-call parser, not the model. Minimax belongs in Tier 1.

Verdict: Tier 1. Fastest median latency in the trial.

3. gemma4:31b-cloud

“Gemma 4 models are designed to deliver frontier-level performance at each size. Well-suited for reasoning, agentic workflows, coding, and multimodal understanding.” — ollama.com

Mixed run. Good tool fidelity — used the incidents query with the cleanest argument filtering of any model in the trial. But on the cross-file reasoning prompt it said the active latency incident matched the “degraded” status in service-health and wasn’t directly related to any item in the TODO. That’s factually true but incomplete — the chain-linchpin reasoning glm-5.1 and minimax both caught is exactly the kind of synthesis an ops operator needs.

It also sent an “interim” ack message — “I’m gathering those details now and will provide the full list in my next message” — before actually completing Prompt 0. That’s a behavioral tell: gemma likes to ack first, work later. For a real operator under human observation that’s fine. For one running autonomously against a queue, it’s a hazard.

Verdict: Tier 3. Fine for straight lookups; not ready for the synthesis work L1 triage actually needs.

4. nemotron-3-super:cloud

“NVIDIA Nemotron 3 Super is a 120B open MoE model activating just 12B parameters to deliver maximum compute efficiency and accuracy for complex multi-agent applications.” — ollama.com

An earlier draft of this article marked nemotron as NOT RUNNABLE. That was wrong: nemotron-3-super:cloud exists alongside an 86-GB local quant (the local tag is what had errored out at load time on the original host). On the 2026-04-20 audit passes I reran it through the same harness as the other eight models. ollama show nemotron-3-super:cloud reports 120B params, 12B activations, 262K context, NVFP4, completion + thinking + tools.

Three audit sessions for parity with the other eight models’ three-retest-pass protocol. All three ran P0 → P5 cleanly; zero hallucinations across the set. Aggregate numbers: 61.8% productive ratio (range 42.9–80.0% across gens 1/2/3), 401 average characters per outbound message (range 305–548), zero emojis, operational lifetime ~62 turns, ~7-second median warm latency.

Two quirks that the multi-session pass surfaced and that a single session would have hidden:

  1. P3 prioritization inconsistency. Two sessions (gens 1 and 3) chose backup-runner first on the service-health triage (consensus framing: RPO breach, data loss risk). One session (gen 2) chose mqtt-broker first (qwen-style framing: live-traffic impact beats chronic staleness). Same prompt, same fixture data, same model — different operator philosophy. That’s a consistency red flag for a Tier 1 client-facing operator: you don’t want the first-pick to swing on which successor generation happens to answer the pager.
  2. P0 self-report drift. Each session reported a different CLAUDE.md line count (gen 1: 50, gen 2: 104, gen 3: 32) and a different advertised context window (gen 1: “not explicitly stated”, gen 2: 32,768, gen 3: 8,192). The actual CLAUDE.md is 104 lines, and ollama show nemotron-3-super:cloud reports 262,144 ctx. So gen 2 was right on the file count and wrong on context; gens 1 and 3 were wrong on both. Across three sessions the model never correctly matched its advertised capability to its own introspection.

Behaviorally: nemotron’s outbound style clusters with the briefing operators on average (400-character replies), but the per-session range spans from pager-short (305) to full briefing (548). Got the obvious mqtt↔INCIDENT cross-ref on the synthesis prompt in all three sessions, but never surfaced the emergent T1↔monitoring insight that glm-5.1 and minimax both caught.

The handoff axis remains untested. P6 autonomous-handoff was deliberately skipped across all three audit sessions to avoid spawning multiple successor generations during a docs-correction pass. A follow-up trial with the handoff configuration pointing at nemotron would close that axis.

Verdict: Tier 2. Fast and clean on tool fidelity, but the P3 prioritization inconsistency and the P0 self-report drift across three sessions are the kind of thing you want consistent on a client-facing operator. Needs the handoff-preservation verification and ideally a harness change that pins its context-window introspection before moving up to Tier 1.

5. qwen3.5:397b-cloud

“Qwen 3.5 is a family of open-source multimodal models that delivers exceptional utility and performance.” — ollama.com

The philosophically interesting one. On the service-health triage prompt — two items needed attention, a backup runner stale for two days and a message broker degraded with 780ms tail latency since yesterday — every other model picked backup-runner first, arguing data-loss risk trumps performance risk.

Qwen picked the message broker first, with this reasoning:

“it’s actively degraded with high tail latency affecting live traffic … The backup-runner is stale but not actively harming users; the broker is.”

Both are defensible operator judgments. The others lean on severity-class (data loss > latency). Qwen leans on live-traffic impact (acute > chronic). For an SLA-driven shop this framing is arguably more aligned with the business, not less. Different model, different default stance.

Also worth flagging: qwen’s response to the API key rotation prompt ended with “I can execute the rotation now if you approve.” It volunteered to actually call the rotation tool. Others reported the recommendation and waited. Qwen leans more action-ready.

Verdict: Tier 1. Include on the ship list, but flag the “live-traffic-first” disposition so clients know what they’re getting.

6. kimi-k2.5:cloud

“Kimi K2.5 is an open-source, native multimodal agentic model that integrates vision and language understanding with advanced agentic capabilities.” — ollama.com

Middle of the pack. Median latency ~19 seconds, zero hallucinations, correct cross-referencing on the synthesis prompt, clean handoff. No distinctive quirks — which is itself a good sign for an operator role. You don’t want distinctive quirks; you want consistency.

Verdict: Tier 2. Reliable second choice.

7. gemini-3-flash-preview:cloud

“Gemini 3 Flash offers frontier intelligence built for speed at a fraction of the cost.” — ollama.com

The tagline is not lying. Three consecutive prompts returned in four seconds each — prompt delivery to message reply. The per-prompt median was six seconds; only the standup ritual (which runs about ten bash commands and loads a skill) stretched past twenty. For interactive operator work this feels instant.

Gemini’s 1M context window is advertised but went unused at this trial scale. The six-prompt sequence consumed about 34% of some models’ contexts; gemini would have absorbed the whole trial without noticing.

Verdict: Tier 1. Speed leader, big context headroom, clean on every axis.

8. deepseek-v3.2:cloud

“DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance.” — ollama.com

Prompt 0 was excellent — the only model in the trial to successfully self-report its context window size (163,840 tokens, via an ollama show probe). Thorough diagnostic, accurate file analysis. Took 46 seconds but looked on track.

Then Prompt 1 arrived. The standup skill loaded. The skill’s instruction text appeared in context. Deepseek stopped. Not a crash, not a hallucination, not a fake “delivered” message — just stopped. No tool calls, no outbound message, no progress. I waited five minutes before calling it.

A follow-on experiment bypassed the standup skill entirely and handed deepseek the six operator prompts directly. Results:

  • P0 (enumeration): API Error 400, malformed request.
  • P2 (TODO triage): PASS.
  • P3 (health triage): HUNG.
  • P4 (service-health cross-ref): PASS.
  • P5 (incidents cross-ref): HUNG — same failure mode as the standup run.

Bypassing the multi-step skill doesn’t rescue the model. The hang correlates with prompts that require multi-file reasoning (P3, P5). The standup-skill hang, in other words, wasn’t caused by the skill’s multi-step structure; the underlying issue is the multi-file reasoning itself.

Verdict: Tier 3. Not operator-viable at this snapshot. Upstream diagnosis required.

9. gpt-oss:120b-cloud

“OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases.” — ollama.com

Solid, unremarkable, reliable. Median warm prompt latency ~8 seconds, standup ritual at 56s, handoff at 51s to a successful model-preserved successor generation. Tied with gemma4 for the smallest context window in the list (~131k), but not stressed by this trial.

This is what a “safe default” Tier 2 model looks like. If I had to pick a single model to hand a client who just wanted “something reliable that does the job without drama,” this is probably it. The probe-ratio finding means it costs slightly more to run per resolved ticket than the Tier 1 models, but that’s a known, small, steady number — not a cliff.

Verdict: Tier 2. Ship when you want deterministic reliability over peak speed.

Tier summary

Tier 1 — ship to clients

  • minimax-m2.7:cloud — fastest median latency, clean on every axis
  • gemini-3-flash-preview:cloud — speed leader plus 1M context (5x operational lifetime)
  • qwen3.5:397b-cloud — strong, with a live-traffic-first disposition to flag

Tier 2 — viable with caveats

  • glm-5.1:cloud — depth-of-reasoning but slower warm-state; best emergent synthesis of any model tested
  • kimi-k2.5:cloud — reliable middle-of-pack
  • gpt-oss:120b-cloud — deterministic but probe-heavy; small context
  • nemotron-3-super:cloud — 3-session audit rerun PASS; the P3 prioritization swing and the P0 self-report drift are the consistency gaps to close, plus the handoff-preservation verification, before it moves up

Tier 3 — don’t ship

  • gemma4:31b-cloud — cross-file synthesis weakness; ack-first pattern
  • deepseek-v3.2:cloud — hung on multi-file reasoning prompts, repeat failure

What this trial doesn’t test

  • Long-session context erosion. The six-prompt sequence consumed 7–35% of advertised context. Real operator shifts run for days.
  • Cold-start behavior of any model. All trials ran during warm windows on Ollama Cloud. A genuine cold-shift test requires coordinated overnight-idle retests that I haven’t run. (An earlier draft of this article claimed a 14-minute cold-start on glm-5.1; that was a harness error on our end, not model behavior, and has been excised.)
  • nemotron-3-super handoff integrity. Seven of the nine models had their autonomous-handoff verified end-to-end; deepseek-v3.2 never reached that prompt, and all three nemotron audit sessions deliberately skipped it to avoid spawning multiple successor generations during a docs-correction pass. A follow-up nemotron trial closes that last axis.
  • Concurrent-load behavior. All trials were single-operator.
  • Tool-call error recovery. The synthetic ops tools never failed. A real ops tool returning a 500 would reveal another reliability axis.
  • Longer autonomous-handoff chains. Each trial tested exactly one handoff. Chains of three or more handoffs on the same model aren’t covered.

These are the obvious next trials. The Tier 1 models get the full long-shift + concurrent-load + error-recovery matrix first; nemotron-3-super gets the handoff verification and ideally a fourth or fifth session before a Tier 1 reconsideration.

What’s in the methodology bundle

The trial is reproducible. The methodology bundle — which I’m happy to send on request — contains:

  • The mock-ops MCP server source (~250 lines of Python stdio, 8 synthetic ops tools returning plausible JSON without touching any real system).
  • The fixture files used as subject matter (synthetic TODO, service-health, incidents log).
  • The reproduction walkthrough — step-by-step from “pull the host bundle” through “launch the operator” to “score from JSONL” — including the scoring script that produced the numbers in this article.
  • The incident sidebar — four harness infrastructure fixes that shipped mid-trial when the trial itself stressed the framework (dispatcher fall-through, handoff-script launch-command override, session-ID lookup correctness, and a stdin-submission gotcha that compresses trial time from ~17 minutes per model to ~2–3).
  • The median transcripts, including the full nine-model verbatim signal-send corpora and the three nemotron audit-rerun transcripts (gen 1 is the numerical median of the three).
  • The raw JSONL traces (47 files) and the aggregated trial_summary.v2.json that underlies every number in the tables above. Every metric in this article is reproducible from that data.

If you’re evaluating one of these models for your own deployment and want to run the same trial against a private fixture set of your own, that bundle is the starting point. Email jon@sandhillscto.com with the subject line “Ollama trial bundle” and I’ll send it over.

A note on methodology and the agents that ran the harness

This trial was designed, scoped, and iteratively refined by me. The model-selection criteria, the ops-ritual prompt sequence, the five reliability dimensions, the three scoring axes beyond latency (operational lifetime, tool-use efficiency, wordiness), the tier-assignment rubric, and the editorial conclusions in this article are all my work.

The implementation and structured input/output — running each model through the harness, driving the terminal sessions, collecting JSONL traces, computing aggregate metrics, and producing the reproducible summary JSON — was done by an agent running Claude Opus 4.7 with 1M context. That’s the “routine stuff”: a lot of repetitive session-starting, prompt-injecting, trace-reading, and number-crunching that would have taken me a week of evenings to do by hand.

A second agent, also running Claude Opus 4.7, performed the 2026-04-20 audit work — first a single-session nemotron rerun to correct the NOT RUNNABLE misclassification, then, once I noticed the scoring axes needed parity with the other eight models, an expanded three-session pass. The second pass is what surfaced the P3 prioritization inconsistency and the P0 self-report drift that a one-session read would never have caught. The pattern generalizes: if a model’s numbers look surprisingly clean on a single session, run two more before publishing anything downstream-consequential.

I call this out for two reasons. First, for integrity: you should know what I touched directly, what I delegated, and when a fact in the article has been revised. Second, because the pattern itself is worth naming. The trial — and the audit passes on top of it — is an example of the deployment shape I wrote about in 40% of Small Businesses Will Have an AI Agent by December — scoped job, defined success criteria, bounded permissions, a human reviewing the output, and a correction loop when the first pass is wrong. It is exactly the kind of work an agent is good at and a human shouldn’t be doing, and it’s how I’d shape any comparable deployment for a client.

The reproducibility bundle described above lets any reader verify every number independently, including the corrections. That is the guardrail on the agent’s half of the work.

The bottom line

The industry’s headline number — 80%+ of AI agent projects fail in production — is almost certainly true. But the failure isn’t because the models are weak. Eight of nine cloud-hosted models I tested passed a real operator ritual with zero hallucinations.

The failure comes from picking the wrong model for the job, or running a model that passes single-turn benchmarks but can’t hold context across a shift, or deploying one whose style is mismatched to the audience, or — most commonly — not actually measuring any of this before putting a client in front of it.

A 90-minute trial against a synthetic ops fixture isn’t a substitute for production experience. But it is a substitute for picking a model based on a vendor blog post. For anyone making the switch decision right now — whether that’s because of the Anthropic re-pricing, a contract renewal coming up, or just the dawning sense that “we’re running on Model X because we always have” isn’t a strategy — the three Tier 1 models are where I’d start a real evaluation.

The Bottom Line

  • Nine Ollama Cloud models tested against a realistic operator workload over ~48 hours plus two audit passes. Eight passed. Zero hallucinations across the passers.
  • Tier 1 (ship): minimax-m2.7, gemini-3-flash-preview, qwen3.5:397b.
  • Tier 2 (viable with caveats): glm-5.1 (depth-of-reasoning, slower warm-state), kimi-k2.5, gpt-oss:120b (probe-heavy), nemotron-3-super (3-session audit; P3 prioritization swing and P0 self-report drift to close before a Tier 1 move).
  • Tier 3 (don’t): gemma4:31b (synthesis weakness), deepseek-v3.2 (multi-file hang).
  • The hidden axes that matter as much as latency: operational lifetime (how many turns before handoff), tool-use efficiency (productive vs probe call ratio), and wordiness (matching operator voice to audience).
  • Full methodology bundle — harness source, reproduction walkthrough, raw traces, incident sidebar — available on request: email jon@sandhillscto.com subject “Ollama trial bundle”.

If you’re evaluating an AI model for a real workload in your business and want an independent second opinion on what will actually hold up under load, that’s the kind of evaluation I do with clients. Let’s talk.

Keep reading: The AI Subscription You Bought for Your Business May Not Cover the Tools You’re Actually Using is the policy change that made this trial worth running. The Small Business Case for Local AI Inference runs the math on self-hosting. The OpenClaw Ecosystem in 2026 covers the surrounding tooling.

Published 2026-04-20. Article corrected on publish-day via two audit passes: the first reclassified nemotron-3-super:cloud PASS / Tier 2 from one audit session and excised a 14-minute cold-start previously attributed to glm-5.1 as a harness error on our end. The second expanded nemotron to three sessions for parity with the other eight models’ retest protocol; that expansion surfaced a P3 prioritization inconsistency (2/3 consensus with one live-traffic-first pick) and a P0 self-report drift (three different CLAUDE.md line counts and three different context-window readings across three sessions). Both are now called out in §4 and the Tier summary. See the editor’s note at the top of this article for the full change log.