AI

Two points behind Opus 4.6, five times cheaper: Gemini 3.5 Flash redraws the math

Susan Hill

Google shipped Gemini 3.5 Flash on Monday at $1.50 per million input tokens and $9 per million output. The new model sustains over 280 output tokens per second, keeps the same one-million-token context window as its predecessor, and lands on the Artificial Analysis Intelligence Index at 55, nine points above Gemini 3 Flash. By Tuesday morning an r/Anthropic thread had pulled the chart side-by-side with Claude Opus 4.6 and asked the question this market has been circling for six months: at what point does a two-point benchmark edge stop being worth a five-times price tag?

The Intelligence Index aggregates a basket of public evaluations across reasoning, knowledge, coding, math and agentic task completion into a single score from 1 to 100. Claude Opus 4.6 in adaptive reasoning mode sits at 57. Gemini 3.5 Flash, released May 19, sits at 55. The nine-point version-over-version gain is the steepest single-step improvement Flash has ever recorded, large enough that the new model now matches Anthropic’s last-generation Sonnet on raw intelligence at a fraction of Sonnet’s cost.

The “smarter” framing the Reddit thread used overstates the gap in Flash’s favour. On the underlying Intelligence Index, Opus 4.6 is still ahead by two points. The chart that broke the thread is not the Intelligence Index in isolation. It is the intelligence-efficiency-versus-cost view, where the axis itself is doing different work, and where Flash 3.5 does not just beat Opus 4.6. It sits in a class with nothing else in the same neighbourhood.

Opus 4.6 charges around $6.25 per million input tokens and $25 per million output. Flash charges $1.50 and $9. For a chat workload weighted two-to-one in output, the effective price ratio sits closer to 4.5x than the round “five-times” the thread headlined. The rounding is fair. Speed makes the picture worse for the flagship: Flash 3.5 sustains over 280 output tokens per second, Opus 4.6 in max-effort reasoning mode runs at roughly a tenth of that pace on the same benchmark suite. For products where a user is staring at a cursor — coding assistants, support agents, anything interactive — latency is a feature the price tag does not buy back.

A year ago the case for buying the most expensive model was a one-line argument. The quality lift to the next tier was steep enough that the price differential was a rounding error against the value delivered. The chart the Reddit thread pasted is a different chart. The marginal cost of the last two intelligence points has become the entire pricing decision for production loads, and the rounding error now lands closer to $4.75 of every six dollars spent.

There is a clean argument for keeping Opus 4.6 in a stack. Long-context reasoning across hundreds of pages, agent loops where errors compound across steps, document analysis where a two-point gap on a basket benchmark hides much larger task-specific edges. Opus is still the model engineers reach for when the failure mode is “the answer was wrong,” not “the answer was late.” The share of production workloads that look like that is shrinking. It is not zero, and it is the slice where the $25-per-million pricing earns its keep.

The chat-style turns that drive most billable tokens — drafting, summarising, classifying, translating, code completion, customer-facing reasoning — all live in Flash’s reach. The decision engineering teams run every quarter is no longer “which model is best.” It is “which model is best per dollar at acceptable latency.” That second decision Flash now wins by a margin that does not require subtlety to interpret.

The Reddit thread’s secondary framing, that consensus everywhere is that Opus 4.6 is better than 4.7, deserves a softer treatment. It is anecdotal. Anthropic’s two most recent Opus releases have drawn split reviews on coding evaluations and tool-use rigor, with some teams reporting regressions on long-running agent loops in 4.7 and others reporting clean wins on identical workloads. Both observations can be true at once when behaviour is being tuned across many axes between minor versions. The two models also benchmark within a point of each other on the public index, so the community split tracks closer to taste than to capability. What is not in dispute is that the price of either Opus does not budge.

The deeper signal in the Reddit conversation is what users were not arguing about. Nobody in the thread defended Opus’s pricing on first principles. The defences offered were workload-specific. Opus still wins on this agent loop. Opus stays for our document-review pipeline. Those are real, but they are workload defences, not flagship defences. A flagship is supposed to win on the spread, not on a specific lane.

Two-point intelligence gap. Five-times price tag. Six-times speed advantage in the other direction. A one-million-token context window at $1.50 per million input. Multimodal in, agentic-task Elo over 1650, ninety-percent discount on cached input. Anthropic’s response in the next quarter will tell its own story. The harder argument to write, in May 2026, is the one a salesperson has to walk into a customer meeting carrying.

Discussion

There are 0 comments.