How Gemma 4 and DeepSeek V4 ended the brute-force era of LLM scaling

Susan Hill

The price of running a long-context AI conversation has been falling for two years, but the curve has been gentle. A new wave of open-weight models is making it steep. Gemma 4 from Google DeepMind and DeepSeek V4 are the most visible examples: both are substantially cheaper to operate per token than their predecessors, and both are built on architectural ideas the industry has been circling for over a year.

The cost of inference, the work a model does every time someone sends it a prompt, has historically grown faster than the context window. A 100,000-token chat does not cost a thousand times what a 100-token chat costs; it costs more like ten thousand times, because the attention mechanism that lets the model see the whole conversation revisits every earlier token at every step, so that part of the work grows with the square of the length rather than in proportion to it. That math is what made long-context AI a luxury feature for closed providers. The new open models are rewriting it.
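A rough way to see the gap is to count the token pairs that attention has to score. The sketch below assumes plain causal attention with no compression, and its function names and cost weights are illustrative rather than taken from any model. Attention work alone grows almost a millionfold between 100 and 100,000 tokens; a total bill blends that with work that grows only linearly, which is why real-world ratios land well below the pure quadratic figure.

```python
def attention_pairs(n_tokens: int) -> int:
    """Count the (query, key) pairs scored by plain causal attention.

    Token i attends to itself and every earlier token, so the total
    is 1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return n_tokens * (n_tokens + 1) // 2


def relative_cost(n_tokens: int, per_token: float, per_pair: float) -> float:
    """Toy cost model: a linear part plus a quadratic attention part.

    `per_token` and `per_pair` are hypothetical weights, not measured values.
    """
    return per_token * n_tokens + per_pair * attention_pairs(n_tokens)


short_chat, long_chat = 100, 100_000

# Attention work alone: close to the square of the 1,000x length ratio.
attention_ratio = attention_pairs(long_chat) / attention_pairs(short_chat)
print(f"attention pairs grow {attention_ratio:,.0f}x")

# With made-up weights, the blended cost falls between 1,000x and the figure above.
blended_ratio = (relative_cost(long_chat, 1.0, 0.001)
                 / relative_cost(short_chat, 1.0, 0.001))
print(f"blended cost grows {blended_ratio:,.0f}x")
```

Changing the two weights moves the blended ratio anywhere between the linear and quadratic extremes, which is the range the architectural changes described here are trying to compress.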

Three techniques carry most of the weight. Key-value sharing, the most established of the three, lets neighboring layers of the model reuse the same memory of past tokens instead of each storing a separate copy. Multi-head compression squeezes the attention heads themselves into a smaller representation while aiming to preserve what they pay attention to. Compressed attention summarizes older parts of the conversation rather than carrying them in full, so a model talking about page one of a long document does not have to remember page one in its original resolution.
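The memory effect of the three techniques can be sketched as a back-of-envelope calculation. Everything below is an assumption for illustration: the function, its parameters and the example configuration are hypothetical, head compression is modeled crudely as storing fewer key-value heads, and compressed attention is modeled as keeping a recent window in full while shrinking older tokens by a fixed ratio. Neither Gemma 4 nor DeepSeek V4 is claimed to work exactly this way.

```python
import math


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, layers_per_shared_cache=1,
                   recent_window=None, compression_ratio=1):
    """Estimate key-value cache size for one conversation (toy model)."""
    # KV sharing: layers in the same group store a single cache between them.
    stored_layers = math.ceil(n_layers / layers_per_shared_cache)

    # Compressed attention: tokens older than the recent window are kept
    # only as summaries, one slot per `compression_ratio` original tokens.
    if recent_window is None or n_tokens <= recent_window:
        stored_tokens = n_tokens
    else:
        older = n_tokens - recent_window
        stored_tokens = recent_window + math.ceil(older / compression_ratio)

    # The leading 2 accounts for one key tensor plus one value tensor.
    return (2 * stored_layers * stored_tokens
            * n_kv_heads * head_dim * bytes_per_value)


# Hypothetical configuration, not the published shape of any real model.
base = dict(n_tokens=1_000_000, n_layers=48, n_kv_heads=8, head_dim=128)

plain = kv_cache_bytes(**base)
combined = kv_cache_bytes(**{**base, "n_kv_heads": 2},
                          layers_per_shared_cache=2,
                          recent_window=8_192, compression_ratio=4)

print(f"plain cache:    {plain / 2**30:.1f} GiB")
print(f"combined cache: {combined / 2**30:.1f} GiB")
```

The point of the sketch is that the savings multiply: each technique shrinks a different factor in the same product, so combining them reduces the cache far more than any one alone.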

The numbers behind the shift are easier to read in dollars than in FLOPs. Gemma 4 at its 27B-parameter tier serves a million-token conversation for roughly a quarter of what Gemma 3 cost on the same hardware, and DeepSeek V4 has pushed its 1M-token context down to a per-call price competitive with the 200K-token windows of last year. These are not marketing numbers from a launch keynote. They come from independent benchmarks against open-weight checkpoints, and the local-AI community is already reproducing them on consumer-grade Strix Halo boxes and RTX 5090 rigs that would have struggled to even load the older versions.

None of these techniques is free. Each one trades quality for efficiency, and the open-weight teams have been clear about where the trade-offs hurt. Compressed attention can drop precision on long-range factual recall: the kind of test where you ask a model what was on page 12 of a 200-page contract. KV sharing can damage fine-grained reasoning where neighboring layers actually needed different views of the same token. Gemma 4 and DeepSeek V4 land where they do because they combine the techniques carefully and tune them for the workloads their developers actually care about.
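The recall problem with compressed attention shows up even in a toy example. The sketch below uses simple average pooling as a stand-in for compression, which is an assumption made for illustration; production models use learned summaries, and the helper name is hypothetical. One distinctive value, the "needle", sits in an otherwise empty history.

```python
def mean_pool(values, block_size):
    """Replace each block of `block_size` values with its average."""
    pooled = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        pooled.append(sum(block) / len(block))
    return pooled


history = [0.0] * 16
history[5] = 1.0  # the needle: one detail worth recalling later

summary = mean_pool(history, block_size=4)
print(summary)  # [0.0, 0.25, 0.0, 0.0]
```

The summary is a quarter of the size, but the needle's signal has dropped from 1.0 to 0.25 and its exact position inside the block is gone. That is the kind of loss behind a failed "what was on page 12" test.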

The clearest map of the territory comes from Sebastian Raschka, whose recent analysis walks through the architectural details of half a dozen recent open models and pulls out the pattern. His piece is the kind of synthesis that turns scattered engineering choices into a story about where the field is going. The story he ends up telling is straightforward: the cheap-context era is here, and it arrived through architecture rather than through raw scale.

The pressure this puts on the closed-source incumbents is the part worth watching. Anthropic, OpenAI, Meta and xAI almost certainly know the same architectural moves; what is unknown is whether they have already shipped equivalents under the hood and simply chosen not to talk about them. Public APIs from the closed labs have been getting cheaper too, but not at the same slope. The risk for the closed side is not that the open models catch up on raw capability (they often do not) but that the open models make the closed labs’ premium tier feel optional for an increasingly large set of use cases. When the same job can be done on a four-thousand-dollar desktop with Gemma 4 as on a two-hundred-dollar-a-month enterprise API, the API’s price is no longer self-evidently justified.
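The desktop-versus-API comparison comes down to break-even arithmetic. A minimal sketch using the two round figures above, with a hypothetical `local_monthly` parameter for electricity and upkeep that the comparison itself does not specify:

```python
def break_even_months(hardware_cost: float, api_monthly: float,
                      local_monthly: float = 0.0) -> float:
    """Months until a one-time hardware purchase beats a monthly API bill."""
    monthly_saving = api_monthly - local_monthly
    if monthly_saving <= 0:
        raise ValueError("local running costs meet or exceed the API bill")
    return hardware_cost / monthly_saving


print(break_even_months(4000, 200))      # 20.0
print(break_even_months(4000, 200, 40))  # 25.0
```

On these assumptions the desktop pays for itself in under two years, and sooner for any team that would otherwise need more than one API seat.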

For anyone running a model on their own hardware — the local-AI crowd, small companies that cannot pay enterprise API rates, researchers without cloud budgets — workloads that were out of reach a year ago are now on the table. A million-token context window is no longer the exclusive province of the most expensive commercial APIs. If open models can match long-context performance at a fraction of the operating cost, the price of the closed alternatives has to come down too.

The downstream effects ripple into the rest of the AI infrastructure stack. Vector database vendors and retrieval-augmented generation startups built their business cases on the assumption that long context would always be expensive enough to make external retrieval the obvious alternative. As the cost curve flips, plain in-context reading of a two-hundred-page contract or a year of email becomes architecturally simpler than chunk-embed-retrieve. GPU vendors face the inverse problem: their margin on memory-bound workloads softens as KV sharing and compressed attention drive down VRAM requirements per query. The fastest-moving teams are already designing for an environment in which long context is cheap and abundant, rather than treating it as a luxury feature to gate behind enterprise pricing.
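The link between per-query memory and hardware demand can be sketched in a few lines. The figures below are hypothetical round numbers chosen for illustration, not measurements of any GPU or model: whatever memory remains after loading the weights is divided among the key-value caches of concurrent conversations.

```python
def concurrent_queries(vram_gib: int, weights_gib: int,
                       kv_gib_per_query: int) -> int:
    """How many conversations fit in memory at once (toy model)."""
    free_gib = vram_gib - weights_gib
    if free_gib <= 0:
        return 0  # the weights alone do not fit
    return free_gib // kv_gib_per_query


# Hypothetical card with 32 GiB of memory and 16 GiB of model weights.
print(concurrent_queries(32, 16, kv_gib_per_query=8))  # 2
print(concurrent_queries(32, 16, kv_gib_per_query=2))  # 8
```

Cutting the cache per query to a quarter lets the same card serve four times as many long conversations, which is the mechanism by which cheaper context reduces the amount of memory an operator needs to buy.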

The next batch of open releases will test how far the trend can travel. Several research groups have hinted at architectures that combine all three techniques with sparse attention, the next obvious step, and at least two papers expected this summer are likely to push the cost curve down again. Whether the closed labs respond by publishing their own methods or by quietly adopting the open ones remains the open question.
