Claude Opus 4.8 catches four times more of its own coding mistakes

Susan Hill

Anthropic has upgraded its most capable model to Claude Opus 4.8, and the headline change is not a bigger brain but a more cautious one. The company says the model is roughly four times less likely than the version it replaces to let flaws in its own code pass unremarked, and that it is more willing to flag the parts of a task it is unsure about. For people who hand off real work to an AI, whether that means writing code, running an analysis, or operating a computer, that reliability is the spec that actually counts.

The failure mode of today’s AI agents is not stupidity but confidence. They produce output that looks finished and reads cleanly while quietly carrying mistakes, and a system left to run on its own tends to build the next step on top of the last error. Hand an agent a multi-step job and a single wrong assumption early on can propagate through everything that follows, so the work arrives looking complete and turns out to be subtly broken. A model that surfaces its own doubt, rather than papering over it, is easier to supervise, because a person knows where to look.

That is the threshold Anthropic is aiming at, and it frames Opus 4.8 less as a smarter system than as a steadier one. The company describes sharper judgment on agentic tasks, where the model is acting in steps rather than answering a single question, and a stronger habit of saying when it cannot be sure. For everyday use that is a more useful gain than a fractional improvement on a public test, because most of the cost of working with an AI is the time spent checking whether it quietly got something wrong.

The clearest evidence is in coding. Anthropic reports that Opus 4.8 lets far fewer flaws in the code it writes slip past without comment, the kind of silent bug that surfaces in production rather than in review. The investment firm Bridgewater Associates, an early tester, said the model pointed out problems with both the inputs and the outputs of an analysis on its own, something it found other systems routinely missed. In knowledge work and finance, the dangerous error is precisely the one nobody catches in time, and a tool that raises its hand changes how much a person has to double-check.

The benchmark numbers back the framing without being the story. Opus 4.8 reportedly scored 69.2 percent on SWE-Bench Pro, a test built from real software-engineering tasks, placing it ahead of OpenAI’s GPT-5.5 and Google’s Gemini 3.1 Pro. On Anthropic’s own measures it beats every previous Opus model on a coding benchmark at each effort setting and set the company’s highest recorded result on a legal-reasoning test. The leads are real but narrow, and benchmark wins have a poor record of predicting how a model behaves once it is doing unglamorous work all day.

There are new tools to go with the model. A research-preview feature in Claude Code, called dynamic workflows, lets Opus plan a large job and then run hundreds of parallel subagents in a single session, aimed at migrations that span hundreds of thousands of lines of code and using a project’s existing test suite as the bar for success. Separately, a new control in Claude.ai and the company’s Cowork environment lets users dial how much effort, and how many tokens, the model spends on a given reply, trading speed and cost against depth.

The caveats sit close to the claims. The reliability gains rest largely on Anthropic’s own testing, and a figure like four times less likely is an internal measurement rather than an independently audited one. Honesty is also hard to verify from the outside, because a model can announce its uncertainty and still be wrong, or raise a flag on the wrong thing entirely. The dynamic workflows feature arrives only as a preview rather than a finished product, and the speed story is less generous than it sounds, since the faster mode costs double the standard rate and is called cheaper only against earlier premium pricing.

For users weighing the cost, standard access holds at five dollars per million input tokens and twenty-five per million output, the same as the previous Opus. The faster mode runs at roughly two and a half times the speed for ten and fifty dollars per million, which makes the new effort control as much a budgeting tool as a quality dial.
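
To make the budgeting point concrete, here is a minimal sketch in Python that applies the per-million-token prices quoted above to a single job. The workload of two million input tokens and half a million output tokens is a made-up figure chosen for illustration, and the function and variable names are hypothetical rather than part of any Anthropic tool.

```python
# Prices in US dollars per million tokens, as quoted in the article.
PRICES = {
    "standard": {"input": 5.00, "output": 25.00},
    "fast": {"input": 10.00, "output": 50.00},
}


def job_cost(input_tokens, output_tokens, mode="standard"):
    """Return the dollar cost of one job at the quoted rates."""
    rate = PRICES[mode]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


# Hypothetical workload: 2 million input tokens, 500,000 output tokens.
standard = job_cost(2_000_000, 500_000)
fast = job_cost(2_000_000, 500_000, mode="fast")

print(f"standard: ${standard:.2f}")  # standard: $22.50
print(f"fast:     ${fast:.2f}")      # fast:     $45.00
```

On these assumed numbers the faster mode doubles the bill for the same tokens, so the saving a user can actually control comes from how many tokens the model is allowed to spend.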

Claude Opus 4.8 is available now through Anthropic’s developer API under the name claude-opus-4-8, and the company says it is rolling out everywhere on the same day. It arrived on Thursday, roughly six weeks after Opus 4.7, an unusually short gap that followed a muted reception for that version and a run of competing launches from OpenAI and Google. The real test is whether a model trained to doubt itself proves more useful in daily work than one trained to shine on a leaderboard, and that verdict will come from the agents people actually let run.
