Anthropic ships Claude 4 Opus and Sonnet, and the coding-benchmark frontier moves a non-trivial amount
A two-model release, headline numbers on SWE-bench Verified above seventy per cent, and a Claude Code product that turned the model into a developer-tool primitive.

On 22 May 2025, Anthropic released Claude 4 in two sizes, Opus and Sonnet (officially named Claude Opus 4 and Claude Sonnet 4). Both models shipped with the hybrid thinking mode introduced in Claude 3.7. The headline number came from SWE-bench Verified, the agentic-coding evaluation that had become the field's most-watched benchmark. Claude 4 Opus scored 72.5 per cent. Claude 4 Sonnet scored 72.7 per cent. The previous Anthropic flagship, Claude 3.7 Sonnet, had topped out at 70.3 per cent, a figure Anthropic reported with a custom scaffold; its standard score was 62.3 per cent.
Claude Code, an Anthropic-built terminal-native developer agent that ran Claude 4 against a developer's local repository, had been in research preview since February and reached general availability the same day. According to coverage in The Information and Wired, it was the more consequential commercial announcement. Claude Code shipped with file-system tools, a sandboxed shell, and a Git integration. The pricing, at twenty US dollars per developer per month for the Pro tier and one hundred for the Max tier, undercut several of the agentic-coding tools that had built on the API.
Coding benchmarks, again
The downstream effect on developer tooling was visible within weeks. Cursor, Continue, Aider, and the long tail of agentic coding tools all updated their model defaults to Claude 4 within thirty days. Microsoft's GitHub Copilot, which had offered earlier Claude models since late 2024, made the Claude 4 models selectable as well. By August, usage data that Anthropic cited in commentary around its Series F suggested that more than half of code-generation API calls across the company's enterprise base came from agentic-coding tools rather than chat interfaces.
Nearly three-quarters of SWE-bench Verified. Two years ago that would have been a research-paper headline. Now it is a Tuesday.
The Sonnet-Opus pricing relationship had also shifted. Sonnet at three dollars per million input tokens was, on most coding benchmarks, within a percentage point of Opus at fifteen dollars. The price-performance argument now ran almost entirely in favour of Sonnet, and Opus became, by mid-2025, primarily a long-horizon-research model rather than a working tool.
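The price-performance arithmetic behind that argument can be sketched in a few lines of Python. This is a rough illustration, not a cost model: it uses only the input-token prices and SWE-bench Verified scores quoted in this article, ignores output-token pricing (which the article does not give), and the two-million-token daily workload is a hypothetical figure chosen for the example. The function and key names are likewise made up for illustration.

```python
# Figures as quoted in the article: input-token list prices (USD per million
# tokens) and SWE-bench Verified scores (per cent). Output-token prices are
# not given in the article, so this sketch leaves them out.
MODELS = {
    "Claude 4 Sonnet": {"input_usd_per_mtok": 3.0, "swe_bench_pct": 72.7},
    "Claude 4 Opus": {"input_usd_per_mtok": 15.0, "swe_bench_pct": 72.5},
}


def input_cost_usd(model: str, input_tokens: int) -> float:
    """Input-side cost in USD for the given number of input tokens."""
    return MODELS[model]["input_usd_per_mtok"] * input_tokens / 1_000_000


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model costs per input token."""
    return (
        MODELS[expensive]["input_usd_per_mtok"]
        / MODELS[cheap]["input_usd_per_mtok"]
    )


def score_gap(a: str, b: str) -> float:
    """Difference in SWE-bench Verified score, in percentage points."""
    return round(MODELS[a]["swe_bench_pct"] - MODELS[b]["swe_bench_pct"], 1)


if __name__ == "__main__":
    daily_input_tokens = 2_000_000  # hypothetical workload, for illustration only
    for name in MODELS:
        cost = input_cost_usd(name, daily_input_tokens)
        print(f"{name}: ${cost:.2f} per day on input tokens")
    print(f"Price ratio: {price_ratio('Claude 4 Opus', 'Claude 4 Sonnet'):.0f}x")
    print(f"Score gap: {score_gap('Claude 4 Sonnet', 'Claude 4 Opus')} points")
```

On these figures, Opus costs five times as much per input token for a benchmark score that differs by a fraction of a percentage point, which is the gap the paragraph above describes.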


