
Anthropic ships Claude 4 Opus and Sonnet, and the coding-benchmark frontier moves a non-trivial amount

A two-model release, headline numbers on SWE-bench Verified above seventy per cent, and a Claude Code product that turned the model into a developer-tool primitive.

THURSDAY, 22 MAY 2025 By The AI Desk


On 22 May 2025, Anthropic released Claude 4 in two sizes, Opus and Sonnet. Both models shipped with the hybrid thinking mode inherited from Claude 3.7. The headline benchmark was SWE-bench Verified, the agentic-coding evaluation that had become the field's most-watched coding benchmark. Claude 4 Opus scored 72.5 per cent. Claude 4 Sonnet scored 72.7 per cent. The previous Anthropic flagship, Claude 3.7 Sonnet, had topped out at 70.3 per cent, a figure Anthropic reported with a custom scaffold.

The more consequential commercial announcement, according to coverage in The Information and Wired, was the simultaneous launch of Claude Code, an Anthropic-built, terminal-native developer agent that ran Claude 4 against a developer's local repository. Claude Code shipped with file-system tools, a sandboxed shell, and a Git integration. At twenty US dollars per developer per month for the Pro tier and one hundred for the Max tier, its pricing undercut several of the agentic-coding tools that had been built on Anthropic's API.

Coding benchmarks, again

[Chart: SWE-bench Verified, May 2025. Real-world software engineering, per cent solved. Public reporting at launch. SWE-bench Verified is a human-validated subset of SWE-bench.]

The downstream effect on developer tooling was visible within weeks. Cursor, Continue, Aider, and the long tail of agentic-coding tools all updated their model defaults to Claude 4 within thirty days. Microsoft's GitHub Copilot business added Claude as a selectable model the following month. By August, internal usage data that Anthropic reported in commentary around its Series F suggested that more than half of code-generation API calls across the company's enterprise base came from agentic-coding tools rather than chat interfaces.

Two-thirds of SWE-bench. Three years ago that was a research-paper headline. Now it is a Tuesday.

The Sonnet-Opus pricing relationship had also shifted. Sonnet at three dollars per million input tokens was, on most coding benchmarks, within a percentage point of Opus at fifteen dollars. The price-performance argument now ran almost entirely in favour of Sonnet, and Opus became, by mid-2025, primarily a long-horizon-research model rather than a working tool.
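The arithmetic behind that price-performance argument can be sketched in a few lines of Python. The prices and scores are the ones quoted in this article; the fifty-million-token workload, the `ModelQuote` class, and the `input_cost_usd` helper are hypothetical names chosen for illustration, and the sketch ignores output-token pricing, which the article does not quote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelQuote:
    """Figures quoted in the article for one model."""
    name: str
    input_usd_per_mtok: float      # US dollars per million input tokens
    swe_bench_verified_pct: float  # headline SWE-bench Verified score


def input_cost_usd(model: ModelQuote, input_tokens: int) -> float:
    """Input-side cost only; output tokens are deliberately left out."""
    return model.input_usd_per_mtok * input_tokens / 1_000_000


sonnet = ModelQuote("Claude 4 Sonnet", 3.0, 72.7)
opus = ModelQuote("Claude 4 Opus", 15.0, 72.5)

# Hypothetical monthly workload: 50 million input tokens.
workload_tokens = 50_000_000

sonnet_cost = input_cost_usd(sonnet, workload_tokens)
opus_cost = input_cost_usd(opus, workload_tokens)

# How many times more Opus costs per input token than Sonnet.
price_ratio = opus.input_usd_per_mtok / sonnet.input_usd_per_mtok

# Opus score minus Sonnet score, in percentage points.
score_gap = round(opus.swe_bench_verified_pct - sonnet.swe_bench_verified_pct, 1)

print(f"{sonnet.name}: ${sonnet_cost:,.2f}")
print(f"{opus.name}: ${opus_cost:,.2f}")
print(f"Price ratio: {price_ratio:.1f}x, score gap: {score_gap:+.1f} points")
```

On these figures the input bill differs by a factor of five while the headline scores sit a fraction of a point apart, which is the gap the paragraph above describes.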

Originally reported by Anthropic on 22 May 2025.
