Research AI news
AI research news, model releases, papers, and the capability shifts worth tracking.

Anthropic ships Claude 4.7 with a one-million-token context window for Sonnet and Opus
A million tokens of context across the Sonnet and Opus tiers, agentic-task improvements, and the first publicly available frontier model that can hold an entire mid-sized codebase in working memory.

Anthropic releases Claude 4.5 Sonnet and the working frontier shifts again
An incremental version bump in the naming, a non-incremental capability shift in the substance. Claude 4.5 Sonnet's coding-benchmark numbers extended Anthropic's lead on agentic software work.

Anthropic ships Claude 4 Opus and Sonnet, and the coding-benchmark frontier moves a non-trivial amount
A two-model release, headline numbers on SWE-bench Verified above seventy per cent, and a Claude Code product that turned the model into a developer-tool primitive.

Anthropic introduces Claude 3.7 Sonnet and reasoning stops being a separate product tier
Hybrid thinking is the design choice that mattered. The model can extend its reasoning at inference time when a setting tells it to, and the field's product structure quietly resets.

OpenAI's o3 announcement clears the ARC-AGI benchmark and the field's headline metric is briefly retired
A model not yet shipping, a benchmark designed to be hard, and a per-query inference cost that became the most-discussed number in AI research that month.

OpenAI launches o1 and reasoning becomes a paid product category
A new model family with chain-of-thought reasoning baked in at inference time, priced as a separate tier, and benchmarked on tests no large language model had previously cleared.

Meta releases Llama 3.1 405B and the open-weight frontier sits, briefly, at parity
A 405-billion-parameter model with downloadable weights, benchmark numbers competitive with GPT-4o and Claude 3.5, and a 92-page paper that left almost nothing about the recipe to the reader's imagination.

Meta releases Llama 3 and the open-weight thesis gets its strongest evidence yet
Two model sizes, a permissive licence, and benchmarks competitive with closed-API leaders from a year earlier. Meta's bet on open weights stops looking eccentric.

Anthropic ships the Claude 3 family and reclaims a credible frontier position
Three models, one set of benchmarks, and the first sustained challenge to OpenAI's perceived capability lead since GPT-4. The most consequential model in the line-up turned out not to be the headline one.

GPT-4 arrives with a paper that says less than its capabilities suggest
OpenAI's flagship model scored in the top ten per cent on a simulated bar exam, landed near the 88th percentile on the LSAT and looked at images. The accompanying technical report declined to say how big it was, what it was trained on, or how the safety work was done.