AI Research News - The AI Desk

RESEARCH

Anthropic ships Claude 4.7 with a one-million-token context window for Sonnet and Opus

A million tokens of context across the Sonnet and Opus tiers, agentic-task improvements, and the first publicly available frontier model that can hold an entire mid-sized codebase in working memory.

16 APR 2026 CNBC 7 comments · 12 likes

RESEARCH

Anthropic releases Claude 4.5 Sonnet and the working frontier shifts again

An incremental version bump in the naming, a non-incremental capability shift in the substance. Claude 4.5 Sonnet's coding-benchmark numbers extended Anthropic's lead on agentic software work.

29 SEPT 2025 Anthropic 7 comments · 13 likes

RESEARCH

Anthropic ships Claude 4 Opus and Sonnet, and the coding-benchmark frontier moves a non-trivial amount

A two-model release, headline numbers on SWE-bench Verified above seventy per cent, and a Claude Code product that turned the model into a developer-tool primitive.

22 MAY 2025 Anthropic 7 comments · 30 likes

RESEARCH

Anthropic introduces Claude 3.7 Sonnet and reasoning stops being a separate product tier

Hybrid thinking is the design choice that mattered. A single setting lets the model extend its reasoning at inference time, and the field's product structure quietly resets.

26 FEB 2025 Anthropic 6 comments · 27 likes

RESEARCH

OpenAI's o3 announcement clears the ARC-AGI benchmark and the field's headline metric is briefly retired

A model not yet shipping, a benchmark designed to be hard, and a per-query inference cost that became the most-discussed number in AI research that month.

20 DEC 2024 OpenAI 7 comments · 34 likes

RESEARCH

OpenAI launches o1 and reasoning becomes a paid product category

A new model family with chain-of-thought reasoning baked in at inference time, priced as a separate tier, and benchmarked on tests no large language model had previously cleared.

12 SEPT 2024 OpenAI 7 comments · 9 likes

RESEARCH

Meta releases Llama 3.1 405B and the open-weight frontier sits, briefly, at parity

An almost half-trillion-parameter model with downloadable weights, benchmark numbers competitive with GPT-4o and Claude 3.5, and a 92-page paper that left almost nothing about the recipe to the reader's imagination.

23 JUL 2024 Meta AI 7 comments · 15 likes

RESEARCH

Meta releases Llama 3 and the open-weight thesis gets its strongest evidence yet

Two model sizes, a permissive licence, and benchmarks competitive with closed-API leaders from a year earlier. Meta's bet on open weights stops looking eccentric.

18 APR 2024 Meta 7 comments · 9 likes

RESEARCH

Anthropic ships the Claude 3 family and reclaims a credible frontier position

Three models, one set of benchmarks, and the first sustained challenge to OpenAI's perceived capability lead since GPT-4. The most consequential model in the line-up turned out not to be the headline one.

4 MAR 2024 Anthropic 7 comments · 11 likes

RESEARCH

GPT-4 arrives with a paper that says less than its capabilities suggest

OpenAI's flagship model passed a simulated bar exam with a score around the top ten per cent of test takers, scored near the 88th percentile on the LSAT and accepted images as input. The accompanying technical report declined to say how big it was, what it was trained on, or how the safety work was done.

14 MAR 2023 OpenAI 10 comments · 13 likes
