OpenAI's o3 announcement clears the ARC-AGI benchmark and the field's headline metric is briefly retired
A model not yet shipping, a benchmark designed to be hard, and a per-query inference cost that became the most-discussed number in AI research that month.

On 20 December 2024, day twelve of OpenAI's Shipmas series, the company announced, rather than shipped, o3, a successor to the o1 reasoning family. The headline result was on ARC-AGI, a deliberately abstract puzzle benchmark designed by François Chollet in 2019 to resist memorisation. Scores had inched upward across the field but had not crossed the human baseline. o3 scored 87.5 per cent on the ARC-AGI semi-private set, against a human-baseline reference of 84 per cent.
Chollet, in a public statement on the day, called the result "genuine progress" and noted that the benchmark would need replacing; ARC-AGI 2 followed in March 2025. As Wired and MIT Technology Review reported in the week of the announcement, the o3 result was the first widely accepted instance of a system clearing a closed-AI benchmark designed to be hard without having been trained on it.
The cost-per-query asterisk
The result came with a notable footnote. The high-compute o3 setting that scored 87.5 per cent had cost, by OpenAI's own disclosure, more than three thousand US dollars per benchmark task. Across the eighty-task semi-private set, that translated to roughly a quarter of a million dollars in inference compute. The low-compute setting, which scored 75.7 per cent, used substantially less compute but still considerably more than commodity inference.
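The arithmetic behind the quarter-of-a-million figure can be checked in a few lines. This is a minimal sketch using only the numbers quoted above; because the per-task cost is stated as "more than" three thousand dollars, the result is a floor rather than an exact total, and the names used here are illustrative.

```python
# Figures as quoted in the article. The per-task cost is a lower bound
# ("more than three thousand US dollars"), so the total is a floor too.
COST_PER_TASK_USD_FLOOR = 3_000
TASK_COUNT = 80


def total_cost_floor(cost_per_task: int, tasks: int) -> int:
    """Lower bound on inference spend for one full run of the task set."""
    return cost_per_task * tasks


total = total_cost_floor(COST_PER_TASK_USD_FLOOR, TASK_COUNT)
print(f"At least ${total:,} across {TASK_COUNT} tasks")
# prints: At least $240,000 across 80 tasks
```

A floor of $240,000 is consistent with "roughly a quarter of a million dollars" once the per-task cost sits somewhat above three thousand.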
The cost figure was the more significant disclosure. As The Information's analysis, published the week of New Year's Eve, made clear, o3 marked the first time a major AI lab had publicly priced inference at multiple thousands of dollars per individual query. The pricing sat above the per-query cost of most professional services billed at hundreds of dollars an hour, the cohort against which AI was, by 2025, increasingly benchmarked commercially.
Eighty-seven per cent on a benchmark designed to be hard. The footnote was more interesting than the number.
o3 became generally available in early 2025 at substantially lower prices than the announcement-time figures, but with a capability tier between the announcement-time low- and high-compute settings. By the second half of 2025 the per-query cost discussion had become routine: enterprise-AI procurement had begun to budget for per-query inference rather than per-month subscription fees. The December 2024 announcement is now read as the moment that conversation moved to the surface.


