The AI Desk
TUESDAY, 14 MARCH 2023 From the desk of Amit Singhal Vol. I · The ChatGPT Era

GPT-4 arrives with a paper that says less than its capabilities suggest

OpenAI's flagship model passed the bar exam, scored near the 88th percentile on the LSAT and could take images as input. The accompanying technical report declined to say how big the model was, what it was trained on, or how the safety work was done.

On 14 March 2023, fifteen weeks after ChatGPT's debut, OpenAI released GPT-4. The launch came alongside a 98-page technical report that was conspicuous for what it did not contain: parameter counts, training data composition, training compute and detailed evaluations beyond the marquee benchmarks. The document stated explicitly that, given the competitive landscape and the safety implications of large-scale models, it would contain "no further details" about architecture, hardware, training method or dataset construction.

The capability leap, by contrast, was substantial. As reported by Wired and Ars Technica on the day, GPT-4 cleared the Uniform Bar Examination at roughly the 90th percentile of test-takers, scored at about the 88th percentile on the LSAT and the 80th on the GRE Quantitative section, and could accept images as input alongside text. Image input was not made publicly available at launch, but the partner Be My Eyes received early access for an accessibility tool.

A closed paper from an open lab

The decision to publish so little drew immediate criticism. Researchers at Stanford's HAI and Cornell's Sasha Rush, quoted in The Verge, called it the moment OpenAI stopped being a research organisation. Ben Schmidt, an open-source AI advocate, posted on Twitter that the report had "basically no information you could use to reproduce or critique the work." OpenAI's chief scientist Ilya Sutskever, in an interview with The Verge published the same week, said openly that the original open philosophy had been a mistake and that the company expected the field to converge on closed releases.

The era of papers that could be checked by other researchers ended with a 98-page document that asked us to trust the headline numbers.

What was actually shipped

Beyond the benchmarks, GPT-4 was integrated immediately into ChatGPT Plus and into Microsoft's Bing Chat. Within forty-eight hours, the model was available to enterprises through Azure OpenAI Service. Stripe and Duolingo were named as launch partners. Usage on day one was capped at twenty-five messages every three hours per Plus subscriber, an early acknowledgement that demand was running well ahead of inference capacity.

The model's most consequential property turned out not to be the benchmark scores. It was reliability. Engineers building on GPT-3.5 had spent the previous quarter writing what one team at Stripe described to the Financial Times as "emotional support code" to handle the model's failure modes. GPT-4 followed instructions, refused malformed requests with explanations and held context across longer documents. Production teams that had been pretending to ship AI features began, quietly, to actually ship them.

Originally reported by OpenAI (OpenAI Research) on 14 March 2023.