The AI Desk
TUESDAY, 22 OCTOBER 2024 From the desk of Amit Singhal Vol. I · The ChatGPT Era

Anthropic ships Computer Use and agentic AI moves from demo videos to a feature you can invoke

Claude can now move a mouse and click a button. The reliability is not where it needs to be. The threshold for 'real' was crossed regardless.

On 22 October 2024, Anthropic announced Computer Use, a beta feature that allowed Claude 3.5 Sonnet (new) to control a computer interactively, taking screenshots, moving the mouse cursor, clicking buttons and typing text into fields. The feature shipped in Anthropic's API rather than in Claude.ai, and required developers to provide their own virtual machine sandboxing.

The launch demonstration showed Claude conducting a multi-step research task across several websites, filling in a form on a third site with the gathered information, and reporting back. The whole task took several minutes. As The Verge and Wired noted in coverage that day, the demo was not the point. The point was that the agentic loop (take a screenshot, decide on an action, execute it, repeat) had moved from research papers to a feature with a price per screenshot.
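The loop described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Anthropic's API: `FakeDesktop`, `Action`, `run_agent` and `scripted_policy` are hypothetical names invented here, the "screenshot" is a text stand-in for pixels, and the scripted policy stands in for the model call that would choose each action.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    kind: str                      # "click", "type", or "done" (hypothetical action set)
    payload: Optional[str] = None  # e.g. a click target or the text to type


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for a sandboxed VM; it only records what it is told to do."""
    log: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real sandbox would return image data; a string keeps the sketch runnable.
        return f"screen after {len(self.log)} actions"

    def execute(self, action: Action) -> None:
        self.log.append((action.kind, action.payload))


def run_agent(desktop: FakeDesktop,
              decide: Callable[[str, int], Action],
              max_steps: int = 10) -> int:
    """Take screenshot, decide action, execute action, repeat.

    Stops when the policy returns "done" or the step budget runs out.
    Returns the number of actions executed.
    """
    for step in range(max_steps):
        observation = desktop.screenshot()
        action = decide(observation, step)
        if action.kind == "done":
            return step
        desktop.execute(action)
    return max_steps


def scripted_policy(observation: str, step: int) -> Action:
    # Assumption: a fixed plan replaces the model, which would inspect the screenshot.
    plan = [Action("click", "search box"), Action("type", "quarterly report")]
    return plan[step] if step < len(plan) else Action("done")


desktop = FakeDesktop()
steps_taken = run_agent(desktop, scripted_policy)
```

The `max_steps` budget is a design choice rather than a detail from the announcement: with one screenshot per iteration and a price attached to each, an unbounded loop on an error-prone agent would be an unbounded bill.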

What worked and what did not

Reliability was uneven. Anthropic's own announcement post acknowledged that Computer Use was "experimental" and "at times error-prone". Independent reviewers in the days following found the success rate on novel desktop tasks to be in the thirty-to-fifty per cent range, rising substantially when the task was decomposed into smaller chunks. The OSWorld benchmark, a standard evaluation for desktop agents released earlier in 2024, scored Claude 3.5 Sonnet (new) at 14.9 per cent unaided, well above prior models but well below the human baseline.

OSWorld benchmark, October 2024
Pass rate on desktop agent tasks
Human baseline: 72.4 %
Claude 3.5 (new) + tools: 22 %
Claude 3.5 (new): 14.9 %
GPT-4o: 5 %
Llama 3.1 405B: 1.6 %
Reported scores; substantial variance across benchmark variants.

Despite the reliability gap, deployment was immediate. By the end of November 2024, several enterprise platforms, including Replit, Asana and a long tail of automation startups, had announced integrations using Computer Use as a primitive. As Bloomberg reported that quarter, Anthropic's internal projections of Computer Use revenue exceeded the company's forecasts by a factor of three within the first ninety days.

Computer Use crossed the threshold from possible to invocable. The reliability gap was real, but it was a reliability gap, not an existence gap.

Computer Use is, in retrospect, the moment the agentic-AI category moved from a research demo into a paid feature. By 2026 it has become a standard capability, with reliability climbing roughly twenty percentage points each quarter on the public benchmarks. The OSWorld scores at the time of writing sit comfortably in the high-fifties.

Originally reported by Anthropic on 22 October 2024.