Anthropic launches Claude Opus 4.5, with top scores in most benchmarks

Anthropic's new model is state of the art in most benchmarks.
Opus 4.5 beats every human candidate on Anthropic’s onboarding exam for engineers. (Picture: Anthropic)
Anthropic bills it as the "best model in the world for coding, agents, and computer use," and Opus 4.5 is indeed state-of-the-art in software engineering.

It scores 80.9% on SWE-bench Verified, currently the preferred coding benchmark. Gemini 3 scores 76.2% on the same benchmark, and GPT-5.1-Codex-Max registers 77.9%.

Short-lived tenure for Gemini
That means Gemini 3's tenure at the top of the leaderboard was short-lived: it was outperformed first by Codex-Max and now by Opus, all within a single week.

Anthropic also put Opus 4.5 through the "notoriously difficult" take-home exam it gives to engineering candidates, and for the first time a model scored higher than any human candidate ever has.

Anthropic’s chosen benchmarks sure look impressive. (Picture: Anthropic)
The model also posts top scores in vision, reasoning, and mathematics, according to the benchmarks Anthropic published at launch, handily beating the other frontier models in almost every category.

76% more efficient
Opus 4.5 is also far more token-efficient, using 76% fewer tokens than Sonnet 4.5 at medium reasoning effort, and 48% fewer at the highest effort level.

There are also some updates to Claude Code, starting with availability in the desktop app. The agent now asks clarifying questions before building a plan.md file and executing.

It also automatically summarizes earlier context, so longer conversations are now possible.

API pricing is set at $5 per million input tokens and $25 per million output tokens, cheaper than Opus 4.1 but pricier than GPT-5.1 or Gemini Pro.
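To put those rates in concrete terms, here is a minimal sketch of what a single request would cost at the listed prices; the function name and the token counts in the example are hypothetical, chosen purely for illustration.

```python
# Opus 4.5 API rates as listed: $5 per million input tokens,
# $25 per million output tokens.
INPUT_RATE = 5.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 25.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single API call (illustrative helper)."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 10,000-token prompt that yields a 2,000-token reply:
print(round(request_cost(10_000, 2_000), 4))  # → 0.1
```

At these rates, output tokens dominate cost quickly: the 2,000-token reply above costs as much as the 10,000-token prompt.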

Read more: Anthropic’s launch page, writeup at Ars Technica