
It scores 80.9% on SWE-bench Verified, currently the preferred coding benchmark. Gemini 3 scores 76.2% on the same benchmark, and GPT-5.1-Codex-Max 77.9%.
Short-lived tenure for Gemini
That means Gemini 3’s tenure at the top was short-lived: it was outperformed first by Codex-Max and now by Opus, all within a single week.
Anthropic also put Opus 4.5 through the “notoriously difficult” take-home exam it gives its engineering candidates, and for the first time a model scored higher than any human candidate ever has.

76% more efficient
Opus 4.5 is also much more efficient, using 76% fewer tokens than Sonnet 4.5 at medium effort and 48% fewer at the highest effort level.
There are also some updates to Claude Code, starting with availability in the desktop app. The agent also now asks clarifying questions before building a plan.md file and executing.
It also automatically summarizes earlier context, enabling longer conversations.
API pricing is set at $5 per million input tokens and $25 per million output tokens, cheaper than Opus 4.1 but pricier than GPT-5.1 or Gemini Pro.
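To make those rates concrete, here is a minimal sketch of a per-call cost estimate at the listed prices; the function name and the example token counts are hypothetical, chosen only for illustration.

```python
# Listed Opus 4.5 API rates: $5 per million input tokens,
# $25 per million output tokens.
INPUT_RATE = 5.00 / 1_000_000    # USD per input token
OUTPUT_RATE = 25.00 / 1_000_000  # USD per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single API call."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical call: 10k input tokens, 2k output tokens.
print(f"${estimate_cost(10_000, 2_000):.2f}")
```

At these rates, output tokens dominate the bill for generation-heavy workloads, which is why the token-efficiency gains above matter for cost as well as latency.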
Read more: Anthropic’s launch page, writeup at Ars Technica