Back to Blog
Anthropic

Claude Opus 4.8 Released: SWE-Bench Pro 69.2%, 10.6-Point Lead Over GPT-5.5 — Anthropic's New Standard for Agent Quality

Claude Opus 4.8

On May 28, 2026, Anthropic officially released Claude Opus 4.8, the latest iteration of its flagship model. Arriving just six weeks after Opus 4.7 at an unchanged price point, it outperforms both OpenAI's GPT-5.5 and Google's Gemini 3.1 Pro across agentic coding, reasoning, and computer-use benchmarks.

The Numbers That Matter

Agentic Coding: SWE-Bench Pro

ModelScore
Claude Opus 4.869.2%
Claude Opus 4.764.3%
GPT-5.558.6%
Gemini 3.1 Pro54.2%

A 4.9-point improvement over the previous model and a 10.6-point lead over GPT-5.5. SWE-Bench Pro measures real software engineering tasks — not code generation in isolation, but the ability to autonomously solve engineering problems as an agent.

Multidisciplinary Reasoning: Humanity's Last Exam

ModelWithout ToolsWith Tools
Claude Opus 4.849.8%57.9%
GPT-5.552.2%
Gemini 3.1 Pro51.4%

A 5.7-point advantage over GPT-5.5 with tools enabled on this cross-disciplinary benchmark spanning mathematics, science, and humanities.

Computer Use: OSWorld-Verified

ModelScore
Claude Opus 4.883.4%
Claude Opus 4.782.3%
GPT-5.578.7%
Gemini 3.1 Pro76.2%

The browser agent component scored 84% on Online-Mind2Web, surpassing both Opus 4.7 and GPT-5.5.

Knowledge Work: GDPval-AA

ModelScore
Claude Opus 4.81890
GPT-5.51769
Claude Opus 4.71753

The One Reversal: Terminal-Bench 2.1

ModelScore
GPT-5.578.2%
Claude Opus 4.874.6%
Gemini 3.1 Pro70.3%
Claude Opus 4.766.1%

GPT-5.5 leads by 3.6 points in terminal coding. However, Opus 4.8 improved 8.5 points over 4.7, and the gap is narrowing.

Benchmark Comparison

4x More Accurate: The Code Quality Revolution

Beyond benchmark numbers, the qualitative shift in code quality is striking. Anthropic reports that Opus 4.8 is "approximately four times less likely than Opus 4.7 to let flaws in generated code pass without comment."

This represents a leap in self-verification capability:

  • Significantly increased frequency of flagging bugs in its own code
  • Stronger tendency to challenge problematic plans
  • Improved ability to recognize and communicate uncertainty
  • Reduced frequency of making unsupported claims

Cursor CEO Michael Truell noted: "Opus 4.8 exceeds prior Opus models across every effort level on CursorBench. Tool calling is meaningfully more efficient, using fewer steps for the same intelligence."

Cognition CEO Scott Wu added: "It fixes the comment-verbosity and tool-calling issues from Opus 4.7. Tool use is cleaner."

Pricing and Fast Mode

ModelInputOutput
Claude Opus 4.8 (Standard)$5/MTok$25/MTok
Claude Opus 4.8 (Fast Mode)$10/MTok$50/MTok
Claude Opus 4.7$5/MTok$25/MTok

Standard pricing is identical to Opus 4.7. Fast Mode delivers 2.5x the speed at one-third the cost of previous Fast Mode pricing. Switching is instant via the /fast command in Claude Code.

Databricks CTO Hanlin Tang reported real-world cost reductions of 61% compared to Opus 4.7 — not just from Fast Mode pricing, but from improved agent efficiency reducing total token consumption.

Features Overview

Dynamic Workflows: Hundreds of Parallel Subagents

Launching alongside Opus 4.8, Dynamic Workflows is a new Claude Code feature available as a research preview for Enterprise, Team, and Max plans.

The feature enables Claude to plan work and execute hundreds of parallel subagents within a single session. Key capabilities:

  • Codebase-scale migrations: End-to-end changes across hundreds of thousands of lines of code, from planning to merge
  • Automatic output verification: Each subagent's output is verified before reporting
  • Test suite as quality bar: Existing test suites serve as the quality standard

Shopify Staff Engineer Tom Pritchard observed: "It has noticeably better judgment. It builds up confidence around complex, multi-service explorations."

Messages API Update

A significant change for developers: system entries can now be placed inside the messages array. This allows updating Claude's instructions mid-task without breaking the prompt cache or routing through a user turn — enabling dynamic permission changes, token budget adjustments, and environment context updates during long-running agent tasks.

Alignment Improvements

ModelMisalignment Score
Claude Opus 4.8~1.83
Claude Opus 4.72.47

Lower is better. Opus 4.8's score of 1.83 matches Mythos Preview, Anthropic's most aligned model. This composite score measures supportive behavior, respect for user autonomy, and prosocial traits — critical for agents operating without constant human oversight.

Ecosystem Validation

CompanyAssessment
DatabricksStep change in agentic reasoning. 61% token cost reduction on Genie agent
Thomson ReutersImproved consistency and reasoning in CoCounsel Legal workflows
HebbiaBetter citation precision and token efficiency on dense financial filings
CursorExceeds all prior Opus models at every effort level on CursorBench
ShopifyBetter judgment in complex multi-service explorations

Effort Control and What's Next

Claude.ai and Cowork now feature an "Effort Control" slider alongside the model selector. Higher effort means deeper reasoning and better responses; lower effort means faster responses with less rate limit consumption. Available on all plans.

Anthropic also disclosed details about its next-generation model, Claude Mythos (Project Glasswing). This model class exceeds Opus-level intelligence and is currently in limited testing with Amazon, Microsoft, and Apple for cybersecurity applications. General availability is expected "in the coming weeks" after meeting stronger safety benchmarks.

Conclusion

Claude Opus 4.8 is Anthropic's answer to the question: "Can we trust this agent?" SWE-Bench Pro 69.2%, OSWorld-Verified 83.4%, Humanity's Last Exam 57.9% — all surpassing GPT-5.5 with substantial improvement margins.

Yet GPT-5.5's lead on Terminal-Bench 2.1 shows that dominance in specific domains remains contested. In this rapidly evolving competitive landscape, Opus 4.8 is the strongest contender as of May 2026, but that position is not permanent.

The true differentiator lies in the ability to orchestrate hundreds of subagents (Dynamic Workflows) and the alignment score of 1.83 (safe, reliable autonomous behavior). From a model that executes individual tasks accurately to an agent that autonomously drives complex projects — Opus 4.8 marks a clear inflection point.

Agent Evolution


Comments (0)

Share:XHatena

Post a Comment

Loading...