Anthropic2026-07-13

Anthropic Releases Claude Opus 4.8: Same Pricing, Modest Gains in Coding & Agentic Tasks

On May 28, 2026, Anthropic launched a new version of its flagship model, Claude Opus 4.8. It’s a measured but clear-cut iteration, offering small, broad improvements over Opus 4.7 in coding, agentic tasks, reasoning, and knowledge work benchmarks—while keeping pricing unchanged. Anthropic framed it directly as “a modest but tangible improvement” and heavily emphasized “honesty” as the standout upgrade.

Notably, the update arrived just 41 days after Opus 4.7, making it the fastest iteration in the Opus series so far. The model ID is claude-opus-4-8 and it supports a context of 1 million input tokens and 128,000 output tokens.

Benchmark Breakdown: Clear Coding Gains, but Trails GPT-5.5 in Terminal Tasks

Looking at the benchmarks Anthropic shared in its System Card, the gains are uneven but meaningful in key areas.

In coding, the most significant improvement appears on the harder SWE-bench Pro: Opus 4.8 scores 69.2%, up 4.9 points from Opus 4.7’s 64.3%. By comparison, the more saturated SWE-bench Verified inched from 87.6% to 88.6%, and SWE-bench Multilingual rose from 80.5% to 84.4%. In short, where models were already near the ceiling, gains are minimal; the real progress is on harder, less saturated tasks—a better signal of true capability.

The largest single benchmark jump came in the agentic terminal task Terminal-Bench 2.1, which leapt from 66.1% to 74.6%—an 8.5-point gain. However, Anthropic acknowledges a caveat: even with that improvement, Opus 4.8 still trails GPT-5.5. Under the same public Terminus-2 harness, GPT-5.5 scores 78.2%; using GPT-5.5’s own Codex CLI harness, it reaches 83.4%. The takeaway is clear: if your workflow lives in the terminal or CLI, the top-ranked model overall may not be the best fit for you.

Reasoning benchmarks show a split personality. Most strikingly, USAMO 2026 math proofs surged from 69.3% to 96.7%—a 27.4-point leap in a single version, suggesting a qualitative shift in deep mathematical reasoning. Meanwhile, GPQA Diamond saw a slight dip, dropping from 94.2% to 93.6%, and Humanity’s Last Exam (with tools) improved from 54.7% to 57.9%.

In knowledge work, Opus 4.8 leads with 1890 Elo in Artificial Analysis’s GDPval-AA, a 137-point increase from the prior generation and well above GPT-5.5’s 1769. Computer use (OSWorld-Verified) scored 83.4%, and browser agent performance (Online-Mind2Web) hit 84%. Overall, Opus 4.8 wins 6 out of 7 benchmarks in Anthropic’s comparison, with Terminal-Bench 2.1 as the sole exception.

“Honesty” Is the Headline Feature

While benchmark numbers show gradual progress, Anthropic devoted much of its announcement to the model’s enhanced honesty—specifically, its reduced tendency to make assertions it cannot back up.

A common failure mode in AI models is confidently declaring a task complete or progress made without sufficient evidence. Anthropic says Opus 4.8 is more likely to flag uncertainties in its own work and less prone to presenting unsupported conclusions.

In quantifiable terms: Opus 4.8 is about 4× less likely to let defects slide unannounced in its own code compared to its predecessor. Early testers corroborate this, noting the model proactively points out issues in inputs and outputs—a step other models often leave for users to catch.

For professional workflows like code review, financial analysis, or legal work, this improvement may matter more than any single benchmark score. A model that says “I’m not sure here” is far more reliable in long-cycle, unsupervised agentic workflows than one that scores higher but fails confidently.

Alignment: Misbehavior Rate Approaches Mythos Levels

In pre-launch alignment evaluations, Anthropic’s team concluded Opus 4.8 “sets a new high in measures of prosocial traits, such as supporting user autonomy and acting in users’ best interests.”

More concretely: Opus 4.8’s rate of misaligned behaviors—like deception or aiding abuse—is significantly lower than Opus 4.7’s and now approaches that of Anthropic’s best-aligned model, Claude Mythos Preview. Full alignment assessments and pre-deployment safety tests are included in the Opus 4.8 System Card.

Three Accompanying Feature Updates

Alongside the model, Anthropic launched three features—two of which directly address feedback that Opus 4.7 often “thought” too long.

Effort Control: A new slider in claude.ai and Cowork lets users manually choose how much compute and tokens Claude invests in a task. Opus 4.8 defaults to “high” (similar token usage to Opus 4.7’s default but better results). Users can select “extra” (called xhigh in Claude Code) or “max” for more tokens in exchange for better output. Anthropic recommends “extra” for hard tasks and long async workflows, and has increased rate limits for Claude Code accordingly.
Dynamic Workflows (Research Preview): Available for Claude Code Enterprise, Team, and Max plans, it lets Claude plan a task, run hundreds of sub-agents in parallel within a single session, and self-verify outputs before reporting. Anthropic showcases this by claiming Claude Code with Opus 4.8 can handle repo-wide migrations across hundreds of thousands of lines of code, using existing test suites as acceptance criteria.
Messages API Update: Developers can now insert system entries inside the messages array. This allows mid-task instruction updates—like adjusting permissions, token budgets, or environment context—without breaking prompt caches or simulating a user turn.

Pricing & Availability: Unchanged, with a Cheaper Fast Mode

Opus 4.8 is available today across all platforms. Pricing remains identical to Opus 4.7: $5 per million input tokens, $25 per million output tokens. Developers can access it via the Claude API using the ID claude-opus-4-8.

The change is in Fast mode: It runs at roughly 2.5× the speed and costs $10/$50 per million input/output tokens (2× the regular price). But compared to the Fast mode of previous Claude models, it’s 3× cheaper per token. This aligns with the broader industry trend of lowering inference costs while boosting capability, significantly impacting high-volume, cost-sensitive use cases.

What’s Next: Mythos-Level Models for All Customers Soon

Anthropic also reiterated its next step: launching a new class of models smarter than Opus. As part of Project Glasswing, a few organizations are already using Claude Mythos Preview for cybersecurity work. Because this capability tier requires stronger cybersecurity safeguards for broad release, Anthropic says it is fast-tracking those protections and expects to bring Mythos-level models to all customers “within the coming weeks.”

Note that the most powerful publicly available model is still Opus 4.8; Mythos-level models are not yet generally available, and their actual capabilities and release timing remain uncertain.

Conclusion

Claude Opus 4.8 is a clearly positioned “consolidation” update, not a disruptive leap. It holds pricing steady while nudging up performance in coding (especially the harder SWE-bench Pro), agentic terminal tasks, and knowledge work, with a standout, anomalous jump in math reasoning on USAMO 2026.

But its real differentiation lies not in any single number, but in two “softer” dimensions: honesty—failing more gracefully and openly flagging uncertainty—and alignment—with a misbehavior rate now approaching Mythos levels. For those chasing absolute terminal coding performance, GPT-5.5 still holds an edge in Terminal-Bench 2.1. But for professional workflows that demand reliable, low-risk delegation of real work over long cycles, Opus 4.8’s reliability gains may outweigh the scores.

Given its rapid 41-day arrival after Opus 4.7 and Anthropic’s clear preview of Mythos on the horizon, Opus 4.8 feels like a steady transitional step on the eve of a bigger release.

Comments (0)

Share:X Hatena

Back to Blog