Anthropic2026-05-23

Claude Opus 4.7 Review: Anthropic's Most Powerful AI Model Dominates Agentic Coding but Faces Backlash Over Hidden Cost Increases

Introduction: Anthropic's Boldest Release Yet

On April 16, 2026, Anthropic launched Claude Opus 4.7 — its most capable generally available model to date. Succeeding Opus 4.6 after just a two-month development cycle, this release represents a significant leap forward in agentic coding, multi-tool orchestration, and visual reasoning. Built on the new Mythos architecture, Opus 4.7 introduces several headline features: adaptive thinking, a new "xhigh" effort level, task budgets for agentic loops, and self-verification capabilities that let the model catch its own logic errors before returning results.

But the launch hasn't been without controversy. A new tokenizer that inflates token counts by up to 35%, collapsing long-context retrieval scores, and over-aggressive safety guardrails have drawn sharp criticism from developers, writers, and security professionals alike. In this comprehensive review, we'll break down what Opus 4.7 does well, where it falls short, how it compares to GPT-5.4 and Gemini 3.1 Pro, and whether it's worth the upgrade for your specific use case.

What's New in Claude Opus 4.7

Adaptive Thinking Replaces Extended Thinking

Anthropic has fully removed the extended thinking feature that allowed developers to allocate a visible reasoning budget via budget_tokens. In its place, Opus 4.7 introduces adaptive thinking — a system that dynamically allocates reasoning effort based on task complexity. The old temperature, top_p, and top_k parameters at non-default values now return 400 errors, and thinking content is omitted from responses by default. Developers must opt in with display: "summarized" to see any reasoning traces.

This change has been one of the most divisive aspects of the release. Developers who relied on visible chain-of-thought for debugging and prompt engineering workflows found themselves scrambling to adapt.

New 'xhigh' Effort Level

Opus 4.7 introduces an "xhigh" effort tier sitting between "high" and "max." On Humanity's Last Exam, the hardest reasoning benchmark available, xhigh (55.4%) actually outperforms max (54.7%), suggesting diminishing returns beyond xhigh for many tasks. Claude Code defaults to xhigh across all plans, giving teams a practical lever to balance capability and cost.

Task Budgets for Agentic Loops

Now in public beta, task budgets allow developers to set advisory token targets across multi-step agentic loops — a minimum of 20,000 tokens per task, enabled via the task-budgets-2026-03-13 beta header. This is a critical feature for production agent deployments where runaway token consumption can silently destroy margins.

Self-Verification

Perhaps the most impactful capability for production use: Opus 4.7 can identify its own logic errors — off-by-one bugs, race conditions, incorrect assumptions — before reporting results. This isn't just a benchmark trick; early enterprise adopters report that it translates directly into fewer retries and higher first-attempt correctness.

Project Glasswing and Cybersecurity Safeguards

Opus 4.7 is the first model to ship under Anthropic's new Project Glasswing framework, which deliberately reduces the model's offensive cybersecurity capabilities during training. In an unprecedented move, Anthropic also publicly acknowledged that a more powerful internal model — Claude Mythos Preview — exists but is being withheld from broad release while safeguards are tested. This is the first time a major AI lab has openly stated it deliberately nerfed a model's capabilities for safety reasons.

Benchmark Performance: Where Opus 4.7 Excels

Claude Opus 4.7 scores 57 out of 100 on the Artificial Analysis Intelligence Index (up from 53 for Opus 4.6), placing it in a statistical three-way tie at the top alongside GPT-5.4 (56.8) and Gemini 3.1 Pro (57.2) within the benchmark's ±1 point confidence interval. But the aggregate number masks enormous variance across individual benchmarks.

Agentic Coding: The Headline Strength

The single biggest improvement in Opus 4.7 is in software engineering benchmarks:

SWE-bench Verified: 87.6% (up from 80.8% for Opus 4.6)
SWE-bench Pro: 64.3% (up from 53.4%) — the highest score among all generally available models and a +10.9 point improvement
Terminal-Bench 2.0: 69.4% (up from 65.4%)

On SWE-bench Pro, the hardest software engineering benchmark available, Opus 4.7 leads GPT-5.4 (57.7%) by 6.6 points and Gemini 3.1 Pro (54.2%) by 10.1 points. This is the model's clear competitive moat.

Multi-Tool Orchestration

Opus 4.7 dominates MCP-Atlas, the benchmark for multi-turn tool orchestration, scoring 77.3% versus GPT-5.4's 68.1% and Gemini 3.1 Pro's 73.9%. That 9-point gap over GPT-5.4 is meaningful for anyone building agent workflows that chain multiple API calls together.

Agentic Knowledge Work

On GDPval-AA, which measures general agentic knowledge work, Opus 4.7 scores 1,753 Elo — a commanding 79-point lead over the next competitors (Sonnet 4.6 and GPT-5.4, both around 1,674). This is where Anthropic's "real-world agent" positioning translates into hard numbers.

Computer Use and Vision

Opus 4.7 brings a 3x increase in image resolution (to 3.75 megapixels) and a massive 13-point jump on CharXiv (chart reasoning): 82.1% versus 69.1% for Opus 4.6. OSWorld-Verified (GUI interaction) improved from 72.7% to 78.0%, leading GPT-5.4's 75.0%. XBOW reported their visual-acuity benchmark leaping from 54.5% to 98.5%.

Reasoning

GPQA Diamond: 94.2% — effectively tied with GPT-5.4 (94.4%) and Gemini 3.1 Pro (94.3%)
Humanity's Last Exam (no tools, max effort): 54.7% — a strong showing, though this benchmark has limited competitive data
Hallucination resistance (AA-Omniscience Index): Score of 26 (up from 14 for Opus 4.6), with hallucination rate dropping from 61% to 36%

Token Efficiency

Despite scoring 4 points higher on the Artificial Analysis Intelligence Index, Opus 4.7 used approximately 35% fewer output tokens than Opus 4.6 — the self-verification and adaptive thinking features appear to produce more concise, accurate outputs.

Critical Regressions: Where Opus 4.7 Falls Short

Long-Context Retrieval Collapse

The most alarming regression: MRCR v2 8-needle retrieval at 1M tokens dropped from 78.3% to 32.2% — a devastating 46-point collapse. At 256K tokens, the drop is from 91.9% to 59.2%. Anthropic has indicated they are phasing out MRCR as a benchmark, but the practical implication is clear: if your workflow requires precise retrieval of specific facts from very long documents, Opus 4.6 remains the safer choice.

Web Research

BrowseComp dropped from 83.7% to 79.3%, leaving Opus 4.7 trailing GPT-5.4 (89.3%) by 10 points and Gemini 3.1 Pro (85.9%) by 6.6 points. For agentic browsing and web research tasks, Opus 4.7 is not the best available model.

Terminal-Heavy Coding

While SWE-bench scores are stellar, Terminal-Bench 2.0 shows Opus 4.7 at 69.4% versus GPT-5.4's 75.1% — a meaningful gap for developers who work in terminal-heavy environments.

Head-to-Head Comparisons

Claude Opus 4.7 vs. GPT-5.4

Both models share a 1M token context window and 128K max output. The key differences:

Pricing: Opus 4.7 costs $5/$25 per million input/output tokens — exactly 2x and 1.67x more expensive than GPT-5.4 ($2.50/$15)
Agentic coding: Opus 4.7 wins convincingly (SWE-bench Pro: 64.3% vs. 57.7%)
Tool orchestration: Opus 4.7 leads by 9.2 points on MCP-Atlas (77.3% vs. 68.1%)
Computer use: Opus 4.7 leads (78.0% vs. 75.0% on OSWorld)
Terminal coding: GPT-5.4 wins (75.1% vs. 69.4%)
Web research: GPT-5.4 dominates (89.3% vs. 79.3%)
Pure reasoning: Effectively tied (GPQA Diamond: 94.2% vs. 94.4%)

Bottom line: Choose Opus 4.7 when your workflow involves autonomous multi-file coding and complex tool chains. Choose GPT-5.4 for web research, terminal-heavy work, or when cost matters more than agentic capability.

Claude Opus 4.7 vs. Gemini 3.1 Pro

Gemini 3.1 Pro offers the same 1M context window at 2.5x cheaper input pricing ($2/1M) and 2x cheaper output pricing ($12/1M). On the Artificial Analysis Intelligence Index, both score 57 — a statistical tie.

Agentic coding: Opus 4.7 dominates (SWE-bench Pro: 64.3% vs. 54.2%)
Tool orchestration: Opus 4.7 leads (MCP-Atlas: 77.3% vs. 73.9%)
Hallucination resistance: Gemini wins (AA-Omniscience: 33 vs. 26)
Web research: Gemini leads (85.9% vs. 79.3%)
Reasoning: Effectively tied (GPQA Diamond: 94.2% vs. 94.3%)

Bottom line: For cost-sensitive workloads that don't require top-tier agentic coding, Gemini 3.1 Pro offers dramatically better value. Opus 4.7 is worth the premium only when agentic accuracy directly impacts your business outcomes.

Claude Opus 4.7 vs. Claude Opus 4.6

Same pricing ($5/$25), same context window, same max output. But the new tokenizer inflates token counts by 1.0–1.35x, creating an effective cost increase of up to 35% without any rate change. Community testing shows average English text inflating by 12–18%, with worst-case scenarios (code comments, URLs, special characters) hitting the full 35% ceiling.

Opus 4.6 remains fully available via claude-opus-4-6 with no announced deprecation (Google Vertex AI documentation lists retirement no sooner than February 5, 2027). If your workflow depends on long-context retrieval or you've optimized prompts around extended thinking, staying on Opus 4.6 may be the pragmatic choice.

Real-World Enterprise Results

Early-access enterprise partners reported transformative improvements:

Cursor: Internal benchmark jumped from 58% to 70% — called it "an easy upgrade decision"
Box: 56% fewer model calls and 50% fewer tool invocations for identical production workflows
Rakuten: 3x more production task resolution
Notion: 14% improvement on complex multi-step workflows, one-third the tool errors
Harvey (legal AI): 90.9% on BigLaw Bench at high effort
Hex: "Low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6"
Vercel: Praised for being "phenomenal on one-shot coding tasks"
KPMG: Announced a global strategic alliance integrating Claude across its Digital Gateway platform for all 276,000+ employees

Community Reception: Sharp Polarization

The launch has been sharply divided along use-case lines.

The Good

Major coding tools — Cursor, Replit, GitHub Copilot, Vercel, and Bolt — switched to Opus 4.7 immediately. Enterprise partners adopted at launch. The self-verification and improved agentic capabilities are genuine breakthroughs for production workflows.

The Bad

Individual developers and creative users pushed back hard. A Reddit post titled "Claude Opus 4.7 is a serious regression, not an upgrade" received over 2,300 upvotes in 24 hours. The top Hacker News comment (1,200+ upvotes) flagged that adaptive thinking removed visible reasoning traces by default.

The tokenizer controversy became the single most debated aspect of the release. Community consensus formed quickly that the new tokenizer constitutes a hidden 20–35% price increase buried in documentation rather than disclosed prominently. One documented case showed identical content growing from 4,262 tokens to 5,657 tokens — a 33% increase. Users reported hitting rate limits and plan caps dramatically faster.

Security professionals were particularly frustrated. Joseph Thacker, who holds approved access under Anthropic's Cyber Verification Program, reported that "opus 4.7 is blocking for all bug hunters/testers even if you're approved for cyber use." The Project Glasswing safeguards were catching legitimate security research, and Anthropic did not publicly acknowledge the issue at launch.

Writers and creative users reported a significant quality regression, with consistent complaints that the model feels "explicitly enterprise-grade" — replacing warmth with efficiency and favoring direct, opinionated output over nuanced creative range.

API Changes and Migration Considerations

Three breaking API changes require immediate attention:

Extended thinking removed: The budget_tokens parameter returns a 400 error. Adaptive reasoning is the only mode.
Sampling parameters restricted: temperature, top_p, and top_k at non-default values return 400 errors.
Thinking content hidden by default: Callers must opt in with display: "summarized" to see reasoning traces.

For teams with existing Opus 4.6 prompt stacks, parallel evaluation is strongly recommended before migrating. The model's behavioral shifts — more direct tone, different tokenization, removed parameters — mean that prompts optimized for Opus 4.6 may not transfer cleanly.

Best Use Cases for Claude Opus 4.7

Production agentic coding and CI/CD automation — The strongest generally available model for delegating complex, multi-file engineering tasks. Self-verification catches errors before they propagate.
Multi-tool orchestration — The 77.3% MCP-Atlas score makes it the clear leader for workflows chaining 5+ tool calls together.
Computer-use agents and vision-heavy workflows — Higher resolution image processing and dramatic improvements in chart/UI reasoning make it ideal for automated testing, document OCR, and design review.
Enterprise document analysis and financial workflows — Leading scores on Finance Agent v1.1 and strong hallucination reduction make it the best choice for legal, financial, and compliance-oriented knowledge work.

When to Choose an Alternative

Use Opus 4.6 if your workflow depends on precise long-context retrieval (256K+ tokens) or you've built infrastructure around extended thinking.
Use GPT-5.4 for web research, terminal-heavy coding, or when cost efficiency is paramount.
Use Gemini 3.1 Pro for cost-sensitive workloads where hallucination resistance and scientific reasoning matter more than agentic coding prowess.

The Bottom Line

Claude Opus 4.7 is a genuinely impressive model that cements Anthropic's leadership in agentic AI. Its improvements in software engineering (+10.9 points on SWE-bench Pro), tool orchestration (+9.2 points over GPT-5.4 on MCP-Atlas), and general agentic knowledge work (1,753 Elo on GDPval-AA) are real, measurable, and transformative for production deployments.

But the release also reveals growing tensions in Anthropic's strategy. The hidden tokenizer cost increase, collapsing long-context retrieval scores, over-aggressive safety guardrails, and removal of developer-facing controls like visible reasoning traces suggest a company increasingly optimizing for enterprise contracts at the expense of its individual developer community. The explicit acknowledgment that a stronger model exists but is being held back adds a layer of strategic complexity that competitors will exploit.

For enterprise teams building production agent systems, Opus 4.7 is the best tool available today — full stop. For individual developers, researchers, writers, and security professionals, the value proposition is far less clear, and Opus 4.6 remains a viable and in some dimensions superior alternative. Choose based on your specific workload, not the aggregate benchmark score.

Comments (0)

Share:X Hatena

Back to Blog