Benchmark · 2026-07-01

Claude Sonnet 5 Arrives: Anthropic's Mid-Range Model Beats GPT-5.5 on 5 of 7 Benchmarks

On June 30, 2026, Anthropic officially released Claude Sonnet 5, the most powerful model in the Sonnet lineup to date. Serving as the new default for Free and Pro users, Sonnet 5 beats GPT-5.5 on five of seven shared benchmarks, at 40% of Opus 4.8's price during the promotional period and 60% at the standard rate.

This is more than a routine upgrade. Sonnet 5 represents a watershed moment: for the first time, a mid-tier model outperforms the previous generation's flagship across multiple key benchmarks. For developers and enterprise users alike, the question of which model to use — and how to balance cost against performance — has never been more nuanced.

Specifications at a Glance

Spec	Claude Sonnet 5	Claude Sonnet 4.6	Claude Opus 4.8	GPT-5.5
Release Date	June 30, 2026	March 2026	May 28, 2026	April 23, 2026
Developer	Anthropic	Anthropic	Anthropic	OpenAI
Context Window	1M tokens	200K tokens	1M tokens	1,050K tokens
Max Output	128K tokens	—	128K tokens	—
Input Price (/1M tokens)	$2 (promo) → $3	$3	$5	$5
Output Price (/1M tokens)	$10 (promo) → $15	$15	$25	$30
Cache Hit Discount	90%	90%	90%	Yes
Batch Discount	50%	50%	50%	50%

Sonnet 5's pricing is aggressively competitive. During the promotional period (through August 31, 2026), input costs drop to just $2 per 1M tokens, matching Gemini 3.1 Pro on input price while delivering significantly better performance. Even at its standard rate of $3 per 1M input tokens, it costs only 60% of GPT-5.5's input price, and its $15 output rate is half of GPT-5.5's $30.
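The price ratios in this section follow directly from the list prices in the specification table. Here is a minimal Python sketch of that arithmetic, using only prices quoted in this article; the dictionary layout and the `price_ratio` helper are illustrative names, not part of any vendor API.

```python
# Per-1M-token list prices (USD) as quoted in the specification table.
# The structure and names here are assumptions made for illustration.
PRICES = {
    "Claude Sonnet 5 (promo)": {"input": 2.0, "output": 10.0},
    "Claude Sonnet 5 (standard)": {"input": 3.0, "output": 15.0},
    "GPT-5.5": {"input": 5.0, "output": 30.0},
}


def price_ratio(model: str, baseline: str, kind: str) -> float:
    """Return the model's price as a fraction of the baseline's price."""
    return PRICES[model][kind] / PRICES[baseline][kind]


if __name__ == "__main__":
    for model in ("Claude Sonnet 5 (promo)", "Claude Sonnet 5 (standard)"):
        for kind in ("input", "output"):
            ratio = price_ratio(model, "GPT-5.5", kind)
            print(f"{model} {kind}: {ratio:.0%} of GPT-5.5")
```

At the standard rate, input comes out at 60% of GPT-5.5's price and output at 50%; at the promo rate, input is 40% and output roughly a third.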

Benchmark Breakdown: How Strong Is Sonnet 5?

Coding

Benchmark	Sonnet 5	Sonnet 4.6	Opus 4.8	GPT-5.5
SWE-bench Pro (Agentic Coding)	63.2%	58.1%	69.2%	58.6%
Terminal-Bench 2.1 (Terminal Coding)	80.4%	67.0%	82.7%	78.2%

SWE-bench Pro is the gold standard for evaluating AI coding agents. Sonnet 5's 63.2% beats Sonnet 4.6 by 5.1 percentage points and also surpasses GPT-5.5's 58.6%. In practical terms, Sonnet 5 resolves the benchmark's real GitHub issues 4.6 percentage points more often than OpenAI's model.

Terminal-Bench 2.1 measures coding ability during extended terminal sessions. Here, Sonnet 5 scores 80.4%, again besting GPT-5.5 (78.2%) and closing in on Opus 4.8 (82.7%).

Computer Use

Benchmark	Sonnet 5	Sonnet 4.6	Opus 4.8	GPT-5.5
OSWorld-Verified (Desktop)	81.2%	78.5%	83.4%	78.7%

OSWorld-Verified tests how well an AI can operate within real desktop environments. Sonnet 5 reaches 81.2%, edging out GPT-5.5 (78.7%) and narrowing the gap with Opus 4.8 (83.4%) to just 2.2 points. For enterprises exploring AI-powered alternatives to traditional RPA, this is a meaningful signal.

Knowledge & Reasoning

Benchmark	Sonnet 5	Sonnet 4.6	Opus 4.8	GPT-5.5
HLE (with tools)	57.4%	46.8%	57.9%	52.2%
HLE (no tools)	43.2%	34.6%	49.8%	41.4%

Humanity's Last Exam (HLE) is among the most demanding reasoning benchmarks available. With tool access, Sonnet 5 hits 57.4%, virtually matching Opus 4.8 (57.9%) and pulling well ahead of GPT-5.5 (52.2%). Without tools, the gap to Opus 4.8 is wider (43.2% vs 49.8%), though Sonnet 5 still leads GPT-5.5 (41.4%). This indicates that Sonnet 5's tool-assisted reasoning is now approaching Anthropic's top-tier flagship.

Pricing Comparison: The New Value King

Model	Input (/1M tokens)	Output (/1M tokens)	100K Input + 10K Output
Claude Sonnet 5 (promo)	$2	$10	$0.30
Claude Sonnet 5 (standard)	$3	$15	$0.45
Claude Sonnet 4.6	$3	$15	$0.45
Claude Opus 4.8	$5	$25	$0.75
GPT-5.5	$5	$30	$0.80
Gemini 3.1 Pro	$2	$8	$0.28

Take a typical coding task: 100K tokens of input (code context) plus 10K tokens of output (generated code). At Sonnet 5's promo price, that costs just $0.30 — less than 40% of GPT-5.5's $0.80. Even after the promotional window closes ($0.45), it's still only 56% of GPT-5.5's price.

It's worth noting that Gemini 3.1 Pro is even cheaper ($0.28), but it falls well behind Sonnet 5 on both coding and computer-use benchmarks. On a performance-per-dollar basis, Sonnet 5 is the clear winner.
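The last column of the pricing table can be reproduced with a few lines of Python. This is a sketch under simple assumptions: it uses only the list prices above, ignores cache and batch discounts, and `request_cost` is a hypothetical helper name.

```python
# Per-1M-token list prices (USD) from the pricing table: (input, output).
PRICES = {
    "Claude Sonnet 5 (promo)": (2.0, 10.0),
    "Claude Sonnet 5 (standard)": (3.0, 15.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Claude Opus 4.8": (5.0, 25.0),
    "GPT-5.5": (5.0, 30.0),
    "Gemini 3.1 Pro": (2.0, 8.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, ignoring cache and batch discounts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # The example workload from the article: 100K input + 10K output tokens.
    baseline = request_cost("GPT-5.5", 100_000, 10_000)
    for model in PRICES:
        cost = request_cost(model, 100_000, 10_000)
        print(f"{model}: ${cost:.2f} ({cost / baseline:.0%} of GPT-5.5)")
```

Changing the token counts shows how the comparison shifts for output-heavy workloads, where GPT-5.5's $30 output rate weighs more heavily.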

Sonnet 5 vs Sonnet 4.6: Is the Upgrade Worth It?

The improvements over Sonnet 4.6 are comprehensive:

Dimension	Improvement	Notes
SWE-bench Pro	+5.1pp	Meaningful coding gains
Terminal-Bench 2.1	+13.4pp	Major leap in terminal proficiency
OSWorld-Verified	+2.7pp	More reliable desktop operations
HLE (with tools)	+10.6pp	Qualitative reasoning leap
HLE (no tools)	+8.6pp	Substantial unaided reasoning gains
Context Window	5×	200K → 1M tokens
Input Price	−33%	$3 → $2 (promo)

The standout improvements are the 13.4-point jump on Terminal-Bench 2.1 and the 5× context window expansion to 1M tokens. For current Sonnet 4.6 users, this is a no-brainer upgrade: better performance, a bigger context window, and a lower price during the promo (the same $3/$15 price afterward).
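The percentage-point gains in the table above are simple differences between the two models' scores. A short Python sketch, with scores copied from the benchmark tables earlier in this article (the `improvements` helper is an illustrative name):

```python
# Scores (%) copied from the benchmark tables: (Sonnet 5, Sonnet 4.6).
SCORES = {
    "SWE-bench Pro": (63.2, 58.1),
    "Terminal-Bench 2.1": (80.4, 67.0),
    "OSWorld-Verified": (81.2, 78.5),
    "HLE (with tools)": (57.4, 46.8),
    "HLE (no tools)": (43.2, 34.6),
}


def improvements(scores: dict) -> dict:
    """Return the percentage-point gain per benchmark, rounded to one decimal."""
    return {name: round(new - old, 1) for name, (new, old) in scores.items()}


if __name__ == "__main__":
    gains = improvements(SCORES)
    for name, gain in gains.items():
        print(f"{name}: +{gain}pp")
    # Benchmark with the largest gain over Sonnet 4.6.
    print("Largest gain:", max(gains, key=gains.get))
```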

Sonnet 5 vs GPT-5.5: Real-World Differences

Across seven shared benchmarks, Sonnet 5 takes a 5–2 lead:

Benchmark	Winner	Margin
SWE-bench Pro	Sonnet 5	+4.6pp
Terminal-Bench 2.1	Sonnet 5	+2.2pp
OSWorld-Verified	Sonnet 5	+2.5pp
HLE (with tools)	Sonnet 5	+5.2pp
HLE (no tools)	Sonnet 5	+1.8pp
CursorBench v3.1	GPT-5.5	+3.1pp
GDPval-AA	GPT-5.5	+151 Elo

GPT-5.5 still holds the edge on CursorBench (IDE-integrated coding) and GDPval-AA (real-world workloads), suggesting OpenAI retains advantages in productization and deployment maturity. But Sonnet 5's sweeping wins on core capability benchmarks — combined with its significant price advantage — make it the stronger choice for most use cases.
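The 5–2 tally can be checked against the numbers in this article. The sketch below is illustrative only: it computes margins for the five benchmarks whose scores appear in the tables above, and for CursorBench v3.1 and GDPval-AA it records just the winner, since the article does not list both models' scores for those two.

```python
# Scores (%) from the benchmark tables: (Sonnet 5, GPT-5.5).
SCORED = {
    "SWE-bench Pro": (63.2, 58.6),
    "Terminal-Bench 2.1": (80.4, 78.2),
    "OSWorld-Verified": (81.2, 78.7),
    "HLE (with tools)": (57.4, 52.2),
    "HLE (no tools)": (43.2, 41.4),
}
# Benchmarks where the article reports the winner but not both scores.
REPORTED_WINNERS = {
    "CursorBench v3.1": "GPT-5.5",
    "GDPval-AA": "GPT-5.5",
}


def margins(scored: dict) -> dict:
    """Sonnet 5's lead in percentage points, rounded to one decimal."""
    return {name: round(s - g, 1) for name, (s, g) in scored.items()}


def tally(scored: dict, reported: dict) -> dict:
    """Count wins per model. Ties are not handled; none occur in this data."""
    wins = {"Sonnet 5": 0, "GPT-5.5": 0}
    for sonnet, gpt in scored.values():
        wins["Sonnet 5" if sonnet > gpt else "GPT-5.5"] += 1
    for winner in reported.values():
        wins[winner] += 1
    return wins


if __name__ == "__main__":
    print(margins(SCORED))
    print(tally(SCORED, REPORTED_WINNERS))
```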

Which Model Should You Use?

For Developers

Use Case	Recommended Model	Why
Agentic coding (complex bug fixes, refactoring)	Claude Sonnet 5	SWE-bench Pro 63.2%, best value
IDE-integrated coding (daily work)	GPT-5.5	CursorBench 64.3%, deeper IDE integration
Terminal ops, long-running automation	Claude Sonnet 5	Terminal-Bench 80.4%, beats GPT-5.5
Mission-critical tasks requiring max accuracy	Claude Opus 4.8	Still the strongest model overall

For Enterprises

Use Case	Recommended Model	Why
Desktop automation / RPA replacement	Claude Sonnet 5	OSWorld 81.2%, 40% of Opus 4.8's price during promo (60% at standard rate)
Large-scale code review	Claude Sonnet 5	1M context + $2 input price (promo; $3 standard)
Customer service automation	GPT-5.5	Higher GDPval-AA, more productization experience
Document analysis, bulk data processing	Gemini 3.1 Pro	2M context + $2 input, lowest cost
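For teams that route requests programmatically, the two recommendation tables above can be expressed as a simple lookup. This is a rough sketch, not a production router: the short use-case keys are hypothetical labels invented here, and the values are this article's display names rather than real API model identifiers.

```python
# Recommendations transcribed from the developer and enterprise tables.
# Keys are hypothetical labels; values are display names, not API model IDs.
RECOMMENDED = {
    "agentic_coding": "Claude Sonnet 5",
    "ide_coding": "GPT-5.5",
    "terminal_automation": "Claude Sonnet 5",
    "mission_critical": "Claude Opus 4.8",
    "desktop_automation": "Claude Sonnet 5",
    "code_review": "Claude Sonnet 5",
    "customer_service": "GPT-5.5",
    "bulk_documents": "Gemini 3.1 Pro",
}

# Fallback for use cases the tables do not cover (an assumption of this sketch).
DEFAULT_MODEL = "Claude Sonnet 5"


def pick_model(use_case: str) -> str:
    """Return the recommended model for a use case, or the default."""
    return RECOMMENDED.get(use_case, DEFAULT_MODEL)


if __name__ == "__main__":
    for case in ("ide_coding", "bulk_documents", "something_else"):
        print(case, "->", pick_model(case))
```

Falling back to Sonnet 5 for unlisted use cases mirrors the article's overall recommendation; swap in a different default if your workload leans toward one of the exceptions.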

Budget-First Strategies

Monthly Budget	Recommended Strategy
Generous	Use Opus 4.8 for critical tasks, Sonnet 5 for everything else
Moderate	Sonnet 5 as your primary model ($2/1M input during the promo, $3 afterward); it covers 90% of scenarios
Tight	Maximize Sonnet 5's promo pricing through August, then reassess whether to step down to Gemini

The Bottom Line: Sonnet 5 Is the Default for H2 2026

The arrival of Claude Sonnet 5 marks a new chapter in the AI model arms race. A mid-tier model has, for the first time, outperformed the previous generation's flagship across multiple key benchmarks — and it's doing so at a friendlier price point.

Key takeaways:

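Sonnet 5 beats GPT-5.5 on 5 of 7 benchmarks while costing less than half as much at promo pricing (about 56% at the standard rate, on the 100K-in/10K-out example)
Sonnet 5's tool-assisted reasoning nearly matches Opus 4.8 (HLE with tools: 57.4% vs 57.9%) at just 40% of the price during the promo (60% at the standard rate)
For Sonnet 4.6 users, upgrading is a no-brainer: better across the board, 5× the context, and a lower price through the promo (the same price afterward)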
Promo pricing runs through August 31 — the $2/$10 window is the best time to try Sonnet 5

The AI model landscape for the second half of 2026 is clear: Sonnet 5 is the default recommendation. Only reach for alternatives when you need maximum precision (Opus 4.8), the largest context window (Gemini 3.1 Pro), or the most mature IDE integration (GPT-5.5).
