
SWE-bench Verified 2026 Rankings: Comparing Coding Performance Across 90 AI Models

SWE-bench Verified has emerged as one of the most trusted benchmarks for measuring the real-world coding capabilities of AI. Unlike synthetic tests, this benchmark consists of a subset of 500 actual GitHub issues verified by human annotators, drawn from prominent open-source projects like Django, Flask, and scikit-learn.

As of May 2026, 90 models have been evaluated. Here is a comprehensive breakdown of the latest results.

Top 10 Leaderboard: May 2026

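| Rank | Model | Developer | Score | Pricing (input / output, USD per 1M tokens) | License |
|------|-------|-----------|-------|---------------------------------------------|---------|
| 1 | Claude Mythos Preview | Anthropic | 93.9% | Undisclosed | Closed |
| 2 | Claude Opus 4.7 (Adaptive) | Anthropic | 87.6% | $5.00 / $25.00 | Closed |
| 3 | GPT-5.3 Codex | OpenAI | 85.0% | Undisclosed | Closed |
| 4 | Claude Opus 4.5 | Anthropic | 80.9% | Undisclosed | Closed |
| 5 | Claude Opus 4.6 | Anthropic | 80.8% | $5.00 / $25.00 | Closed |
| 5 | DeepSeek V4 Pro (Max) | DeepSeek | 80.6% | $1.74 / $3.48 | Open |
| 5 | Gemini 3.1 Pro | Google DeepMind | 80.6% | $2.50 / $15.00 | Closed |
| 8 | Kimi K2.6 | Moonshot AI | 80.2% | $0.95 / $4.00 | Open |
| 8 | MiniMax M2.5 | MiniMax | 80.2% | $0.30 / $1.20 | Open |
| 10 | GPT-5.2 | OpenAI | 80.0% | $1.25 / $10.00 | Closed |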

Key Analysis: Market Shifts

1. Anthropic's Dominance

Anthropic currently leads the leaderboard, holding four of the top five spots. The Mythos Preview's score of 93.9% puts it 6.3 percentage points ahead of the second-place Opus 4.7 (87.6%).

This isn't just a fluke of one benchmark. Anthropic has maintained top-tier performance across SWE-bench, GPQA Diamond, and HLE, which suggests broad capability rather than tuning to a single test.

2. The Rise of Chinese Models

In 2025, Chinese models were virtually absent from the top 10. By May 2026, the landscape has shifted dramatically.

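| Model | Developer | Score | Rank |
|-------|-----------|-------|------|
| DeepSeek V4 Pro (Max) | DeepSeek | 80.6% | 5th |
| Kimi K2.6 | Moonshot AI | 80.2% | 8th |
| MiniMax M2.5 | MiniMax | 80.2% | 8th |
| Qwen3.6 Plus | Alibaba | 78.8% | 12th |
| MiMo-V2-Pro | Xiaomi | 78.0% | 13th |
| GLM-5 | Zhipu AI | 77.8% | 15th |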

Three Chinese models now sit in the top 10, and DeepSeek V4 Pro (Max) performs on par with Gemini 3.1 Pro and Claude Opus 4.6. Notably, MiniMax M2.5 is the most cost-effective model in the top 10, reaching 80.2% at $0.30 / $1.20 per 1M input/output tokens.

3. OpenAI's Current Struggle

Surprisingly, GPT-5.2 sits in 10th place with 80.0%. GPT-5.3 Codex ranks 3rd (85.0%), but it is a specialized coding model. As a general-purpose model, GPT-5.2 is currently outscored by DeepSeek V4 Pro (Max) at 80.6% and Kimi K2.6 at 80.2%.

4. Google's Positioning

Gemini 3.1 Pro remains competitive at 80.6% (tied for 5th). Google continues to deliver frontier-class performance, but it lacks a standout leader in this specific category. All eyes are on the upcoming Gemini 3.5 Pro release to see if it can reclaim the top spots.

Price-to-Performance Analysis

Dividing each model's SWE-bench score by its output price per 1M tokens gives a rough measure of value, and the differences are large.

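| Model | Score | Input ($/1M tokens) | Output ($/1M tokens) | Score per Dollar (Output) |
|-------|-------|---------------------|----------------------|---------------------------|
| MiniMax M2.5 | 80.2% | $0.30 | $1.20 | 66.8 |
| DeepSeek V4 Pro (Max) | 80.6% | $1.74 | $3.48 | 23.2 |
| Kimi K2.6 | 80.2% | $0.95 | $4.00 | 20.1 |
| GPT-5.2 | 80.0% | $1.25 | $10.00 | 8.0 |
| Gemini 3.1 Pro | 80.6% | $2.50 | $15.00 | 5.4 |
| Claude Sonnet 4.6 | 79.6% | $3.00 | $15.00 | 5.3 |
| Claude Opus 4.6 | 80.8% | $5.00 | $25.00 | 3.2 |
| Claude Opus 4.7 | 87.6% | $5.00 | $25.00 | 3.5 |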

MiniMax M2.5 stands out. It scores within 0.6 points of Claude Opus 4.6 (80.2% vs 80.8%) at an output price roughly 21 times lower ($1.20 vs $25.00 per 1M tokens).
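The "Score per Dollar (Output)" column can be reproduced with a few lines of Python. This is a minimal sketch that assumes the metric is simply the score in percentage points divided by the output price per 1M tokens, which matches the published figures; the names `MODELS` and `score_per_output_dollar` are illustrative and not taken from any library.

```python
# Scores (%) and output prices (USD per 1M tokens), copied from the table above.
MODELS = {
    "MiniMax M2.5": (80.2, 1.20),
    "DeepSeek V4 Pro (Max)": (80.6, 3.48),
    "Kimi K2.6": (80.2, 4.00),
    "GPT-5.2": (80.0, 10.00),
    "Gemini 3.1 Pro": (80.6, 15.00),
    "Claude Sonnet 4.6": (79.6, 15.00),
    "Claude Opus 4.6": (80.8, 25.00),
    "Claude Opus 4.7": (87.6, 25.00),
}


def score_per_output_dollar(score: float, output_price: float) -> float:
    """Assumed metric: score points per USD of output price (per 1M tokens)."""
    return score / output_price


# Sort models from best to worst value under this metric.
ranked = sorted(
    MODELS.items(),
    key=lambda item: score_per_output_dollar(*item[1]),
    reverse=True,
)

for name, (score, price) in ranked:
    print(f"{name:24s} {score_per_output_dollar(score, price):6.1f}")

# Output-price ratio between Claude Opus 4.6 and MiniMax M2.5.
opus_vs_minimax = MODELS["Claude Opus 4.6"][1] / MODELS["MiniMax M2.5"][1]
print(f"Opus 4.6 output price is {opus_vs_minimax:.1f}x MiniMax M2.5")
```

The metric ignores input-token pricing and the number of tokens each model actually consumes per task, so it is best read as a rough comparison rather than a true cost per solved issue.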

Open Source vs. Closed

Looking at the distribution across 90 models, the gap between open and closed models is narrowing.

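| License | Top 10 | Top 20 | Top 50 | Total (90) |
|---------|--------|--------|--------|------------|
| Closed | 7 | 12 | 25 | ~50 |
| Open | 3 | 8 | 25 | ~40 |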

Having three open-source models (DeepSeek, Kimi, and MiniMax) in the top 10 represents a dramatic shift from 2025, when open-source models were almost non-existent at the highest tier.
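To make the distribution easier to compare across tiers, the counts can be converted to the share of open models in each tier. The sketch below uses the counts from the table above; the totals for all 90 models are approximate in the source, so the last figure is approximate as well. The names `TIERS` and `open_share` are illustrative.

```python
# (closed, open) model counts per tier, copied from the table above.
# The "All 90" row uses the approximate totals (~50 closed, ~40 open).
TIERS = {
    "Top 10": (7, 3),
    "Top 20": (12, 8),
    "Top 50": (25, 25),
    "All 90 (approx.)": (50, 40),
}


def open_share(closed: int, open_count: int) -> float:
    """Fraction of models in a tier that are open."""
    return open_count / (closed + open_count)


shares = {tier: open_share(c, o) for tier, (c, o) in TIERS.items()}

for tier, share in shares.items():
    print(f"{tier:18s} {share:.0%}")
```

Under these counts, open models make up a smaller share of the top 10 than of the top 20 or top 50, so the gap is narrowest outside the very highest tier.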

Understanding Benchmark Limitations

While SWE-bench Verified is a gold standard, it primarily evaluates Python open-source projects. High scores do not necessarily guarantee:

  • Domain-specific performance: Results may vary on proprietary or niche codebases.
  • Multi-language proficiency: Performance in JavaScript, TypeScript, Go, or Rust may differ.
  • Long-term project management: The test measures the ability to solve a single issue, not the capacity to manage a project over weeks.
  • Creative architecture: Bug fixing is a different skill set than designing new features from scratch.

Outlook for Late 2026

We are rapidly approaching a ceiling on SWE-bench Verified. A score of 93.9% corresponds to roughly 470 of the 500 issues solved (93.9% of 500 is 469.5). The remaining 30 or so are problems that human annotators judged solvable but that current models still fail.
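The issue-count arithmetic can be checked directly. The sketch below converts a reported percentage into the nearest whole-issue counts, assuming all 500 issues are scored; `solved_bounds` is an illustrative helper, and `Decimal` is used so that values such as 87.6% do not pick up floating-point rounding noise.

```python
import math
from decimal import Decimal

TOTAL_ISSUES = 500  # size of SWE-bench Verified, as stated in the article


def solved_bounds(score_pct: str, total: int = TOTAL_ISSUES) -> tuple[int, int]:
    """Return the whole-issue counts just below and just above a reported score.

    The score is passed as a string (e.g. "93.9") to keep the arithmetic exact.
    """
    exact = Decimal(score_pct) * total / 100
    return math.floor(exact), math.ceil(exact)


low, high = solved_bounds("93.9")
print(f"93.9% of {TOTAL_ISSUES} lies between {low} and {high} issues")
print(f"{low}/{TOTAL_ISSUES} = {low / TOTAL_ISSUES:.1%}, "
      f"{high}/{TOTAL_ISSUES} = {high / TOTAL_ISSUES:.1%}")
```

Because 93.9% of 500 is not a whole number, the exact count of solved issues cannot be recovered from the percentage alone; it is bracketed by 469 and 470.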

Future benchmarks will likely shift toward:

  • SWE-bench Pro: Higher difficulty subsets.
  • Terminal-Bench 2.1: Complex, terminal-based agentic tasks.
  • Multi-file/Multi-step execution: Long-term project autonomy.

Conclusion

The 2026 SWE-bench Verified results highlight three structural changes in AI coding:

  1. Anthropic's Supremacy: With the Mythos model, Anthropic has established itself as the current king of coding.
  2. The Frontier Parity of Chinese Models: Players like DeepSeek, Kimi, and MiniMax have reached parity with the best of US-based frontier models.
  3. The Democratization of Performance: Extreme cost-efficiencies, exemplified by MiniMax M2.5, are making frontier-level coding assistance accessible to everyone.
