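SWE-bench Verified: practical software development tasks; measures real-world bug-fixing ability.
LiveCodeBench: real-time coding benchmark; measures the ability to handle the latest programming problems.
SWE-bench Pro: professional SWE benchmark; measures performance on more complex software development tasks.
Aider-Polyglot: multilingual coding assistant benchmark; measures coding ability across multiple programming languages.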
Coding Capability
Programming ability benchmarks: SWE-bench Verified, LiveCodeBench, SWE-bench Pro, Aider-Polyglot.
698 models
| # | Model | Developer | SWE-bench Verified | LiveCodeBench | SWE-bench Pro | Aider-Polyglot | Open Source |
|---|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 93.9 | — | 77.8 | 82.0 | Closed |
| 2 | Opus 4.7 | Anthropic | 87.6 | — | 64.3 | 69.4 | Closed |
| 3 | Claude Sonnet 4.5 | Anthropic | 82.0 | 71.0 | 43.6 | — | Closed |
| 4 | Claude Sonnet 5 | Anthropic | 82.0 | — | — | — | Closed |
| 5 | Opus 4.5 | Anthropic | 80.9 | 87.0 | — | 59.3 | Closed |
| 6 | Claude Opus 4.6 | Anthropic | 80.8 | 76.0 | — | 65.4 | Closed |
| 7 | Gemini 3.1 Pro Preview | Google DeepMind | 80.6 | 91.7 | 54.2 | 68.5 | Closed |
| 8 | DeepSeek-V4-Pro | DeepSeek | 80.6 | 93.5 | — | 59.1 | Closed |
| 9 | Qwen3.7-Max-Preview | Alibaba | 80.4 | 91.6 | — | 69.7 | Closed |
| 10 | Kimi K2.6 | Moonshot AI | 80.2 | 89.6 | — | 66.7 | Closed |
| 11 | MiniMax M2.5 | MiniMax | 80.2 | — | 55.4 | 51.7 | Closed |
| 12 | Claude Sonnet 4 | Anthropic | 80.2 | 66.0 | 42.7 | — | Closed |
| 13 | GPT-5.2 | OpenAI | 80.0 | — | 55.6 | — | Closed |
| 14 | Claude Sonnet 4.6 | Anthropic | 79.6 | — | — | 59.1 | Closed |
| 15 | DeepSeek-V4-Flash | DeepSeek | 79.0 | 91.6 | — | 56.9 | Closed |
| 16 | Qwen 3.6 Plus Preview | Alibaba | 78.8 | 87.1 | 56.6 | 61.6 | Closed |
| 17 | Qwen3.6-Max-Preview | Alibaba | 78.8 | 87.1 | — | 65.4 | Closed |
| 18 | GLM-5 | Zhipu AI | 77.8 | — | — | 61.1 | Closed |
| 19 | Muse Spark | Meta AI | 77.4 | — | — | 59.0 | Closed |
| 20 | Qwen3.6-27B | Alibaba | 77.2 | 83.9 | — | 59.3 | Closed |
| 21 | Kimi K2.5 | Moonshot AI | 76.8 | 85.0 | — | 50.8 | Closed |
| 22 | GPT-5.1-Codex-Max | OpenAI | 76.8 | — | — | — | Closed |
| 23 | Qwen3.5-397B-A17B | Alibaba | 76.4 | 83.6 | 50.9 | 52.5 | Closed |
| 24 | GPT-5.1 | OpenAI | 76.3 | — | 50.8 | 47.6 | Closed |
| 25 | Gemini 3.0 Pro (Preview 11-2025) | Google DeepMind | 76.2 | 92.0 | — | 54.2 | Closed |
| 26 | Qwen3-Max-Thinking | Alibaba | 75.3 | 85.9 | — | — | Closed |
| 27 | o3-pro | OpenAI | 75.0 | — | — | — | Closed |
| 28 | M2.1 | MiniMax | 74.8 | — | 32.6 | 47.9 | Closed |
| 29 | Opus 4.1 | Anthropic | 74.5 | — | — | — | Closed |
| 30 | GPT-5 Codex | OpenAI | 74.5 | — | — | — | Closed |