Coding Capability

Programming ability benchmarks: SWE-bench Verified, LiveCodeBench, SWE-bench Pro, Aider-Polyglot.

698 models

# | Model | Developer | Benchmark scores | Open Source
1 | Claude Mythos Preview | Anthropic | 93.9, 77.8, 82.0 | Closed
2 | Opus 4.7 | Anthropic | 87.6, 64.3, 69.4 | Closed
3 | Claude Sonnet 4.5 | Anthropic | 82.0, 71.0, 43.6 | Closed
4 | Claude Sonnet 5 | Anthropic | 82.0 | Closed
5 | Opus 4.5 | Anthropic | 80.9, 87.0, 59.3 | Closed
6 | Claude Opus 4.6 | Anthropic | 80.8, 76.0, 65.4 | Closed
7 | Gemini 3.1 Pro Preview | Google DeepMind | 80.6, 91.7, 54.2, 68.5 | Closed
8 | DeepSeek-V4-Pro | DeepSeek | 80.6, 93.5, 59.1 | Closed
9 | Qwen3.7-Max-Preview | Alibaba | 80.4, 91.6, 69.7 | Closed
10 | Kimi K2.6 | Moonshot AI | 80.2, 89.6, 66.7 | Closed
11 | MiniMax M2.5 | MiniMax | 80.2, 55.4, 51.7 | Closed
12 | Claude Sonnet 4 | Anthropic | 80.2, 66.0, 42.7 | Closed
13 | GPT-5.2 | OpenAI | 80.0, 55.6 | Closed
14 | Claude Sonnet 4.6 | Anthropic | 79.6, 59.1 | Closed
15 | DeepSeek-V4-Flash | DeepSeek | 79.0, 91.6, 56.9 | Closed
16 | Qwen 3.6 Plus Preview | Alibaba | 78.8, 87.1, 56.6, 61.6 | Closed
17 | Qwen3.6-Max-Preview | Alibaba | 78.8, 87.1, 65.4 | Closed
18 | GLM-5 | Zhipu AI | 77.8, 61.1 | Closed
19 | Muse Spark | Meta AI | 77.4, 59.0 | Closed
20 | Qwen3.6-27B | Alibaba | 77.2, 83.9, 59.3 | Closed
21 | Kimi K2.5 | Moonshot AI | 76.8, 85.0, 50.8 | Closed
22 | GPT-5.1-Codex-Max | OpenAI | 76.8 | Closed
23 | Qwen3.5-397B-A17B | Alibaba | 76.4, 83.6, 50.9, 52.5 | Closed
24 | GPT-5.1 | OpenAI | 76.3, 50.8, 47.6 | Closed
25 | Gemini 3.0 Pro (Preview 11-2025) | Google DeepMind | 76.2, 92.0, 54.2 | Closed
26 | Qwen3-Max-Thinking | Alibaba | 75.3, 85.9 | Closed
27 | o3-pro | OpenAI | 75.0 | Closed
28 | M2.1 | MiniMax | 74.8, 32.6, 47.9 | Closed
29 | Opus 4.1 | Anthropic | 74.5 | Closed
30 | GPT-5 Codex | OpenAI | 74.5 | Closed

About Benchmarks

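SWE-bench Verified
Practical software development tasks: measures the ability to fix real bugs.
LiveCodeBench
Real-time coding benchmark: measures the ability to handle recent programming problems.
SWE-bench Pro
Professional SWE benchmark: measures performance on more complex software development tasks.
Aider-Polyglot
Multilingual coding assistant benchmark: measures coding ability across multiple programming languages.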