AI Model Rankings
Comprehensive AI model rankings across 17 benchmarks. Detailed comparisons by category.
Comprehensive Ranking
Overall AI model ranking across HLE, ARC-AGI-2, FrontierMath, SWE-bench Verified, and τ²-Bench.
5 benchmarks
Coding Capability
Programming ability benchmarks: SWE-bench Verified, LiveCodeBench, SWE-bench Pro, Aider-Polyglot.
4 benchmarks
Math Capability
Mathematical reasoning benchmarks: AIME 2025/2026, FrontierMath, MATH-500, GSM8K.
5 benchmarks
AI Agent Capability
Autonomous agent benchmarks: τ²-Bench, Terminal Bench Hard, Aider-Polyglot.
3 benchmarks
Reasoning Capability
Reasoning and thinking benchmarks: HLE, ARC-AGI-2, GPQA Diamond.
3 benchmarks
General Performance
General AI performance: MMLU-Pro, LMArena Elo ratings.
2 benchmarks
OpenClaw Ranking
OpenClaw agent performance: Claw Bench and Pinch Bench.
2 benchmarks
Comprehensive Ranking
Overall scores across HLE, ARC-AGI-2, FrontierMath, SWE-bench Verified, and τ²-Bench
698 models
| # | Model | Developer | HLE | ARC-AGI-2 | FrontierMath | SWE-bench Verified | τ²-Bench | Open Source |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 64.7 | — | — | 93.9 | — | Closed |
| 2 | GPT-5.4 Pro | OpenAI | 58.7 | 83.3 | 38.0 | — | — | Closed |
| 3 | Muse Spark | Meta AI | 58.0 | 42.5 | 14.6 | 77.4 | — | Closed |
| 4 | GPT-5.5 Pro | OpenAI | 57.2 | 84.6 | 39.6 | — | — | Closed |
| 5 | Opus 4.7 | Anthropic | 54.7 | 75.8 | 22.9 | 87.6 | — | Closed |
| 6 | Kimi K2.6 | Moonshot AI | 54.0 | — | — | 80.2 | — | Closed |
| 7 | Qwen3.7-Max-Preview | Alibaba | 53.5 | — | — | 80.4 | — | Closed |
| 8 | Claude Opus 4.6 | Anthropic | 53.0 | 66.3 | 22.9 | 80.8 | 91.9 | Closed |
| 9 | GLM 5.1 | Zhipu AI | 52.3 | — | — | — | — | Closed |
| 10 | GPT-5.5 | OpenAI | 52.2 | 85.0 | 35.4 | — | — | Closed |
| 11 | GPT-5.4 | OpenAI | 52.1 | 77.1 | 27.1 | — | — | Closed |
| 12 | Gemini 3.1 Pro Preview | Google DeepMind | 51.4 | 77.1 | 16.7 | 80.6 | 90.8 | Closed |
| 13 | Kimi K2 Thinking | Moonshot AI | 51.0 | — | — | 71.3 | — | Closed |
| 14 | Qwen 3.6 Plus Preview | Alibaba | 50.6 | — | — | 78.8 | — | Closed |
| 15 | GLM-5 | Zhipu AI | 50.4 | 4.9 | 2.1 | 77.8 | 89.7 | Closed |
| 16 | Kimi K2.5 | Moonshot AI | 50.2 | 11.8 | 4.2 | 76.8 | — | Closed |
| 17 | Qwen3.6-Max-Preview | Alibaba | 50.2 | — | — | 78.8 | — | Closed |
| 18 | GPT-5.2 Pro | OpenAI | 50.0 | 54.2 | 31.3 | — | — | Closed |
| 19 | Qwen3-Max-Thinking | Alibaba | 49.8 | — | — | 75.3 | 82.1 | Closed |
| 20 | Claude Sonnet 4.6 | Anthropic | 49.0 | 58.3 | 8.3 | 79.6 | — | Closed |
| 21 | Qwen3.5-27B | Alibaba | 48.5 | — | — | 72.4 | 79.0 | Closed |
| 22 | Gemini 3 Deep Think - 2620 | Google DeepMind | 48.4 | 84.6 | — | — | — | Closed |
| 23 | Qwen3.5-397B-A17B | Alibaba | 48.3 | — | — | 76.4 | 86.7 | Closed |
| 24 | DeepSeek-V4-Pro | DeepSeek | 48.2 | — | — | 80.6 | — | Closed |
| 25 | Gemini 3.0 Pro (Preview 11-2025) | Google DeepMind | 45.8 | 45.1 | 18.8 | 76.2 | 85.4 | Closed |
| 26 | GPT-5.2 | OpenAI | 45.5 | 54.2 | 18.8 | 80.0 | 82.0 | Closed |
| 27 | DeepSeek-V4-Flash | DeepSeek | 45.1 | — | — | 79.0 | — | Closed |
| 28 | Grok 4 Heavy | xAI | 44.4 | — | 2.1 | 73.5 | — | Closed |
| 29 | Gemini 3.0 Flash | Google DeepMind | 43.5 | 33.6 | 4.2 | 68.7 | 90.2 | Closed |
| 30 | Opus 4.5 | Anthropic | 43.2 | 37.6 | 4.2 | 80.9 | 82.0 | Closed |