AI Agent Capability

Autonomous agent benchmarks: τ²-Bench, Terminal Bench Hard, Aider-Polyglot.

698 models

# | Model | Developer | Scores (as listed) | Open Source
1 | Claude Opus 4.6 | Anthropic | 91.9 / 91.9 / 65.4 | Closed
2 | Gemini 3.1 Pro Preview | Google DeepMind | 90.8 / 90.8 / 68.5 | Closed
3 | Gemini 3.0 Flash | Google DeepMind | 90.2 / 90.2 / 47.6 | Closed
4 | GLM-5 | Zhipu AI | 89.7 / 89.7 / 61.1 | Closed
5 | Step 3.5 Flash | StepFun | 88.2 / 88.2 / 51.0 | Closed
6 | GLM-4.7 | Zhipu AI | 87.4 / 87.4 / 41.0 | Closed
7 | Qwen3.5-397B-A17B | Alibaba | 86.7 / 86.7 / 52.5 | Closed
8 | Gemini 3.0 Pro (Preview 11-2025) | Google DeepMind | 85.4 / 85.4 / 54.2 | Closed
9 | Claude Sonnet 4.5 | Anthropic | 84.7 / 71.0 | Closed
10 | Grok 4.1 Fast | xAI | 82.7 / 82.7 | Closed
11 | Qwen3-Max-Thinking | Alibaba | 82.1 / 82.1 | Closed
12 | GPT-5.2 | OpenAI | 82.0 / 82.0 | Closed
13 | Opus 4.5 | Anthropic | 82.0 / 82.0 / 59.3 | Closed
14 | DeepSeek V3.2 | DeepSeek | 80.3 / 80.3 / 46.4 | Closed
15 | GPT-5 | OpenAI | 80.0 / 80.0 | Closed
16 | GLM-4.7-Flash | Zhipu AI | 79.5 / 79.5 | Closed
17 | Qwen3.5-27B | Alibaba | 79.0 / 79.0 / 41.6 | Closed
18 | MiniMax M2 | MiniMax | 77.2 / 77.2 | Closed
19 | Gemma 4 31B | Google DeepMind | 76.9 / 76.9 | Closed
20 | GLM-4.6 | Zhipu AI | 75.9 / 75.9 | Closed
21 | Qwen3 Max (Preview) | Alibaba | 74.0 / 74.0 | Closed
22 | Claude Opus 4 | Anthropic | 72.5 / 72.5 | Closed
23 | Gemma 4 26B A4B | Google DeepMind | 68.2 / 68.2 | Closed
24 | DeepSeek V3.2-Exp | DeepSeek | 66.7 / 66.7 | Closed
25 | Kimi K2 | Moonshot AI | 64.3 / 64.3 | Closed
26 | Claude Sonnet 3.7 | Anthropic | 61.8 / 61.8 | Closed
27 | OpenAI o4 - mini | OpenAI | 56.9 / 56.9 | Closed
28 | GPT-4.1 | OpenAI | 54.7 / 54.7 | Closed
29 | GPT-4.1 mini | OpenAI | 53.0 / 53.0 | Closed
30 | Claude Sonnet 4 | Anthropic | 52.0 / 52.0 | Closed

About Benchmarks

τ²-Bench
Autonomous agent tasks: measures the ability to combine tool calls with reasoning.
Terminal Bench Hard
Terminal-based agent tasks: measures autonomous capability in CLI environments.
Aider-Polyglot
Multi-language coding assistant benchmark: measures coding ability across multiple programming languages.