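What are this model's strengths?

・Top-tier coding performance
・Among the top scores on SWE-bench Verified
・256K context for working with large codebases
・50% discount via the batch API

What are this model's weaknesses?

・Trails GPT-5.2 in general-purpose text generation
・Poorly suited to non-coding uses
・Somewhat expensive

What uses is it best suited for?

・Large-scale code generation
・Refactoring assistance
・Multi-file debugging
・Integration into CI/CD pipelines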


OpenAI · Proprietary

GPT-5.1 Codex Max

Name: GPT-5.1 Codex Max
Price: 2.5 USD
Author: OpenAI

The top-tier version of OpenAI's coding-specialized models. It scores 68.2 on SWE-bench Verified and delivers top-class performance on practical software development tasks.

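Parameters

Undisclosed

Context length

256K

License

Proprietary

Release date

2026-02-10

Japanese performance

✅ High-quality Japanese

Among multilingual models, this one performs particularly well at Japanese language processing.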

API pricing

Input price (per 1M tokens)

$2.5

Output price (per 1M tokens)

$15

Billing mode: standard
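The per-token prices above translate into a per-request cost with simple arithmetic. The following is a minimal sketch, assuming the listed $2.5 input and $15 output prices per 1M tokens and assuming the 50% batch discount named in the strengths list applies equally to input and output tokens. The names `Pricing` and `estimate_cost` are hypothetical and not part of any OpenAI SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float      # USD per 1M input tokens
    output_per_million: float     # USD per 1M output tokens
    batch_discount: float = 0.5   # assumption: 50% off both directions in batch mode


def estimate_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                  batch: bool = False) -> float:
    """Estimate the USD cost of one request, rounded to 4 decimal places."""
    cost = (input_tokens / 1_000_000) * pricing.input_per_million
    cost += (output_tokens / 1_000_000) * pricing.output_per_million
    if batch:
        cost *= 1 - pricing.batch_discount
    return round(cost, 4)


# Prices as listed in the API pricing block on this page.
LISTED = Pricing(input_per_million=2.5, output_per_million=15.0)

if __name__ == "__main__":
    print(estimate_cost(LISTED, 200_000, 20_000))
    print(estimate_cost(LISTED, 200_000, 20_000, batch=True))
```

Keeping the discount as a field rather than a constant makes it easy to rerun the estimate if the actual batch terms differ from this assumption.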

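Strengths

・Top-tier coding performance
・Among the top scores on SWE-bench Verified
・256K context for working with large codebases
・50% discount via the batch API

Weaknesses

・Trails GPT-5.2 in general-purpose text generation
・Poorly suited to non-coding uses
・Somewhat expensive

Example uses

・Large-scale code generation
・Refactoring assistance
・Multi-file debugging
・Integration into CI/CD pipelines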

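In-depth analysis

SWE-Bench

72.1%

Coding-specialized

Input price

$10/1M

High cost

Output price

$30/1M

High cost

Context

200K tokens

For coding

Strengths

・Autonomous coding sessions of 24 hours or more
・Long-running execution through context compaction
・An improved version of GPT-5.1
・Well suited to large-scale refactoring

Weaknesses

・GPT-5.4 brings substantial improvements
・Cost is on the high side
・API availability is limited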

Competitor comparison

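| Model | Arena | SWE | GPQA | Price |
|-----------|-----------|-----------|-----------|-----------|
| GPT-5.1-Codex-Max | 1349 | 77.9% | N/A | $1.25/$10.00 |
| Claude Opus 4.5 | N/A | 80.9% | N/A | $17/month+ |
| Gemini 3 Pro | N/A | 76.2% | N/A | N/A |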

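Overview

GPT-5.1-Codex-Max is OpenAI's frontier model specialized for autonomous software engineering. Released on November 19, 2025, it builds on the GPT-5.1 foundation with additional training for agentic coding tasks. Its defining innovation is "context compaction": a natively trained process that lets the model behave coherently across multiple context windows and sustain work spanning millions of tokens over hours or even days. This takes it beyond simple code completion toward genuinely autonomous development workflows.

Positioned as the default model across OpenAI's Codex ecosystem (CLI, IDE extension, and cloud), Codex-Max targets professional developers and engineering teams who need to handle project-scale refactoring, deep debugging sessions, and long-running agent loops. Although it achieves high benchmark scores, its real value lies in operational persistence and token efficiency: it uses 30% fewer "thinking tokens" than its predecessor while maintaining comparable performance. The model is explicitly not a general-purpose chatbot. It is designed for Codex-like environments and shows its full value when paired with development tools.

In the competitive landscape, Codex-Max trails Anthropic's Claude Opus 4.5 slightly on SWE-bench Verified (77.9% vs. 80.9%) but leads on other coding evaluations. Its real differentiator against competitors such as Google's Gemini 3 Pro is the combination of long-horizon autonomy, native Windows support, and integration with OpenAI's developer ecosystem. Pricing reflects this premium positioning, with output costing $10 per 1M tokens. That is substantially higher than general-purpose models, but it can be justified for high-value software engineering work.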
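The overview describes compaction as a behavior trained into the model itself, and the source does not document its internals. The sketch below is therefore only an application-level analogy, not OpenAI's mechanism: when a running history exceeds a token budget, older entries are folded into a single summary while the most recent entries are kept verbatim. The functions `count_tokens`, `compact`, and `naive_summary` are hypothetical, and the word-count tokenizer is a deliberate simplification.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def compact(history: List[str], budget: int, keep_recent: int,
            summarize: Callable[[List[str]], str]) -> List[str]:
    """Return history unchanged if it fits the budget; otherwise replace
    everything except the last `keep_recent` entries with one summary entry."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(entry) for entry in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


def naive_summary(entries: List[str]) -> str:
    """Placeholder summarizer; a real system would call a model here."""
    return f"[summary of {len(entries)} earlier steps]"


if __name__ == "__main__":
    steps = ["read repo layout", "ran failing test", "patched parser", "reran tests"]
    print(compact(steps, budget=8, keep_recent=2, summarize=naive_summary))
```

The design choice worth noting is that recent entries are never summarized, so the agent always sees its latest actions exactly as they happened.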

Benchmarks & performance

GPT-5.1-Codex-Max performs strongly on software engineering benchmarks, particularly on autonomous, long-running coding tasks. With the new "xhigh" reasoning effort setting, which allows longer thinking time, it reaches 77.9% on SWE-bench Verified, a key benchmark that tests the ability to resolve real-world software engineering issues. This improves on its predecessor, GPT-5.1-Codex (73.7% at "high" effort), while using 30% fewer thinking tokens.

**Benchmark performance (from OpenAI)**

| Benchmark | GPT-5.1-Codex (high) | GPT-5.1-Codex-Max (xhigh) | Improvement (points) |
|-----------|----------------------|----------------------------|-------------|
| SWE-bench Verified (n=500) | 73.7% | 77.9% | +4.2 |
| SWE-Lancer IC SWE | 66.3% | 79.9% | +13.6 |
| Terminal-Bench 2.0 | 52.8% | 58.1% | +5.3 |

Additional category scores from BenchLM's provisional analysis indicate strong performance in specific domains:

- **Math**: 97.2/100, ranked 4th
- **Reasoning**: 88.8/100, ranked 6th
- **Multimodal**: 89.2/100, ranked 9th

The model's most notable strength is agentic coding (77.5/100 in BenchLM's agentic category), where it can work autonomously over long periods. For long-running tasks, OpenAI has observed the model working continuously for more than 24 hours in internal evaluations, maintaining consistent progress through context compaction.

*Note: scores differ across benchmark sources. Airank.dev reports 48.5% on SWE-rebench and 60.4% on Terminal Bench 2.0, which shows how strongly benchmark methodology affects results.*
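The improvement figures in the benchmark table are differences in percentage points, not relative percentages. As a small worked check, the sketch below recomputes both from the scores quoted above. The dictionary `SCORES` and the function `improvement` are illustrative names only.

```python
from typing import Dict, Tuple

# (GPT-5.1-Codex at "high", GPT-5.1-Codex-Max at "xhigh"), as quoted in the table.
SCORES: Dict[str, Tuple[float, float]] = {
    "SWE-bench Verified": (73.7, 77.9),
    "SWE-Lancer IC SWE": (66.3, 79.9),
    "Terminal-Bench 2.0": (52.8, 58.1),
}


def improvement(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent),
    each rounded to one decimal place."""
    points = round(after - before, 1)
    relative = round((after - before) / before * 100, 1)
    return points, relative


if __name__ == "__main__":
    for name, (before, after) in SCORES.items():
        points, relative = improvement(before, after)
        print(f"{name}: +{points} points ({relative}% relative)")
```

The distinction matters when comparing sources, since a gain of a few points on a low baseline is a much larger relative change than the same gain on a high one.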

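Detailed comparison

**GPT-5.1-Codex-Max vs Claude Opus 4.5 (Anthropic):**

- **Performance**: Claude Opus 4.5 leads on SWE-bench Verified (80.9% vs. 77.9%), but Codex-Max is stronger on long-running autonomous tasks
- **Price**: Codex-Max costs $1.25/$10 per 1M tokens (API); Claude Opus 4.5 pricing is not published, but Claude Code starts at $17 per month
- **Context**: Codex-Max offers unlimited context through compaction, whereas Claude has a fixed 200K-token window
- **Strengths**: Codex-Max suits autonomous refactoring that runs for hours; Claude Opus 4.5 requires fewer code rewrites (30% less rework)
- **Use cases**: Choose Codex-Max for repository-scale migrations; choose Claude for more nuanced code understanding

**GPT-5.1-Codex-Max vs Gemini 3 Pro (Google):**

- **Performance**: Codex-Max leads on Terminal-Bench 2.0 (58.1% vs. 54.2%), but Gemini leads in other areas
- **Context**: Both offer large context windows, but compaction gives Codex-Max effectively unlimited context
- **Ecosystem**: Codex-Max is deeply integrated with OpenAI's Codex CLI and tools; Gemini 3 Pro offers tight integration with Google Cloud
- **Price**: Direct comparison is difficult because Google's pricing structure differs
- **Speed**: Codex-Max streams at an average of 83 tokens/second, with a time to first token of 1170 ms

**GPT-5.1-Codex-Max vs Cursor/Devin AI:**

- **Architecture**: Codex-Max is a model, whereas Cursor and Devin are agentic coding platforms
- **Integration**: Codex-Max works within existing developer workflows via CLI/IDE; Devin offers browser-based automation
- **Control**: Codex-Max offers explicit control over reasoning effort (none/medium/high/xhigh); competitors offer coarser-grained control
- **Windows support**: Codex-Max is the first OpenAI model trained for Windows; most alternatives require Linux/Mac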
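The speed figures quoted above (an average of 83 tokens/second and 1170 ms to first token) support a rough estimate of how long a streamed response takes. This is a back-of-the-envelope sketch that assumes a constant streaming rate, which real responses will not follow exactly. The function name `estimated_stream_seconds` is hypothetical.

```python
TTFT_SECONDS = 1.170       # time to first token, as quoted in the comparison
TOKENS_PER_SECOND = 83.0   # average streaming rate, as quoted in the comparison


def estimated_stream_seconds(output_tokens: int,
                             ttft: float = TTFT_SECONDS,
                             rate: float = TOKENS_PER_SECOND) -> float:
    """Estimate wall-clock seconds to stream `output_tokens` tokens,
    assuming a fixed startup delay followed by a constant token rate."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if rate <= 0:
        raise ValueError("rate must be positive")
    return ttft + output_tokens / rate


if __name__ == "__main__":
    for n in (100, 830, 5_000):
        print(f"{n} tokens: about {estimated_stream_seconds(n):.1f} s")
```

For long agentic sessions the startup delay becomes negligible and the total is dominated by the token rate.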

Community reception

Developer reactions have been mixed but positive overall. Reddit users report impressive results, with one developer calling the model 'epic' after using it to write a 64-bit SMP operating system with over 100,000 lines of code. The model's ability to handle massive, complex systems has surprised many in the developer community.

OpenAI internally reports widespread adoption: 95% of its engineers use Codex weekly, and these engineers ship roughly 70% more pull requests since adoption. This suggests strong productivity gains for software development teams.

Some criticism focuses on the model's naming (GPT-5.1-Codex-Max xhigh) being overly complex, and on practical concerns about the $10/1M output token pricing at scale. Developers note that while the model excels at autonomous work, it requires careful monitoring during long sessions to prevent 'giving up' or destructive changes.

The cybersecurity community has noted Codex-Max's defensive capabilities: it is OpenAI's most capable cybersecurity model to date, though below OpenAI's 'High' capability threshold. OpenAI has already disrupted cyber operations attempting to misuse its models, indicating both the model's power and the real-world security implications.

Adoption patterns show developers using Codex-Max for maintenance and technical debt reduction rather than greenfield projects. The model works best in teams where it can handle routine implementation while humans focus on architecture and complex business logic.

Use cases

**1. Large-Scale Codebase Refactoring:** Point Codex-Max at a legacy codebase (e.g., a 15-year-old PHP application) and specify migration to a modern framework. It will analyze the architecture, create migration plans with dependency ordering, incrementally refactor modules while maintaining backward compatibility, implement tests, and document breaking changes. Ideal for framework migrations, dependency updates, and architectural modernization.

**2. Deep Debugging and Technical Debt Remediation:** When facing intermittent test failures, race conditions, or complex bugs that span multiple files, Codex-Max can work for hours, iteratively testing hypotheses and fixing issues. It excels at untangling legacy data pipelines, fragile domain layers, and problems that would 'eat an afternoon of senior developer time.'

**3. Security Vulnerability Remediation:** Upload security scan results (SAST/DAST findings), and Codex-Max will systematically analyze each vulnerability in context, implement fixes following OWASP best practices, add security tests to prevent regression, and work through hundreds of findings autonomously. Best for teams with accumulated security debt.

**4. Project Scaffolding and Initial Implementation:** For new projects, provide a specification of the tech stack and requirements, and Codex-Max can complete initial setup (including authentication, database migrations, CI/CD pipelines, and deployment configurations) in 45-90 minutes rather than 8-12 human hours. Works best for well-defined projects with clear specifications.

**When to Choose Over Alternatives:**

- Choose Codex-Max over Claude Code when: working on Windows, needing longer autonomous operation (>4 hours), or requiring explicit reasoning effort control
- Choose over Cursor/Devin when: working with existing CLI/IDE workflows, needing model-level access for custom integrations, or requiring 400K+ context handling
- Choose over general models (GPT-5.1, etc.) when: the task requires sustained autonomous work, repository-scale understanding, or specialized coding agent behavior