OpenAI GPT-5.1-Codex-Max

Name: OpenAI GPT-5.1-Codex-Max
Author: OpenAI

OpenAI GPT-5.1-Codex-Max is a programming-focused foundation model developed by OpenAI. Its 400K context window makes it well suited to working with large codebases.

Parameters

Undisclosed

Context

400K

License

Proprietary

Release Date

2025-11-19

API Pricing

Not listed in this summary; the detailed specifications below give $1.25 / 1M input tokens and $10.00 / 1M output tokens.

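Strengths

・Advanced coding capabilities
・Large 400K context window
・Optimized by OpenAI

Weaknesses

・Restricted by a closed-source license
・Limited external access
・Closed terms of use

Use Cases

・Large-scale codebase analysis
・Complex program implementation
・Advanced bug fixing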

In-Depth Analysis

Release Date

November 19, 2025

Context Window

Effectively unlimited (compaction)

Input Price

$1.25 / 1M tokens

Output Price

$10.00 / 1M tokens

Cached Input

$0.625 / 1M tokens

SWE-bench Verified

77.9% (xhigh)

Terminal-Bench 2.0

58.1%

SWE-Lancer IC SWE

79.9%

Autonomous Operation

24+ hours continuous

Throughput

58.4 tok/s avg (11-110 range)

Strengths

・First model with context compaction: effectively unlimited context through iterative summarization
・SWE-bench Verified 77.9% with 30% fewer thinking tokens than its predecessor
・Autonomous operation for 24+ hours on complex tasks
・Native Windows support, a first for an OpenAI coding model
・Configurable reasoning effort (none/medium/high/xhigh) for cost/quality tradeoffs
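
The iterative summarization behind context compaction can be pictured with a toy loop. This is only a sketch under stated assumptions: the page does not describe OpenAI's actual mechanism, `truncate_summary` is a hypothetical stand-in for a real summarizer (it simply keeps the first few words), and tokens are approximated by whitespace-separated words.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def truncate_summary(text: str, max_words: int = 8) -> str:
    """Hypothetical summarizer stub: keeps only the first few words."""
    return " ".join(text.split()[:max_words])


def compact(history: List[str], budget: int,
            summarize: Callable[[str], str] = truncate_summary,
            keep_recent: int = 2) -> List[str]:
    """Fold the oldest messages into one summary until the history fits the budget."""
    history = list(history)  # work on a copy
    while (sum(count_tokens(m) for m in history) > budget
           and len(history) > keep_recent + 1):
        # Merge the two oldest entries and replace them with a single summary.
        merged = summarize(history[0] + " " + history[1])
        history = [merged] + history[2:]
    return history


# Five 10-word messages (50 "tokens") squeezed into a 30-token budget.
messages = ["a b c d e f g h i j"] * 5
compacted = compact(messages, budget=30)
print(len(compacted), sum(count_tokens(m) for m in compacted))
```

Because each pass re-summarizes an earlier summary, detail from the oldest messages is progressively lost while the most recent messages stay verbatim.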

Weaknesses

・High latency (2,060ms avg TTFT) with significant variability (169.3% CV)
・Context compaction can 'blur' details over very long sessions
・METR evaluation suggests 80% reliability time-horizon is ~2 hours, not 24
・Follows instructions very literally — may not recognize obvious typos
・Higher code churn compared to Claude Code (30% more reworks)
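
The coefficient of variation (CV) is the standard deviation divided by the mean, so the two latency figures in the first bullet imply a standard deviation. A quick check, assuming the 169.3% CV refers to time to first token (TTFT):

```python
mean_ttft_ms = 2060.0  # average time to first token, from the list above
cv = 1.693             # 169.3% coefficient of variation

# CV = std / mean, so std = CV * mean
std_ttft_ms = cv * mean_ttft_ms
print(f"implied TTFT standard deviation: {std_ttft_ms:.0f} ms")
```

A standard deviation well above the mean suggests a long-tailed distribution, in which some requests take several times the average to start streaming.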

Competitor Comparison

Model	Arena score	SWE-bench Verified	GPQA	Price (input/output per 1M tokens)
Claude Opus 4.5	~1450	80.9%	~92%	$15/$75
Gemini 3 Pro	~1420	76.2%	~90%	$3.50/$10.50
GPT-5.1 Codex	~1400	73.7%	~88%	$1.25/$10
Cursor (model varies)	N/A	Varies	N/A	$20/month subscription

Overview

GPT-5.1-Codex-Max is OpenAI's frontier agentic coding model, released November 19, 2025. Its headline feature is context compaction, which gives it effectively unlimited context. It achieves 77.9% on SWE-bench Verified with 30% fewer thinking tokens than its predecessor and can operate autonomously for over 24 hours. It replaced GPT-5.1-Codex as the default across all Codex surfaces.

Benchmarks and Performance

GPT-5.1-Codex-Max achieves SWE-bench Verified 77.9% at xhigh reasoning (76.5% at high), SWE-Lancer IC SWE 79.9%, and Terminal-Bench 2.0 58.1%. It uses 30% fewer thinking tokens than GPT-5.1-Codex while achieving better results. The context compaction system reduces overall tokens by 20-40% in long sessions. Average throughput is 58.4 tok/s with high variability (11-110 range). Claude Opus 4.5 leads on SWE-bench at 80.9%.
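
To put the throughput range in perspective, streaming time can be estimated as output tokens divided by tokens per second. The 2,000-token response size below is an arbitrary illustration, not a figure from the source, and the estimate ignores time to first token.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response at a constant rate, ignoring time to first token."""
    return output_tokens / tokens_per_second


response_tokens = 2_000  # hypothetical response size

# Throughput figures quoted above: 11 (low), 58.4 (average), 110 (high) tok/s.
for label, tps in [("slowest", 11.0), ("average", 58.4), ("fastest", 110.0)]:
    print(f"{label:>7}: {generation_seconds(response_tokens, tps):6.1f} s")
```

The same response can take roughly ten times longer at the bottom of the range than at the top, which is why the variability matters as much as the average.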

Detailed Comparison

GPT-5.1-Codex-Max trails Claude Opus 4.5 on SWE-bench (77.9% vs 80.9%) but offers effectively unlimited context through compaction vs Claude's 200K. It beats Gemini 3 Pro (76.2%) on SWE-bench. The model has higher code churn than Claude Code but offers native Windows support and 24+ hour autonomous operation. At $1.25/$10 per 1M tokens, it is significantly cheaper than Claude Opus ($15/$75) while being competitive on benchmarks.
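
The price gap can be made concrete with a small cost calculation. The prices are the ones quoted in this document; the workload (2M input tokens, 200K output tokens) is hypothetical, and the estimate ignores cached-input discounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD, as quoted in this document."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total USD cost for a given number of input and output tokens."""
        return (input_tokens * self.input_per_million
                + output_tokens * self.output_per_million) / 1_000_000


CODEX_MAX = Pricing(input_per_million=1.25, output_per_million=10.00)
CLAUDE_OPUS = Pricing(input_per_million=15.00, output_per_million=75.00)

# Hypothetical workload for illustration only.
input_tokens, output_tokens = 2_000_000, 200_000

codex_cost = CODEX_MAX.cost(input_tokens, output_tokens)
opus_cost = CLAUDE_OPUS.cost(input_tokens, output_tokens)
print(f"Codex-Max: ${codex_cost:.2f}  Claude Opus: ${opus_cost:.2f}")
```

The ratio depends on the input/output mix: input tokens are 12 times cheaper at these prices and output tokens 7.5 times cheaper, so input-heavy workloads see the largest savings.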

Community Reception

Developers report it as a 'significant advancement' over GPT-5.1-Codex, with one developer successfully creating a 64-bit SMP OS with 100K+ lines of code. 95% of OpenAI engineers reportedly use Codex weekly, shipping ~70% more PRs. The main criticisms are high latency, literal instruction following, and context compaction causing detail loss over time. The model is preferred for large autonomous coding tasks but Claude is preferred for interactive development.

Use Cases

Best for:

・Autonomous code generation and refactoring over extended periods
・Large-scale project scaffolding
・Multi-file architecture changes
・Continuous integration and testing pipelines
・Windows-native development
・Complex debugging sessions

The configurable reasoning effort makes it flexible: use 'medium' for everyday tasks and 'xhigh' for the hardest problems (race conditions, legacy systems).

Not ideal for:

・Quick completions (overkill)
・Sub-5-minute tasks
・Interactive pair programming
・Security-critical code, where Claude's lower churn is preferred
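
The guidance above can be expressed as a small routing helper. This is a sketch only: the function name and its parameters are hypothetical, and it encodes just the three rules stated in the text (skip sub-5-minute tasks, 'medium' for everyday work, 'xhigh' for the hardest problems).

```python
from typing import Optional


def pick_reasoning_effort(estimated_minutes: float, hardest: bool = False) -> Optional[str]:
    """Return a reasoning effort level, or None when the task is a poor fit for this model."""
    if estimated_minutes < 5:
        return None      # sub-5-minute tasks are described as overkill
    if hardest:
        return "xhigh"   # e.g. race conditions, legacy systems
    return "medium"      # everyday tasks


print(pick_reasoning_effort(2))                  # quick completion
print(pick_reasoning_effort(30))                 # everyday task
print(pick_reasoning_effort(120, hardest=True))  # hardest problems
```

Returning `None` rather than a low effort level keeps the "not ideal" cases explicit, so a caller can route them to a different tool.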