이 모델의 강점은 무엇인가요?

Overwhelming parameter scale 1 million token long-context comprehension Pursuing advanced reasoning capabilities

이 모델의 약점은 무엇인가요?

Non-public closed model Limited license Detailed performance metrics not disclosed

어떤 용도에 가장 적합한가요?

Executing complex logical reasoning Analyzing ultra-large-scale data Automating advanced problem-solving

모델 목록으로

Alibaba독점

Qwen3-Max-Thinking

Name: Qwen3-Max-Thinking
Author: Alibaba

Qwen3-Max-Thinking is an inference model developed by Alibaba. It has a large-scale configuration of approximately 10 trillion parameters and a very long context window of 1 million tokens.

파라미터

10000.0B

컨텍스트

1000K

라이선스

Proprietary

출시일

2026-01-26

API 가격

이 모델의 API 가격 정보는 현재 공개되지 않았습니다

강점

・Overwhelming parameter scale
・1 million token long-context comprehension
・Pursuing advanced reasoning capabilities

약점

・Non-public closed model
・Limited license
・Detailed performance metrics not disclosed

활용 사례

・Executing complex logical reasoning
・Analyzing ultra-large-scale data
・Automating advanced problem-solving

심층 분석

Release Date

January 23, 2026

Parameters

Proprietary (undisclosed)

Context Window

262,144 tokens

Architecture

Decoder-only with extended thinking

Input Price

$0.78/1M tokens

Output Price

$3.90/1M tokens

GPQA Diamond

87.4

SWE-bench Verified

75.3

HLE (w/ tools)

49.8

API Model Name

qwen3-max-2026-01-23

강점

・Competitive with GPT-5.2-Thinking and Claude-Opus-4.5 on 19 established benchmarks
・Adaptive tool-use: autonomously invokes Search, Memory, and Code Interpreter
・Excellent value: $0.78/$3.90 pricing is much cheaper than GPT-5.2 and Claude Opus 4.5
・100% reliability rate across evaluated benchmarks — never fails to produce output
・Strong on C-Eval (93.7), HLE with tools (49.8), and coding tasks (97th percentile)

약점

・General knowledge is a notable weakness (23rd percentile on broad factual recall)
・Trails GPT-5.2-Thinking on MMLU-Pro (85.7 vs 87.4) and GPQA (87.4 vs 92.4)
・Now partially superseded by Qwen3.7-Max as the flagship reasoning model
・Test-time scaling adds latency and token cost for heavy mode
・SWE-bench 75.3% trails Claude Opus 4.5 (80.9%) and GPT-5.2 (80.0%)

경쟁사 비교

Model	Arena	SWE	GPQA	Price
GPT-5.2-Thinking	~1500	80.0	92.4	Proprietary
Claude-Opus-4.5	~1490	80.9	87.0	Proprietary
Gemini 3 Pro	~1480	76.2	91.9	Proprietary
Qwen3-Max-Thinking	~1450	75.3	87.4	$0.78/$3.90
DeepSeek V3.2	~1430	73.1	82.4	Proprietary

개요

Qwen3-Max-Thinking is Alibaba's flagship reasoning model from the Qwen3 generation, released January 23, 2026. It achieves competitive performance with GPT-5.2-Thinking and Claude-Opus-4.5 across 19 benchmarks while offering significantly lower pricing ($0.78/$3.90 per 1M tokens). Key innovations include adaptive tool-use capabilities and an experience-cumulative test-time scaling strategy that boosts reasoning through iterative self-reflection.

벤치마크 및 성능

Competitive across reasoning: GPQA Diamond 87.4, HMMT Feb 2025 98.0, HMMT Nov 2025 94.7, IMOAnswerBench 83.9, LiveCodeBench v6 85.9, HLE with tools 49.8. Knowledge: MMLU-Pro 85.7, C-Eval 93.7. Coding: SWE-Verified 75.3. Instruction following: IFBench 70.9, MultiChallenge 63.3. Long context: AA-LCR 68.7. Test-time scaling pushes GPQA from 90.3 to 92.8, LiveCodeBench from 88.0 to 91.4, and IMO-AnswerBench from 89.5 to 91.5.

상세 비교

Positions between GPT-5.2-Thinking and DeepSeek V3.2 on most benchmarks. On C-Eval (93.7), it beats all competitors including GPT-5.2 (90.5) and Claude Opus (92.2). On HLE with tools (49.8), it leads all competitors. However, it trails on GPQA (87.4 vs GPT-5.2's 92.4) and SWE-bench (75.3 vs Claude Opus' 80.9). At $0.78/$3.90, it is significantly cheaper than GPT-5.2 and Claude Opus pricing.

커뮤니티 평가

Well-received as a strong reasoning model at a competitive price point. The adaptive tool-use feature (autonomous search, memory, code interpreter) was praised as genuinely useful rather than gimmicky. The 100% reliability rate is valued for production use. Now somewhat overshadowed by Qwen3.7-Max but still relevant for cost-conscious reasoning workloads. Compatible with Claude Code via Anthropic API protocol.

활용 사례

Best for complex reasoning tasks, mathematical problem solving, scientific analysis, and coding where extended thinking improves results. The adaptive tool-use makes it suitable for research assistants and knowledge-intensive workflows. At $0.78/$3.90, it offers the best reasoning-per-dollar among models in its quality tier. For the absolute best reasoning, Qwen3.7-Max or GPT-5.5 are stronger but more expensive. Available via Alibaba Cloud Model Studio and OpenRouter.