Alibaba, 2026-05-22

Qwen3.7 Complete Guide: Max vs Plus Benchmarks, Pricing, and API Integration

What is Qwen3.7?

On May 20, 2026, Alibaba officially unveiled Qwen3.7-Max at the Alibaba Cloud Summit. Designed as a foundation model for the agent era, it goes beyond conversational AI: it can write and debug code, automate office workflows, and sustain autonomous execution across hundreds or thousands of steps.

The Qwen3.7 series includes two models:

Model	Positioning	Availability
Qwen3.7-Max	Flagship. Strongest agent capabilities	API only (closed-source)
Qwen3.7-Plus	High-performance balanced variant	API only (closed-source)

Unlike Qwen3.6-27B and Qwen3.6-35B-A3B, which are open-source under Apache 2.0, the 3.7 series is currently available only via API.

Arena AI Rankings

Qwen3.7-Max-Preview appeared on Arena AI (formerly LMArena) on May 19, 2026, immediately drawing attention.

Text Overall Ranking: #13 (between GPT 5.5 and Grok 4.2), #1 among domestic (Chinese) models

Vision Ranking: Qwen3.7-Plus-Preview at #16

[Figure: Qwen3.7-Max Arena AI rankings]

[Figure: Qwen3.7-Max sub-category rankings]

According to Artificial Analysis, Qwen3.7-Max scored 56.6 overall — approaching GPT, Claude, and Gemini's strongest models, ranking #1 domestically and #5 globally.

Detailed Benchmark Scores

BenchLM Overall Assessment

Per BenchLM.ai, Qwen3.7-Max scored 92/100 overall, ranking #3 of 117 models. Arena Elo: 1475.

Category	Score	Ranking
Coding	92.2	#4
Reasoning	96.4	—
Agentic	87.7	—
Knowledge	86.8	#9
Multilingual	88.2	#10
Instruction Following	93.6	#7

Arena Elo Breakdown

Category	Elo	Votes
Text Overall	1475	3,741
Coding	1525	1,135
Math	1499	218
Hard Prompts	1496	2,546
Multi-turn	1484	648

Head-to-Head: Qwen3.7 vs Claude, GPT, DeepSeek

Coding Agents

Benchmark	Qwen3.7-Max	Claude Opus 4.6	DeepSeek V4 Pro	GPT-5.5
SWE-Pro	60.6	—	—	—
SWE-Multilingual	78.3	—	—	—
SWE-Verified	80.4	80.8	80.6	—
Terminal-Bench 2.0	69.7	—	67.9	—
SciCode	53.5	—	—	—

On SWE-Verified, Qwen3.7-Max (80.4) trails Claude Opus 4.6 Max (80.8) and DeepSeek V4 Pro Max (80.6) by less than half a point. On Terminal-Bench 2.0, it leads DeepSeek V4 Pro Max by 1.8 points (69.7 vs. 67.9).

General-Purpose Agents

Benchmark	Qwen3.7-Max	Claude Opus 4.6	GLM 5.1	Kimi K2.6
MCP-Mark	60.8	—	57.5	—
MCP-Atlas	76.4	75.8	—	—
SkillsBench	59.2	—	—	56.2
BFCL-V4	75.0	—	—	—
SpreadSheetBench-v1	87.0	—	—	—
KernelBench L3	1.98x / 96%	—	—	—

Qwen3.7-Max edges out Claude Opus 4.6 on MCP-Atlas (76.4 vs. 75.8) and leads GLM 5.1 on MCP-Mark (60.8 vs. 57.5) and Kimi K2.6 on SkillsBench (59.2 vs. 56.2).

Reasoning

Benchmark	Qwen3.7-Max	Claude Opus 4.6	DeepSeek V4 Pro
GPQA Diamond	92.4	91.3	—
HLE	41.4	40.0	—
HMMT 2026 Feb	97.1	96.2	—
IMOAnswerBench	90.0	—	89.8
Apex	44.5	—	38.3

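Qwen3.7-Max outperforms Claude Opus 4.6 on all three reasoning benchmarks where both scores are reported (GPQA Diamond, HLE, and HMMT 2026 Feb), by margins of 0.9 to 1.4 points. Its GPQA Diamond score of 92.4 is among the highest publicly reported.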

General Capabilities & Multilingual

Benchmark	Qwen3.7-Max	DeepSeek V4 Pro
IFBench	79.1	77.0
WMT24++	85.8	—
MAXIFE	89.2	—
SuperGPQA	73.6	—

The 35-Hour Experiment: The Most Important Result

Beyond benchmark scores, the most striking achievement is Qwen3.7-Max's 35-hour fully autonomous optimization task.

The Task

Alibaba tasked Qwen3.7-Max with optimizing an inference kernel on the T-Head ZW-M890 — a chip the model had never seen during training. No hardware documentation, no profiling data, no example kernels were provided. Only a task description, the existing SGLang implementation, and an evaluation script.

Results

Duration: 35 hours continuous (zero human intervention)
Tool calls: 1,158
Kernel evaluations: 432
Outcome: 10.0x geometric mean speedup over the Triton reference

The model maintained a coherent optimization strategy throughout. After 30+ hours, it was still finding meaningful improvements — demonstrating that long-horizon autonomous optimization is not just feasible but productive.
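The headline figure above is a geometric mean, which averages speedup ratios in log space so that no single workload dominates. The following is a minimal sketch of that calculation, assuming speedup is defined as baseline time divided by optimized time. The function name and the timings are hypothetical and are not taken from Alibaba's evaluation script.

```python
import math

def geometric_mean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-workload speedups (baseline time / optimized time)."""
    if len(baseline_ms) != len(optimized_ms) or not baseline_ms:
        raise ValueError("need two non-empty lists of equal length")
    ratios = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    # Average in log space so a 2x gain and a 0.5x regression cancel out.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical timings in milliseconds, NOT measurements from the experiment.
baseline = [8.0, 20.0, 50.0]
optimized = [2.0, 2.0, 2.0]

speedup = geometric_mean_speedup(baseline, optimized)
print(f"{speedup:.2f}x")  # per-workload ratios are 4x, 10x, and 25x
```

With these made-up inputs the arithmetic mean of the ratios would be 13x, while the geometric mean is 10x, which is why the latter is the more conservative summary for mixed workloads.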

Comparison with Other Models

Model	Speedup	Notes
Qwen3.7-Max	10.0x	Completed 35 hours
GLM 5.1	7.3x	—
Kimi K2.6	5.0x	—
DeepSeek V4 Pro	3.3x	Stopped early
Qwen3.6-Plus	1.1x	Stopped early

Models that stopped early did so because they issued no tool calls for five consecutive rounds — the model itself concluded it could no longer make progress.
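The stopping rule described above can be sketched as a simple check over the per-round tool-call counts. This is an illustrative reconstruction, assuming a "round" is one model turn and that the harness records how many tool calls each turn issued. The function names and the sample history are hypothetical.

```python
def should_stop(tool_calls_per_round, idle_limit=5):
    """Return True once the last `idle_limit` rounds all issued zero tool calls."""
    if len(tool_calls_per_round) < idle_limit:
        return False
    return all(n == 0 for n in tool_calls_per_round[-idle_limit:])

def round_of_stop(tool_calls_per_round, idle_limit=5):
    """1-based index of the round at which the run would be halted, or None."""
    for i in range(1, len(tool_calls_per_round) + 1):
        if should_stop(tool_calls_per_round[:i], idle_limit):
            return i
    return None

# Hypothetical history: a short idle stretch, one more call, then five idle rounds.
history = [3, 2, 0, 0, 1, 0, 0, 0, 0, 0, 4]
stop_round = round_of_stop(history)
print(stop_round)
```

Note that the two idle rounds early in the history do not trigger the rule, because the streak is reset by the tool call in round 5.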

KernelBench L3

On KernelBench L3, Qwen3.7-Max produced accelerated kernels for 96% of scenarios, second only to Claude Opus 4.6 (98%):

Model	Accelerated Kernel Rate
Claude Opus 4.6	98%
Qwen3.7-Max	96%
Kimi K2.6	80%
GLM 5.1	78%
DeepSeek V4 Pro	54%

YC-Bench: Startup Management Simulation

Qwen3.7-Max also excelled at YC-Bench, which simulates a full year-long startup lifecycle requiring hundreds of decisions across personnel management, contract screening, and malicious client identification.

Model	Total Revenue	Tasks Completed
Qwen3.7-Max	$2.08M	237
Qwen3.6-Plus	$1.05M	—
Qwen3.5-Plus	$352K	—

Qwen3.7-Max earned roughly 2x the revenue of Qwen3.6-Plus and roughly 6x that of Qwen3.5-Plus.
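The multiples follow directly from the revenue column in the table. A quick check, using only the figures reported above (the dictionary and variable names are illustrative):

```python
# Total revenue per model, in US dollars, as listed in the YC-Bench table.
revenue_usd = {
    "Qwen3.7-Max": 2_080_000,
    "Qwen3.6-Plus": 1_050_000,
    "Qwen3.5-Plus": 352_000,
}

top = revenue_usd["Qwen3.7-Max"]
vs_36_plus = round(top / revenue_usd["Qwen3.6-Plus"], 2)
vs_35_plus = round(top / revenue_usd["Qwen3.5-Plus"], 2)

print(vs_36_plus, vs_35_plus)
```

The exact ratios come out slightly under 2 and slightly under 6, so the rounded multiples are fair summaries.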

API Pricing

Qwen3.7-Max is available via Alibaba Cloud Model Studio.

Item	Price
Input tokens	$2.50 / 1M tokens
Output tokens	$7.50 / 1M tokens
Context window	1M tokens

This is significantly cheaper than Claude Opus 4.6 ($15 input / $75 output per 1M tokens): one-sixth the input cost and one-tenth the output cost.
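Because the input and output discounts differ, the effective saving depends on the mix of tokens in a request. The sketch below estimates per-request cost from the list prices quoted above. The dictionary keys are plain labels rather than API model IDs, and the 200k-input / 20k-output workload is a hypothetical example.

```python
# List prices in US dollars per 1M tokens, as quoted in this article.
PRICES_PER_MILLION = {
    "qwen3.7-max": {"input": 2.50, "output": 7.50},
    "claude-opus-4.6": {"input": 15.00, "output": 75.00},
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in US dollars for one request at list price."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000

# Hypothetical agentic request: long context in, short answer out.
qwen_cost = request_cost("qwen3.7-max", 200_000, 20_000)
opus_cost = request_cost("claude-opus-4.6", 200_000, 20_000)

print(f"Qwen3.7-Max: ${qwen_cost:.2f}")
print(f"Claude Opus 4.6: ${opus_cost:.2f}")
print(f"Ratio: {opus_cost / qwen_cost:.1f}x")
```

For an input-heavy request like this one, the overall ratio lands closer to the 6x input discount than to the 10x output discount. Output-heavy workloads shift it toward 10x.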

API Integration

OpenAI-Compatible API

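import os
from openai import OpenAI

# Read the key from the environment rather than hard-coding it.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3.7-max",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted linked lists."}],
    extra_body={"enable_thinking": True},
    stream=True,
)

# stream=True returns an iterator of chunks; print the answer text as it arrives.
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)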

Claude Code Integration

Qwen APIs support the Anthropic API protocol, enabling direct use with Claude Code:

export ANTHROPIC_MODEL="qwen3.7-max"
export ANTHROPIC_SMALL_FAST_MODEL="qwen3.7-max"
export ANTHROPIC_BASE_URL=https://dashscope-intl.aliyuncs.com/apps/anthropic
export ANTHROPIC_AUTH_TOKEN=<your_api_key>

claude

OpenClaw Integration

curl -fsSL https://molt.bot/install.sh | bash
export DASHSCOPE_API_KEY=<your_api_key>
openclaw dashboard

Qwen Code

npm install -g @qwen-code/qwen-code@latest
qwen

The preserve_thinking Feature

Qwen3.7-Max supports preserve_thinking, which retains thinking content from all preceding turns in messages. This is recommended for agentic tasks where maintaining reasoning consistency across long multi-turn conversations is critical.
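The article does not show how preserve_thinking is enabled. The sketch below assumes it is passed in extra_body alongside enable_thinking, mirroring the OpenAI-compatible example earlier. That placement is an assumption, and build_chat_request is a hypothetical helper, so check the exact field name and location against the Model Studio documentation before relying on it.

```python
def build_chat_request(messages, preserve_thinking=True):
    """Assemble keyword arguments for client.chat.completions.create().

    ASSUMPTION: preserve_thinking is sent in extra_body next to
    enable_thinking. The article only states that the feature exists;
    the placement shown here is not confirmed by the source.
    """
    return {
        "model": "qwen3.7-max",
        "messages": list(messages),  # copy so the caller's history is untouched
        "extra_body": {
            "enable_thinking": True,
            "preserve_thinking": preserve_thinking,
        },
    }

history = [
    {"role": "user", "content": "Profile the kernel and list the hot loops."},
]
request = build_chat_request(history)
print(request["extra_body"])

# With a configured OpenAI-compatible client, the request would be sent as:
# completion = client.chat.completions.create(**request)
```

Keeping the request assembly in one helper makes it easy to switch the flag off for short, single-turn calls where retaining earlier reasoning adds tokens without benefit.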

Qwen Release Timeline

Qwen3.7-Max arrived as the third consecutive monthly flagship release:

Date	Model	Theme
Feb 2026	Qwen3.5-Max	Native multimodal agent
Mar 30, 2026	Qwen3.5-Omni	Full-modality support
Apr 2, 2026	Qwen3.6-Plus	Agent programming enhancement
Apr 16, 2026	Qwen3.6-35B-A3B	MoE open-source
Apr 22, 2026	Qwen3.6-27B	Dense model open-source
May 20, 2026	Qwen3.7-Max	Agent era benchmark

Six releases between February and May 2026 is an unusually fast cadence for a single model family.

Conclusion

Qwen3.7-Max represents Alibaba's most capable model for agent-driven workflows. Key takeaways:

Rankings: #1 domestic and #5 global on Artificial Analysis; #13 overall and #1 domestic on the Arena AI text leaderboard
Reasoning: GPQA Diamond 92.4, surpassing Claude Opus 4.6 (91.3)
Coding: SWE-Pro 60.6, Terminal-Bench 2.0 69.7
35-hour autonomous experiment: 1,158 tool calls, 10x speedup
Pricing: $2.50/$7.50 per 1M input/output tokens (about 1/6 and 1/10 of Claude Opus 4.6 pricing, respectively)
Context: 1M tokens
Integration: Claude Code, OpenClaw, Qwen Code supported

In the agent era, Qwen3.7-Max sets a new standard — not just for being smart, but for being able to work autonomously for extended periods without losing coherence.
