Qwen3.7 Complete Guide: Max vs Plus Benchmarks, Pricing, and API Integration

What is Qwen3.7?

On May 20, 2026, Alibaba officially unveiled Qwen3.7-Max at the Alibaba Cloud Summit. Designed as a foundation model for the agent era, it goes beyond conversational AI: it can write and debug code, automate office workflows, and sustain autonomous execution across hundreds or thousands of steps.

The Qwen3.7 series includes two models:

| Model | Positioning | Availability |
|---|---|---|
| Qwen3.7-Max | Flagship. Strongest agent capabilities | API only (closed-source) |
| Qwen3.7-Plus | High-performance balanced variant | API only (closed-source) |

Unlike Qwen3.6-27B and Qwen3.6-35B-A3B, which are open-source under Apache 2.0, the 3.7 series is currently available only via API.


Arena AI Rankings

Qwen3.7-Max-Preview appeared on Arena AI (formerly LMArena) on May 19, 2026, immediately drawing attention.

Text Overall Ranking: #13 (between GPT-5.5 and Grok 4.2), #1 among domestic (Chinese) models

Vision Ranking: Qwen3.7-Plus-Preview at #16

[Figure: Qwen3.7-Max Arena AI rankings]

[Figure: Qwen3.7-Max sub-category rankings]

According to Artificial Analysis, Qwen3.7-Max scored 56.6 overall, approaching the strongest GPT, Claude, and Gemini models. That places it #1 among Chinese models and #5 globally.


Detailed Benchmark Scores

BenchLM Overall Assessment

Per BenchLM.ai, Qwen3.7-Max scored 92/100 overall, ranking #3 of 117 models. Arena Elo: 1475.

| Category | Score | Ranking |
|---|---|---|
| Coding | 92.2 | #4 |
| Reasoning | 96.4 | - |
| Agentic | 87.7 | - |
| Knowledge | 86.8 | #9 |
| Multilingual | 88.2 | #10 |
| Instruction Following | 93.6 | #7 |

Arena Elo Breakdown

| Category | Elo | Votes |
|---|---|---|
| Text Overall | 1475 | 3,741 |
| Coding | 1525 | 1,135 |
| Math | 1499 | 218 |
| Hard Prompts | 1496 | 2,546 |
| Multi-turn | 1484 | 648 |
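To get a feel for what a gap of a few dozen points means, a rating difference can be converted into an expected head-to-head win rate. The sketch below assumes Arena scores follow the conventional Elo scale (a 400-point gap corresponds to 10:1 odds), which the source does not state. The rival rating of 1500 is a hypothetical value chosen for illustration.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected score of model A against model B on the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1475 is the Text Overall rating from the table above;
# 1500 is a hypothetical rival rating, not a figure from the article.
qwen_text_overall = 1475
hypothetical_rival = 1500

rate = expected_win_rate(qwen_text_overall, hypothetical_rival)
print(f"Expected win rate against a 1500-rated model: {rate:.1%}")
```

Under this assumption a 25-point deficit still leaves the lower-rated model winning a little under half of pairwise votes, so small Elo gaps are best read as near-ties.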

Head-to-Head: Qwen3.7 vs Claude, GPT, DeepSeek

Coding Agents

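| Benchmark | Qwen3.7-Max | Claude Opus 4.6 | DeepSeek V4 Pro | GPT-5.5 |
|---|---|---|---|---|
| SWE-Pro | 60.6 | - | - | - |
| SWE-Multilingual | 78.3 | - | - | - |
| SWE-Verified | 80.4 | 80.8 | 80.6 | - |
| Terminal-Bench 2.0 | 69.7 | - | 67.9 | - |
| SciCode | 53.5 | - | - | - |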

On SWE-Verified, Qwen3.7-Max (80.4) is on par with Claude Opus 4.6 Max (80.8) and DeepSeek V4 Pro Max (80.6). On Terminal-Bench 2.0, it outperforms DeepSeek V4 Pro Max (67.9).

General-Purpose Agents

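Comparison models: Claude Opus 4.6, GLM 5.1, Kimi K2.6.

| Benchmark | Qwen3.7-Max | Comparison score |
|---|---|---|
| MCP-Mark | 60.8 | 57.5 |
| MCP-Atlas | 76.4 | 75.8 (Claude Opus 4.6) |
| SkillsBench | 59.2 | 56.2 (Kimi K2.6) |
| BFCL-V4 | 75.0 | - |
| SpreadSheetBench-v1 | 87.0 | - |
| KernelBench L3 | 1.98x / 96% | - |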

On MCP-Atlas, Qwen3.7-Max (76.4) edges out Claude Opus 4.6 (75.8). On SkillsBench, it scores 59.2 and surpasses Kimi K2.6 (56.2).

Reasoning

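Comparison models: Claude Opus 4.6, DeepSeek V4 Pro.

| Benchmark | Qwen3.7-Max | Comparison score |
|---|---|---|
| GPQA Diamond | 92.4 | 91.3 (Claude Opus 4.6) |
| HLE | 41.4 | 40.0 |
| HMMT 2026 Feb | 97.1 | 96.2 |
| IMOAnswerBench | 90.0 | 89.8 |
| Apex | 44.5 | 38.3 |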

Qwen3.7-Max consistently outperforms Claude Opus 4.6 on reasoning benchmarks. GPQA Diamond 92.4 is among the highest publicly reported scores.

General Capabilities & Multilingual

| Benchmark | Qwen3.7-Max | DeepSeek V4 Pro |
|---|---|---|
| IFBench | 79.1 | 77.0 |
| WMT24++ | 85.8 | - |
| MAXIFE | 89.2 | - |
| SuperGPQA | 73.6 | - |

The 35-Hour Experiment: The Most Important Result

Beyond benchmark scores, the most striking achievement is Qwen3.7-Max's 35-hour fully autonomous optimization task.

The Task

Alibaba tasked Qwen3.7-Max with optimizing an inference kernel on the T-Head ZW-M890, a chip the model had never seen during training. No hardware documentation, profiling data, or example kernels were provided. The model received only a task description, the existing SGLang implementation, and an evaluation script.

Results

  • Duration: 35 hours continuous (zero human intervention)
  • Tool calls: 1,158
  • Kernel evaluations: 432
  • Outcome: 10.0x geometric mean speedup over the Triton reference

The model maintained a coherent optimization strategy throughout. After more than 30 hours it was still finding meaningful improvements, which suggests that long-horizon autonomous optimization is productive, not merely feasible.
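The headline speedup is reported as a geometric mean, which averages ratios multiplicatively so that one very large per-case speedup cannot dominate the result. A minimal sketch of that calculation follows. The per-kernel speedups are made-up illustration values, not data from Alibaba's run, and the function name is hypothetical.

```python
import math


def geometric_mean(speedups: list[float]) -> float:
    """Geometric mean of positive speedup ratios, computed in log space."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical per-case speedups over a reference kernel.
example_speedups = [2.0, 8.0, 4.0]

print(f"geometric mean:  {geometric_mean(example_speedups):.2f}x")
print(f"arithmetic mean: {sum(example_speedups) / len(example_speedups):.2f}x")
```

With these sample values the arithmetic mean is pulled upward by the largest case, while the geometric mean stays at the middle ratio.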

Comparison with Other Models

| Model | Speedup | Notes |
|---|---|---|
| Qwen3.7-Max | 10.0x | Completed 35 hours |
| GLM 5.1 | 7.3x | - |
| Kimi K2.6 | 5.0x | - |
| DeepSeek V4 Pro | 3.3x | Stopped early |
| Qwen3.6-Plus | 1.1x | Stopped early |

Runs marked as stopped early ended because the model issued no tool calls for five consecutive rounds. In other words, the model itself concluded it could no longer make progress.
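The early-stop rule can be sketched as a simple check over the number of tool calls issued in each round. The source does not describe the evaluation harness, so the function name, the list-of-counts input, and the `patience` parameter are illustrative assumptions.

```python
def should_stop(tool_calls_per_round: list[int], patience: int = 5) -> bool:
    """Return True once the most recent `patience` rounds all issued zero tool calls."""
    if len(tool_calls_per_round) < patience:
        return False
    return all(count == 0 for count in tool_calls_per_round[-patience:])


# Hypothetical history: the model is active for three rounds, then goes idle.
history = [4, 2, 1]
for idle_round in range(1, 6):
    history.append(0)
    print(f"idle rounds: {idle_round}, stop: {should_stop(history)}")
```

Because only the trailing window is inspected, a single tool call resets the count, which matches the "consecutive rounds" wording.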

KernelBench L3

On KernelBench L3, Qwen3.7-Max produced accelerated kernels for 96% of scenarios:

| Model | Accelerated Kernel Rate |
|---|---|
| Claude Opus 4.6 | 98% |
| Qwen3.7-Max | 96% |
| GLM 5.1 | 78% |
| Kimi K2.6 | 80% |
| DeepSeek V4 Pro | 54% |

YC-Bench: Startup Management Simulation

Qwen3.7-Max also excelled at YC-Bench, which simulates a full year-long startup lifecycle requiring hundreds of decisions across personnel management, contract screening, and malicious client identification.

| Model | Total Revenue | Tasks Completed |
|---|---|---|
| Qwen3.7-Max | $2.08M | 237 |
| Qwen3.6-Plus | $1.05M | - |
| Qwen3.5-Plus | $352K | - |

Qwen3.7-Max earned roughly 2x the revenue of its predecessor, Qwen3.6-Plus, and roughly 6x that of Qwen3.5-Plus.


API Pricing

Qwen3.7-Max is available via Alibaba Cloud Model Studio.

| Item | Price |
|---|---|
| Input tokens | $2.50 / 1M tokens |
| Output tokens | $7.50 / 1M tokens |
| Context window | 1M tokens |

This is significantly cheaper than the Claude Opus 4.6 rates quoted here ($15 input / $75 output per 1M tokens): roughly 1/6 the input cost and 1/10 the output cost.
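To turn the per-million rates into a per-request estimate, multiply each token count by its rate. The sketch below uses the Qwen3.7-Max prices listed above. The function name and the example token counts are hypothetical, and it ignores any tiered or cached-token pricing, which the article does not mention.

```python
QWEN_INPUT_USD_PER_M = 2.50   # input price per 1M tokens, from the pricing table
QWEN_OUTPUT_USD_PER_M = 7.50  # output price per 1M tokens, from the pricing table


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate: float = QWEN_INPUT_USD_PER_M,
    output_rate: float = QWEN_OUTPUT_USD_PER_M,
) -> float:
    """Estimated request cost in USD at flat per-million-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical agent session: 200K tokens in, 50K tokens out.
qwen_cost = estimate_cost_usd(200_000, 50_000)
# Same session at the Claude Opus 4.6 rates quoted in this article.
opus_cost = estimate_cost_usd(200_000, 50_000, input_rate=15.0, output_rate=75.0)

print(f"Qwen3.7-Max:     ${qwen_cost:.3f}")
print(f"Claude Opus 4.6: ${opus_cost:.3f}")
```

The blended saving depends on the input-to-output mix, since the input and output price ratios differ.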


API Integration

OpenAI-Compatible API

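import os

from openai import OpenAI

client = OpenAI(
    # Read the key from the environment instead of hard-coding it.
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3.7-max",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted linked lists."}],
    extra_body={"enable_thinking": True},
    stream=True,
)

# With stream=True the call returns an iterator of chunks,
# so print the answer text as it arrives.
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)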

Claude Code Integration

Qwen APIs support the Anthropic API protocol, enabling direct use with Claude Code:

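export ANTHROPIC_MODEL="qwen3.7-max"
export ANTHROPIC_SMALL_FAST_MODEL="qwen3.7-max"
export ANTHROPIC_BASE_URL="https://dashscope-intl.aliyuncs.com/apps/anthropic"
# Replace the placeholder with your key; the quotes stop the shell from reading < and > as redirection.
export ANTHROPIC_AUTH_TOKEN="<your_api_key>"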

claude

OpenClaw Integration

curl -fsSL https://molt.bot/install.sh | bash
export DASHSCOPE_API_KEY="<your_api_key>"
openclaw dashboard

Qwen Code

npm install -g @qwen-code/qwen-code@latest
qwen

The preserve_thinking Feature

Qwen3.7-Max supports preserve_thinking, which retains thinking content from all preceding turns in messages. This is recommended for agentic tasks where maintaining reasoning consistency across long multi-turn conversations is critical.
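As a rough illustration, the flag could be set when assembling each request in a multi-turn loop. The article does not show where `preserve_thinking` goes, so passing it through `extra_body` next to `enable_thinking` is an assumption here, and `build_chat_request` is a hypothetical helper. Check the Model Studio documentation before relying on this shape.

```python
def build_chat_request(messages: list[dict], preserve_thinking: bool = True) -> dict:
    """Assemble keyword arguments for an OpenAI-compatible chat completion call."""
    return {
        "model": "qwen3.7-max",
        "messages": messages,
        # Assumption: both flags travel in extra_body; the source does not confirm this.
        "extra_body": {
            "enable_thinking": True,
            "preserve_thinking": preserve_thinking,
        },
    }


# Hypothetical multi-turn history for an agentic task.
conversation = [
    {"role": "user", "content": "Profile the kernel and propose an optimization."},
    {"role": "assistant", "content": "The inner loop is memory-bound; tiling should help."},
    {"role": "user", "content": "Apply the tiling change and re-run the benchmark."},
]

request_kwargs = build_chat_request(conversation)
print(request_kwargs["extra_body"])
```

With an OpenAI-compatible client, the result would be passed as `client.chat.completions.create(**request_kwargs)`.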


Qwen Release Timeline

Qwen3.7-Max arrived as the third consecutive monthly flagship release:

[Figure: Qwen3.7-Max release timeline]

| Date | Model | Theme |
|---|---|---|
| Feb 2026 | Qwen3.5-Max | Native multimodal agent |
| Mar 30, 2026 | Qwen3.5-Omni | Full-modality support |
| Apr 2, 2026 | Qwen3.6-Plus | Agent programming enhancement |
| Apr 16, 2026 | Qwen3.6-35B-A3B | MoE open-source |
| Apr 22, 2026 | Qwen3.6-27B | Dense model open-source |
| May 20, 2026 | Qwen3.7-Max | Agent era benchmark |

This monthly cadence of flagship releases is unprecedented in the industry.


Conclusion

Qwen3.7-Max represents Alibaba's most capable model for agent-driven workflows. Key takeaways:

  1. Rankings: Arena AI #13 in text overall and #1 among Chinese models; Artificial Analysis #5 globally, approaching GPT, Claude, Gemini
  2. Reasoning: GPQA Diamond 92.4, surpassing Claude Opus 4.6 (91.3)
  3. Coding: SWE-Pro 60.6, Terminal-Bench 69.7
  4. 35-hour autonomous experiment: 1,158 tool calls, 10x speedup
  5. Pricing: $2.50/$7.50 per 1M tokens (about 1/6 the input price and 1/10 the output price quoted for Claude Opus 4.6)
  6. Context: 1M tokens
  7. Integration: Claude Code, OpenClaw, Qwen Code supported

In the agent era, Qwen3.7-Max sets a new standard — not just for being smart, but for being able to work autonomously for extended periods without losing coherence.
