Open Source2026-05-13

StepFun Open-Sources Lightning-Fast MoE Model Step-3.5-Flash: 350 tokens/s with Only 11B Active Parameters

Chinese AI startup StepFun has released its foundational language model, Step-3.5-Flash, to the open-source community. The model's primary claim to fame is its successful combination of frontier-level performance and extremely high inference speed, a design explicitly optimized for agentic applications, potentially marking a significant milestone for the practical deployment of generative AI.

Sparse MoE Architecture: The Core Technology Balancing Scale and Efficiency

The key to Step-3.5-Flash's remarkable efficiency lies in its adoption of a Sparse Mixture-of-Experts (MoE) architecture. Instead of a single monolithic network, MoE employs multiple parallel "expert" sub-networks. It dynamically activates only the necessary experts based on each input token, drastically reducing computational load.

While the model's total parameter count is a massive 196 billion, the active parameters used for any single token's inference are just 11 billion, approximately 5.6% of the total. This achieves an ideal balance between scale and speed: the model retains the vast "knowledge" of 196B parameters while requiring the computational resources of only an 11B model for "thinking." This yields two key advantages:

Enhanced Memory Efficiency: Compared to dense models that require loading all parameters, the required memory footprint is substantially reduced.
Improved Computational Speed: With fewer active operations, it enables significantly faster token generation on the same hardware.

Commanding Performance Data: More Than Just Speed

According to the published technical details and benchmarks, the model's strength extends beyond pure velocity.

Inference Speed Impact

Standard Throughput: 100-300 tokens/s
Peak Throughput (e.g., for coding): Up to 350 tokens/s

These figures are likely measured on a setup like a single A100 80GB GPU. Compared to dense models of similar class that often operate in the tens of tokens per second, this represents nearly an order-of-magnitude speedup.

General Performance & Agent Optimization

Performance-wise, the model scores on par with or above recent competitors in China, such as models from the Kimi K2.5 and Qwen2.5 series. Standard tests like MMLU (common sense reasoning), GSM8K (math), and HumanEval (coding) demonstrate minimal performance degradation from the MoE approach.

Furthermore, the model is designed for "reliable agentic intelligence." Beyond single-turn Q&A, it is optimized to demonstrate high success rates and consistency in complex workflows involving multi-step reasoning, tool use, and planning.

Another Key to Blazing Speed: Multi-Token Prediction (MTP-3)

The secret behind Step-3.5-Flash's speed isn't just MoE. The adoption of Multi-Token Prediction (MTP), specifically "MTP-3" which predicts three tokens at once, is critically important. While traditional transformers predict only the next token, MTP predicts multiple future tokens in parallel. This dramatically improves throughput (tokens generated per unit time) while maintaining latency for the first token.

The astonishing "350 tokens/s" figure is thus the result of a powerful combination: the computational savings from MoE coupled with the throughput boost from MTP-3.

Industry Impact: The Rise of Chinese Open-Source MoE

This release underscores several clear trends in the current AI industry:

1. "Efficiency" Becomes the New Battleground

The competition is shifting from mere parameter count to "performance per unit compute cost." The focus is no longer solely on pursuing the highest accuracy, but on how to achieve practical speeds cost-effectively.

2. Strategic Open-Sourcing by Chinese Players

Following Meta's Llama series, Chinese firms like Qwen, Yi, DeepSeek, and now StepFun are actively releasing high-performance models. This is a strategic move to integrate the developer community into their ecosystems and establish industry standards.

3. Acceleration of AI Agent Practicality

The biggest bottleneck for agents has been the latency at each thinking step. Widespread adoption of ultra-fast models like Step-3.5-Flash could allow complex-task agents to operate within realistic timeframes, significantly lowering the barrier to practical use.

Actionable Insights for Developers

To leverage this technological wave, here are three recommended actions:

Conduct Quantitative Evaluation: Obtain the model from GitHub (stepfun-ai/Step-3.5-Flash) and benchmark it against your existing models for your specific use cases (e.g., summarization, code generation) to verify the favorable speed-quality trade-off.
Integrate into Agent Pipelines: Combine it with frameworks like LlamaIndex or LangChain to test multi-step tasks like "Web search → Summarization → Report generation." The responsiveness boost can dramatically improve user experience (UX).
Study State-of-the-Art Efficiency Tech: This model is a "live textbook" for cutting-edge techniques like MoE and MTP. Experiment with quantization (GPTQ, AWQ) and knowledge distillation to see how much you can reduce its size, building expertise for optimizing your own models.

Conclusion and Outlook

Step-3.5-Flash vividly demonstrates that "efficiency" is the paramount challenge for the practical application of LLMs.

Looking ahead, as this architecture is applied to even larger models (e.g., 1T+ parameters), we may enter an era of "GPT-4-class intelligence at real-time speeds." Should it become a de facto standard for a high-speed "reasoning engine," the seamless integration of AI into human business workflows will explode. For developers, this powerful new tool presents an excellent opportunity to accelerate the development of practical applications.

Comments (0)

Share:X Hatena

Back to Blog