Open Source2026-06-02

StepFun's Step 3.7 Flash: An Agent-Ready High-Efficiency Model at 1/9 the Cost of Claude Opus 4.6

In 1492, Columbus set sail into the deep Atlantic. An ocean voyage obviously required speed, but what determined whether a fleet could actually reach the other shore was whether its fresh water, food, hull, masts, and rigging could survive the long storms. It was this unromantic engineering logic that reshaped transoceanic trade. Later, the Dutch designed the "Fluyt" merchant ship: cheap to build, lightly crewed, with a large cargo hold, capable of steady return voyages across the Atlantic. Ocean travel transformed from the lonely courage of adventurers into a replicable, calculable, and scalable business.

Today's competition among AI models stands at a similar crossroads. In recent years, when people talked about models, they tended to talk about parameters, leaderboards, and peak performance. But after using coding agents like Claude Code and Codex, APPSO came to feel that as AI agents move toward production environments, the questions that truly matter have shifted somewhat: can it sustainably handle high-frequency requests, can it reliably call tools, can it understand complex interfaces, and can it be embedded into existing enterprise workflows and run over the long term. The answers to these questions are often not found on benchmark leaderboards.

Recently, StepFun officially released and open-sourced Step 3.7 Flash. As a next-generation Flash model aimed at production-grade agents, it primarily targets agents, coding, search, and multimodal workflows. The timing of its arrival captures this very juncture. What a production-grade agent needs is not simply speed and cheapness — reliability, ease of use, ease of deployment, and the ability to deliver results day after day in real workflows matter even more.

A Flash Model is No Longer Just a Flagship Stand-In

Traditionally, Flash models were seen as lightweight versions of flagship models, and their selling points were speed and price alone. But as agents become the core of workflows, the role of a Flash model has changed. When a model tends to drift off-target across multi-step tasks, neither enterprises nor individuals can adopt it with confidence. Conversely, only a model that can balance speed, cost, tool calling, multimodal understanding, and ecosystem compatibility can become a truly dependable foundational capability for an agent system.

In a sense, the Flash model the agent era demands has evolved from a "faster, smaller model" into "the foundation model with the highest production efficiency." It has to reach the capability ceiling of flagship models while withstanding the efficiency pressure of large-scale agent invocations. Step 3.7 Flash is positioned squarely as the latter — a next-generation agentic foundation model.

The first hurdle for a production-grade agent is understanding the real work environment. A large share of agent tasks shuttle between complex interfaces, office documents, chart systems, browser pages, specialized software, and internal tools. An agent that only excels at text question-answering will struggle to handle these tasks.

What Step 3.7 Flash focuses on strengthening is native multimodal understanding and execution. It can make sense of UIs, charts, documents, images, and application interfaces, and can autonomously crop, zoom in on, and re-read images when tackling complex visual problems. When information is uncertain, the model can also proactively initiate searches and cross-check text and image information.

There is a counterintuitive design philosophy at work here. For a Flash model with 11B active parameters, cramming vast amounts of visual knowledge into the weights is a poor return on investment. StepFun took the opposite approach: leave only the best reasoning engine inside the weights, and externalize the boundaries of perception and world knowledge to the inference stage, leveraging extremely high speed to "look a few more times and check a few more times" to make up for the capability that "parameters alone can't quite cover." Low latency and high throughput are no longer just deployment advantages here — they become part of the capability itself, both clever and shrewd. For example, in this cockpit operation demo, the user simply inputs "how to take off," and the model automatically frames the cockpit area, identifies instruments, buttons, and key operating information, understands the operating logic of the current interface, and generates a step-by-step tutorial.

The point here is not merely that it can recognize a cockpit image, but that it can turn a dense, unfamiliar, and heavily context-dependent visual environment into a task guide that a person can follow. There is a world of difference between being able to understand something and being able to teach someone how to actually do it.

Step 3.7 Flash was also integrated into a mobile GUI agent flow and demoed on a vivo smartphone. With the phone connected to a Mac via USB and ADB debugging authorization enabled, the terminal can capture the phone's current screenshot and mirror it via scrcpy. The script then sends that screenshot to Step 3.7 Flash, asking the model to judge what is happening on the screen.

For example, we had Step 3.7 Flash look at the WeChat Reading trending list on the phone. Rather than simply reading the text on the page, it also understood the chart structure: which entries were book titles, which were covers, what the current ranking was, how many people were reading each, and which book each recommendation score corresponded to. The significance of this capability is that an agent faces real apps, not tidied-up screenshots — it has to first understand the page before it can help the user search for books, compare popularity, organize the chart, or even carry out the next action.

Next, we dropped it into a Meituan "Little Judge"-style page and had it handle a merchant appeal scenario. The page simultaneously contains user reviews, image evidence, merchant responses, and handling buttons like "user is more reasonable" or "merchant is more reasonable." For the model, this is not simple OCR — it is comprehension of a business flow: who is complaining, what the dispute is, what the evidence is, and what the platform allows next. When a multimodal agent enters a real workflow, what it encounters is precisely this kind of interface mixing text, images, judgments, and action entry points.

Switching to the Blender scene, when the user inputs "how do I delete this box," the model identifies Blender's interface structure, layers, toolbar, and current edit state, then provides the steps to delete the specified box.

Let's also look at application interface design analysis. When the user asks it to "explain what's interesting about these designs," the model identifies the information within different images, understands the relationships among design elements, and produces a professional analysis.

Another key capability of Step 3.7 Flash is enhanced networking and visual search. The problems agents encounter in real business often involve dynamic information, external materials, multi-source evidence, and incomplete inputs. When a model relies solely on its own internal knowledge, it tends to fail on timeliness and accuracy.

The "Ruishilou" demo is a typical example. The model first reads the visible clues from the user-uploaded image, generates search keywords around those clues, uses a web scraping tool to investigate external materials, and finally integrates the visual information from the image with the textual information from the web to construct a complete answer.

Here, search is no longer just returning a list of web links — it is proactively seeking, filtering, cross-referencing, and organizing evidence around the task goal. This is exactly the way of working that search agents and research agents truly need.

According to official data, Step 3.7 Flash has shown performance approaching that of larger flagship models on complex visual task benchmarks such as SimpleVQA Search and V* (Python). This means the model can keep pushing a task forward even when information is incomplete, reducing the number of unverified answers.

Running 40 Agents at Once — The True Shape of a Large Model

The difference between an agent and an ordinary chatbot is the density of invocations. A standard Q&A interaction is a single exchange, but for an agent to complete a task, it needs to repeatedly observe the environment, call tools, and read results. A coding agent reads code, modifies files, and runs commands; a search agent searches, verifies, and organizes information; an office agent handles spreadsheets, documents, and email. Once the number of calls rises sharply, model speed and cost become system-level issues.

Step 3.7 Flash adopts a sparse MoE architecture, with total parameters of 196B plus a 1.8B ViT, while its active parameters are only 11B and its peak generation speed reaches 400 Tokens/s. For high-frequency agents, coding agents, search agents, multimodal agents, and enterprise knowledge-work agents, this means more rounds of observation, invocation, and reasoning can be completed within the same amount of time.

For example, Step 3.7 Flash can build an agent cluster, with 40 virtual personas of different identities role-playing as a product review team, making parallel judgments on a product question and aggregating their preferences across 5 MVP directions in real time.

This is where the value of running agents in batch lies. In the past, the cost and latency of a single model doing a single analysis were still tolerable. But once an enterprise runs dozens of agents at the same time — each playing the role of a user, expert, salesperson, product manager, operations staff, or customer support — throughput capacity instantly becomes a prerequisite. If speed is insufficient, feedback slows down; if cost is too high, scalability simply does not hold.

Similarly, when agents collaboratively build a large knowledge graph in real time and in parallel, that too qualifies as a high-frequency, multi-step task. The model's value lies not only in generation speed but also in completing more observations, searches, and inferences per unit of time.

Let's look at another information-gathering task. We tossed it a single line: "I want to write an overview of autonomous driving, so go research four directions separately — technical routes, policies and regulations, market structure, and representative companies." This kind of task looks like mere material collection, but in practice it triggers multiple rounds of searching, source verification, content classification, and structured output. The longer the task chain and the denser the calls, the more easily differences in model throughput get amplified.

The immediate impression Step 3.7 Flash gives is speed — but the quality did not drop alongside the speed. It gathered materials across the four directions from the entire web and slotted each into its corresponding section: the technical routes was explained clearly, the policy/regulation and market-structure information was also separated out, and it did not mash different directions together into one blob. Every layer that structured output should have was there.

Worth noting is that Step 3.7 Flash delivers extremely high cost-performance per completed task, and is especially friendly to high-frequency task patterns like agents. A single agent task involves decomposition, search, web-page reading, tool calling, result comparison, and output organization, and the number of calls is far higher than a standard Q&A exchange. A difference in single-call cost gets rapidly amplified across a complete task chain.

According to official data, with Advisor Mode enabled, Step 3.7 Flash's coding capability reaches 97% of Claude Opus 4.6's, while the per-task cost is roughly one-ninth of the latter's.

It is precisely for this reason that the value of Step 3.7 Flash cannot be summed up by "fast" alone. Placed inside an agent workload, it solves three problems at once: high throughput shortens wait times, lower per-task cost underpins scalable operation, and coding capability approaching that of top-tier models gives it the chance to enter real workflows and take on sustained, complex tasks.

Furthermore, for an agent to enter a production system, what matters is the ability to call tools reliably. Step 3.7 Flash is optimized for high-reliability tool calling and orchestration. Officially, it can stably invoke APIs, browsers, terminals, Office tools, and external systems across long-horizon, multi-step agent workflows, keeping task trajectories consistent and reducing the probability of task drift and execution failure.

The official figures also disclosed several data points. Step 3.7 Flash achieved 49.5% on Toolathlon, which evaluates multi-tool collaboration; 67.1% on ClawEval 1.1, which evaluates daily autonomous task execution in real environments; and 45.8% on GDPval, which spans 44 occupational tasks. On τ²-bench Telecom, across the low, medium, and high reasoning difficulty tiers, the pass rate exceeded 98% in every case.

Of course, there is an easily underestimated condition for putting agents into production: the model has to adapt to the workflow. A model is typically placed inside a harness wrapped with prompt templates, tool protocols, a browser environment, a file system, a code execution engine, an evaluation set, a permission system, and business flows.

To this end, Step 3.7 Flash has carried out compatibility optimization for mainstream coding and agent tools including Claude Code, Kilo Code, Roo Code, OpenCode, Hermes Agent, and OpenClaw, and also supports tool-calling protocols and development chains such as MCP and Skills.

Developers can thereby easily integrate the model into existing agent frameworks without redesigning the entire flow. For enterprises, the value of adaptation is self-evident: the easier it is for a model to enter existing systems, the shorter the trial and deployment cycle and the lower the engineering cost.

Currently, Step 3.7 Flash has completed integration validation in agent and developer ecosystem projects such as Kilo Code, Nous Research, and Lemonade. StepFun is also pushing adaptation with AI infrastructure and inference platforms such as Fireworks AI, DeepInfra, and Modal Labs, and plans to integrate with overseas model aggregation and developer platforms such as OpenRouter and ZenMux going forward.

🔗 https://huggingface.co/stepfun-ai/Step-3.7-Flash

To date, StepFun has provided entry points for Step 3.7 Flash via its Model Page, GitHub, Hugging Face, ModelScope, the domestic open-platform API, the overseas open-platform API, the Studio online experience, and the StepFun AI App. These entry points serve both developer trials, enterprise API integration, and open-source ecosystem use. More importantly, Step 3.7 Flash supports both cloud and local deployment, and the official side also offers an edge multi-precision version optimized for personal workstations and local environments.

Feedback from overseas developers' hands-on testing complements the official data from a different angle. In a local MoE test comparing DeepSeek V4 Flash, Step 3.7 Flash, and Minimax M2.7, Step 3.7 Flash reached 2123.13 tok/s at agg@64, outperforming the other models.

Another developer mentioned that after writing code with Gemini 3.5 Flash, they had Step 3.7 Flash review it and were able to catch more than 7 small bugs and errors. Whether it is local throughput or code debugging, this actually shows that Step 3.7 Flash has begun entering real development flows and is being recognized by developers as a productivity tool worth using long-term.

Foundation Models Should Be Born for Agents

After experiencing Step 3.7 Flash, APPSO realized that it emphasizes engineering practicality over chasing a benchmark score in some single dimension. Multimodality, web search, tool calling, framework compatibility, local deployment, low cost, high throughput — none of these is fresh on its own, but combined they happen to fill exactly the gaps agents most need in a production environment.

This path is not flashy, but it suits the stage agents are at right now. We used to ask a model how smart it was; in the agent era, the question we really should ask is a different one: who is this model designed for?

The starting points behind these two questions differ. One means the model is optimized for humans — by default it faces a human who can read, wait, and fill in the blanks themselves. You ask one line, it answers one line, a few seconds' delay is fine, and occasional vagueness can be patched up by the human. But an agent is not like that. An agent runs continuously through cycles of observation, invocation, reasoning, and correction; the requests it sends in a day may outnumber what a person says in a year. It will not smooth things over for the model — if the model drifts, the agent drifts with it. A model optimized for humans is not necessarily fit for agents. That is also why the word "Flash" has taken on new meaning in the agent era. It is no longer just a cheap stand-in for the flagship; it has been redesigned from the ground up to fit the temperament of agents.

The characteristics of Step 3.7 Flash correspond precisely to this logic. Native multimodality, because an agent first has to see the task at hand; 400 Tokens/s, because high-frequency calls cannot tolerate slowness; tool-calling stability, because a long-horizon task breaks entirely if one link breaks; and harness adaptation, because no matter how strong a model is, it is a blank page if it cannot get into an existing system.

It was not aimed at the leaderboard — it was aimed at "how an agent can actually get work done efficiently and cost-effectively." From Step 3.5 Flash to Step 3.7 Flash, what StepFun has been strengthening along the way is in fact the same thing: making models born for agents and pushing agents toward scalable commercialization. This will also become an important evolutionary route for models going forward, and Step 3.7 Flash is not yet the endpoint. But it has shown us one shift: when evaluating models in the agent era, we should not just stare at how smart they are — we should look at whether they are willing to settle those tedious engineering accounts, one by one, and make them clear.

What actually changed the world in 1492 was not Columbus's perilous crossing, but rather the later Fluyt merchant ships being able to set out, return, load cargo, and set out again, trip after trip, steadily. The adventurer is responsible for reaching the other shore; the merchant ship is responsible for turning that shore into a route. As model competition arrives at the agent stage, the logic is similar. What truly opens the distance is not just the dazzle of a benchmark score, but the models that let agents set out again and again, arrive reliably, and settle capability into a route.

StepFun Open-Sources the Ultra-Fast MoE Model "Step-3.5-Flash": 11B Active Parameters and a Stunning 350 tokens/s Peak Speed
[Claude Opus 4.7 Review: Anthropic's Strongest AI Model Dominates Agentic Coding, Yet Draws Criticism Over Hidden Cost Increases](/blog/claude-opus-4-7-deep-dive)
2026 AI Price War: The Shock of "Ultra-Low Cost" Pushed by Chinese Players and the Truth Behind Effective Performance

Comments (0)

Share:X Hatena

Back to Blog

StepFun's Step 3.7 Flash: An Agent-Ready High-Efficiency Model at 1/9 the Cost of Claude Opus 4.6

A Flash Model is No Longer Just a Flagship Stand-In

Running 40 Agents at Once — The True Shape of a Large Model

Foundation Models Should Be Born for Agents

Related Articles

Comments (0)

Post a Comment