Benchmark 2026-06-01

Why LLMs Struggle with Video Games: The Gap Between Coding and Gameplay

The Current State of LLMs in Gaming

While Large Language Models (LLMs) have demonstrated breathtaking capabilities in programming and creative writing, they hit a wall when placed in the dynamic environments of video games.

Julian Togelius, Director of the Game Innovation Lab at NYU and co-founder of Modl.ai, says that LLMs "absolutely suck" (his words) at game-based AI benchmarks. In some instances, LLMs have performed worse than even the simplest search algorithms, which highlights a significant gap in their ability to handle interactive environments.
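To make "simplest search algorithms" concrete, here is a minimal sketch of a breadth-first search planner on a toy grid game. The grid, the `step` forward model, and every function name are illustrative assumptions, not taken from any benchmark Togelius refers to. The point is only that a few dozen lines of exhaustive search can solve a small game perfectly once a forward model is available.

```python
from collections import deque

# Hypothetical toy game: 'S' is the start, 'G' the goal, '#' a wall.
GRID = [
    "S.#.",
    "..#G",
    "....",
]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def find(grid, ch):
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == ch:
                return (r, c)
    raise ValueError(f"{ch!r} not in grid")


def step(grid, pos, move):
    """Forward model: the next position, or the same one if the move is blocked."""
    dr, dc = MOVES[move]
    r, c = pos[0] + dr, pos[1] + dc
    if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#":
        return (r, c)
    return pos


def bfs_plan(grid):
    """Return a shortest list of moves from 'S' to 'G', or None if unreachable."""
    start, goal = find(grid, "S"), find(grid, "G")
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        pos, plan = frontier.popleft()
        if pos == goal:
            return plan
        for move in MOVES:
            nxt = step(grid, pos, move)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [move]))
    return None


plan = bfs_plan(GRID)
print(plan)
```

A planner like this knows nothing about language or the wider world. It only queries the game's rules through `step`, which is exactly the kind of baseline the comparison above has in mind.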

"Coding" vs. "Gameplay": A Fundamental Divide

There is a paradoxical capability gap: an LLM-powered tool such as Claude or Cursor can generate a playable version of a classic like Asteroids from a single prompt. However, that same model cannot effectively play the game it just created, nor can it iteratively improve the design based on its own gameplay feedback.

Togelius describes coding as a "well-behaved game." In programming, the task is explicit and the reward—whether the code runs or fails—is immediate and detailed. This makes it an ideal domain for LLM optimization. In contrast, spatial reasoning and real-time decision-making in video games differ fundamentally from the static pattern recognition that language models excel at.
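The difference in feedback can be sketched in a few lines. The example below is a toy illustration, not a model of any real benchmark: `coding_feedback`, `game_feedback`, `buggy_abs`, and the target action sequence are all hypothetical, and the game is assumed, for simplicity, to score only the finished episode. Many real games do provide intermediate signals.

```python
def coding_feedback(func, test_cases):
    """Dense feedback: one pass/fail entry, with details, per test case."""
    report = []
    for args, expected in test_cases:
        try:
            got = func(*args)
            detail = f"got {got!r}, expected {expected!r}"
            report.append({"args": args, "ok": got == expected, "detail": detail})
        except Exception as exc:
            detail = f"raised {type(exc).__name__}: {exc}"
            report.append({"args": args, "ok": False, "detail": detail})
    return report


def game_feedback(actions, target=("right", "right", "jump")):
    """Sparse feedback: a single 0/1 score once the whole episode is over."""
    return 1 if tuple(actions) == tuple(target) else 0


def buggy_abs(x):
    return x  # bug: negative inputs are returned unchanged


# The coding report says which input failed and what value came back.
for entry in coding_feedback(buggy_abs, [((3,), 3), ((-2,), 2)]):
    print(entry)

# The game score says only that the episode failed, not which action was wrong.
print(game_feedback(["right", "jump", "right"]))
```

With the coding report, the failing input and the wrong return value point straight at the bug. With the single game score, the agent has to work out on its own which of its actions caused the failure.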

The Wall Between Specialized AI and General Game AI

We have already seen massive success with specialized AI. Google DeepMind's AlphaZero, for example, completely dominates humans in Go and chess. However, these systems require individual retraining and bespoke designs for each game, which is the opposite of the general-purpose approach LLMs take.

The General Video Game AI Competition has spent seven years testing agents against ten new games per round, proving that building a truly universal game-playing AI is an immense challenge. Togelius warns against the prevailing assumption that because we can build AI to excel at specific games, we are close to building an AI that can play any game.

The Paradox of Game World Complexity

One of the most intriguing arguments put forward by Togelius is that game worlds can actually be more diverse than the real world. In reality, the laws of physics are constant. In gaming, every title introduces entirely different mechanics, input methods, and rules. This extreme variance makes it incredibly difficult for a single model to adapt universally.

Despite these hurdles, there is progress. Reports indicate that Gemini 2.5 Pro successfully navigated Pokémon Blue, suggesting that as models evolve, their capacity to adapt to complex, rule-based environments is gradually improving.

Conclusion

The struggle LLMs face in video games stems not from a lack of raw computing power but from a deficiency in spatial reasoning and in the ability to process immediate, granular reward signals. For the next generation of AI agents to achieve true versatility, they will need more than linguistic reasoning: they will require a form of "embodiment", an interface capable of real-time learning and optimization within dynamic environments.
