Why LLMs Are Still Lost in a Maze: Spatial Reasoning Woes
LLMs struggle with spatial reasoning, showing significant performance gaps. Gemini-2.5-Flash leads, yet fails in visual formats.
JUST IN: The spatial reasoning of foundation models is under the microscope, and it's not looking good. Researchers have put these models through their paces using maze tasks, aiming to evaluate their ability to construct internal spatial world models. The results? Let's just say the labs are scrambling.
The Maze Challenge
In a series of experiments involving models like Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, significant discrepancies in spatial reasoning became evident. Gemini-2.5-Flash managed to ace smaller mazes with 80-86% accuracy using tokenized adjacency representations. But switch to visual grid formats, and the performance plummets to 16-34%. That's a wild difference, showing these models' spatial reasoning is anything but format-agnostic.
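To make the format gap concrete, here is a minimal sketch of what the two representations of the *same* maze might look like. The grid layout, cell naming, and serialization are illustrative assumptions, not the researchers' actual prompt format:

```python
# Hypothetical example: two ways to serialize one small maze for an LLM prompt.
# '.' = open cell, '#' = wall, 'S' = start, 'G' = goal (layout is made up).
GRID = [
    "S.#",
    ".##",
    "..G",
]

def visual_format(grid):
    """Visual grid format: the maze rendered directly as rows of characters."""
    return "\n".join(grid)

def adjacency_format(grid):
    """Tokenized adjacency format: each open cell listed with its open neighbors."""
    rows, cols = len(grid), len(grid[0])
    open_cells = {(r, c) for r in range(rows) for c in range(cols)
                  if grid[r][c] != "#"}
    lines = []
    for r, c in sorted(open_cells):
        # Check the four orthogonal neighbors (up, down, left, right).
        nbrs = [(r + dr, c + dc) for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if (r + dr, c + dc) in open_cells]
        lines.append(f"({r},{c}) -> " + ", ".join(f"({nr},{nc})" for nr, nc in nbrs))
    return "\n".join(lines)

print(visual_format(GRID))
print(adjacency_format(GRID))
```

Both strings encode identical connectivity, yet the findings suggest a model can solve the maze from the adjacency listing while failing on the character grid, which is exactly what "format-agnostic spatial reasoning" would rule out.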
Deeper Problems Uncovered
Here's the kicker: despite achieving 96-99% semantic coverage in their reasoning traces, these models can't translate that understanding into spatial computations. Each question is treated independently; they aren't building the cumulative spatial knowledge we'd hope for. So what's the point of near-perfect semantic understanding if it never carries over to the spatial task?
Why It Matters
This isn't just an academic exercise. The implications for deploying foundation models in tasks that require spatial abstraction are massive. If you're banking on these models for applications like autonomous navigation or complex planning, think again. The leaderboard in AI might be shifting, but not in the direction some expected.
Ultimately, these findings raise a big question: Are current foundation models fundamentally limited in their ability to develop reliable spatial world models? Until they can consistently handle spatial reasoning across formats, they're more like one-trick ponies than the universal solvers we dreamed of.
Key Terms Explained
Claude: Anthropic's family of AI assistants, including Claude Haiku, Sonnet, and Opus.
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.
GPT: Generative Pre-trained Transformer.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.