Multimodal Models Stumble on Human-Like Problem Solving
MazeBench reveals that high accuracy in AI doesn't equate to human-like spatial understanding. Models falter without reasoning budgets.
Multimodal models are often hailed as the next frontier in AI, promising to solve problems like humans. But the reality is starkly different when you strip away the marketing hype. Enter MazeBench, a new benchmark designed to test these models' abilities to tackle visual spatial tasks.
Benchmark Breakdown
MazeBench consists of 110 procedurally generated maze images grouped into nine categories. The benchmark evaluated 16 model configurations from major players like OpenAI, Anthropic, Google, and Alibaba. Notably, GPT-5.4 emerged on top, solving 91% of the tasks, followed by Gemini 3.1 Pro with 79%. Impressive? Not quite.
Here's what the benchmarks actually show: these models rely heavily on converting images to text grids and traversing paths through brute-force search. This method consumes an astounding 1,710 to 22,818 tokens per maze on a task humans breeze through. Without additional reasoning budgets, even the best models score a dismal 2% to 12% on more challenging mazes.
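Once a maze image has been reduced to a text grid, the search itself is trivial classical code, which is what makes the token cost so striking. Here is a minimal sketch of that fallback strategy; the grid encoding and coordinates are illustrative, not MazeBench's actual format:

```python
from collections import deque

def solve_maze(grid, start, goal):
    """Breadth-first search over a text-grid maze.

    grid: list of equal-length strings, '#' = wall, anything else = open.
    Returns a shortest path as a list of (row, col) cells,
    or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    parent = {start: None}  # also serves as the visited set
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

maze = [
    "#####",
    "#S..#",
    "#.#.#",
    "#..G#",
    "#####",
]
path = solve_maze(maze, (1, 1), (3, 3))
```

A few dozen lines of deterministic code do what the models spend thousands of tokens approximating step by step in text.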
The Real Challenge: Human-Like Planning
What does this mean for the dream of AI that can truly think like us? Frankly, it's a wake-up call. When models are told not to use grid or graph search, they end up doing just that. It's a default strategy, revealing their struggle to mimic human-like spatial understanding. A text-grid ablation experiment showed Claude Sonnet 4.6's accuracy jump to 80% when bypassing the visual extraction step altogether. Clearly, the architecture matters more than the parameter count for understanding spatial tasks.
Implications for AI Development
Why should this matter? Because it signals a fundamental gap in AI's progression toward genuine human-like reasoning. Companies touting their models as capable of complex problem-solving might need to reconsider how they define success. High accuracy in AI doesn't necessarily translate to understanding the way a human does.
So, the question becomes: Is it time for AI developers to pivot from scaling up models to refining their reasoning and planning capabilities? Without this shift, AI will remain an impressive token calculator rather than a genuine problem solver.
Key Terms Explained
Anthropic: An AI safety company founded in 2021 by former OpenAI researchers, including Dario and Daniela Amodei.
Benchmark: A standardized test used to measure and compare AI model performance.
Claude: Anthropic's family of AI assistants, including Claude Haiku, Sonnet, and Opus.
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.