Multimodal Models Stumble on Human-Like Problem Solving
MazeBench reveals that high accuracy in AI doesn't equate to human-like spatial understanding. Models falter without reasoning budgets.
Multimodal models are often hailed as the next frontier in AI, promising to solve problems like humans. But the reality is starkly different when you strip away the marketing hype. Enter MazeBench, a new benchmark designed to test these models' abilities to tackle visual spatial tasks.
Benchmark Breakdown
MazeBench consists of 110 procedurally generated maze images grouped into nine categories. The benchmark evaluated 16 model configurations from major players like OpenAI, Anthropic, Google, and Alibaba. Notably, GPT-5.4 emerged on top, solving 91% of the tasks, followed by Gemini 3.1 Pro with 79%. Impressive? Not quite.
Here's what the benchmarks actually show: these models rely heavily on converting images to text grids and traversing paths through brute-force search. This method consumes an astounding 1,710 to 22,818 tokens per maze on a task humans breeze through. Without additional reasoning budgets, even the best models score a dismal 2% to 12% on more challenging mazes.
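Once a maze image has been reduced to a text grid, the search itself is trivial classical code, which is what makes the token cost so striking. Here is a minimal sketch of that fallback strategy; the grid encoding and coordinates are illustrative, not MazeBench's actual format:

```python
from collections import deque

def solve_maze(grid, start, goal):
    """Breadth-first search over a text-grid maze.

    grid: list of equal-length strings, '#' = wall, anything else = open.
    Returns a shortest path as a list of (row, col) cells,
    or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    parent = {start: None}  # also serves as the visited set
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

maze = [
    "#####",
    "#S..#",
    "#.#.#",
    "#..G#",
    "#####",
]
path = solve_maze(maze, (1, 1), (3, 3))
```

A few dozen lines of deterministic code do what the models spend thousands of tokens approximating step by step in text.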
The Real Challenge: Human-Like Planning
What does this mean for the dream of AI that can truly think like us? Frankly, it's a wake-up call. When models are told not to use grid or graph search, they end up doing just that. It's a default strategy, revealing their struggle to mimic human-like spatial understanding. A text-grid ablation experiment showed Claude Sonnet 4.6's accuracy jump to 80% when bypassing the visual extraction step altogether. Clearly, the architecture matters more than the parameter count for understanding spatial tasks.
Implications for AI Development
Why should this matter? Because it signals a fundamental gap in AI's progression toward genuine human-like reasoning. Companies touting their models as capable of complex problem-solving might need to reconsider how they define success. High accuracy in AI doesn't necessarily translate to understanding the way a human does.
So, the question becomes: Is it time for AI developers to pivot from scaling up models to refining their reasoning and planning capabilities? Without this shift, AI will remain an impressive token calculator rather than a genuine problem solver.
Key Terms Explained
Anthropic: An AI safety company founded in 2021 by former OpenAI researchers, including Dario and Daniela Amodei.
Benchmark: A standardized test used to measure and compare AI model performance.
Claude: Anthropic's family of AI assistants, including Claude Haiku, Sonnet, and Opus.
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.