CheeseBench: A Rodent's Maze for AI Models
CheeseBench evaluates large language models on classical behavioral neuroscience tasks. It reveals scaling limits and interface dependencies.
The introduction of CheeseBench offers an intriguing window into the capabilities and limitations of large language models (LLMs). Researchers have developed this benchmark to assess LLMs using nine classic behavioral neuroscience paradigms, including tasks well known in animal research such as the Morris water maze and the radial arm maze. Each task is firmly rooted in peer-reviewed protocols designed for rodents, providing approximate animal baselines.
LLMs in the Maze
Within this unique framework, LLMs face challenges that mirror those encountered by rodents placed in unfamiliar environments. The models receive a uniform system prompt without any task-specific instructions. They must infer goals merely from ASCII text observations and reward signals, replicating a scenario where a rodent must navigate a new apparatus.
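The protocol described above can be sketched as a simple interaction loop. This is a minimal, hypothetical illustration of how such a zero-shot evaluation might work; the names (`MazeEnv`, `query_model`, `run_episode`) are illustrative stand-ins, not CheeseBench's actual API, and the toy one-dimensional maze is far simpler than the real paradigms.

```python
# Hypothetical sketch of the zero-shot protocol: the model receives only a
# generic system prompt, ASCII observations, and scalar rewards. All names
# here are illustrative, not CheeseBench's actual interface.

class MazeEnv:
    """Toy stand-in: the agent starts at cell 0 and must reach the cheese."""
    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.render()

    def render(self):
        cells = ["."] * 4
        cells[3] = "C"               # cheese location
        cells[self.pos] = "R"        # agent ("rodent") location
        return "".join(cells)        # e.g. "R..C"

    def step(self, action):
        if action == "right":
            self.pos = min(self.pos + 1, 3)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        done = self.pos == 3
        reward = 1.0 if done else 0.0
        return self.render(), reward, done


def run_episode(env, query_model, system_prompt, max_steps=20):
    """One trial: the model sees only ASCII frames and reward signals."""
    history = []                     # prior (observation, action, reward) turns
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # The same generic system prompt is used for every task; the model
        # must infer the goal from observations and rewards alone.
        action = query_model(system_prompt, history, observation)
        observation, reward, done = env.step(action)
        history.append((observation, action, reward))
        total_reward += reward
        if done:
            break
    return total_reward
```

In this framing, `query_model` would wrap an actual LLM call; the key point is that no task-specific instructions ever enter the prompt.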
CheeseBench evaluates six open-weight LLMs, ranging from 3 billion to 72 billion parameters, on their ability to interpret ASCII renderings of each apparatus. The models are compared not only against a random baseline but also against a graph-based reinforcement learning agent. Notably, the highest-performing model, Qwen2.5-VL-7B, achieves an average success rate of 52.6% on ASCII inputs, still well short of the 78.9% success rate of the approximate rodent baselines.
Scaling Limits and Interface Effects
While one might expect larger models to outperform their smaller counterparts, the findings here are counterintuitive. Scaling beyond 7 billion parameters results in diminishing returns. A larger context history, surprisingly, also leads to degraded performance. Furthermore, techniques like chain-of-thought prompting, typically used to enhance model reasoning, appear to hinder rather than help in this setting.
Interestingly, a vision-language architecture provides an advantage at the 7 billion parameter level but becomes detrimental at 32 billion parameters. The performance of the same model varies from 20% to 57% purely based on interface parameters, underscoring the complexity of the agent-plus-interface system. It's clear that the interface plays as critical a role as the model architecture itself. Why invest in bigger models if interface tweaks can yield similar gains?
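One way to make that interface sensitivity concrete is a grid search over interface settings for a fixed model. The sketch below is hypothetical: `evaluate` stands in for running the benchmark under one configuration, and the particular knobs (history length, chain-of-thought on/off, rendering style) are assumptions drawn from the factors the article mentions, not CheeseBench's documented options.

```python
# Illustrative sweep over interface parameters for a fixed model.
# evaluate(config) is a hypothetical stand-in that returns an average
# success rate for one interface configuration.

from itertools import product

def sweep_interface(evaluate, history_lengths, cot_options, renderings):
    """Grid-search interface settings and return the best configuration."""
    results = {}
    for hist, cot, render in product(history_lengths, cot_options, renderings):
        config = {"history": hist, "cot": cot, "rendering": render}
        results[(hist, cot, render)] = evaluate(config)
    best = max(results, key=results.get)
    return best, results
```

If the reported 20–57% spread holds generally, a sweep like this is far cheaper than training or serving a larger model, which is the article's core point about where optimization effort should go.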
Implications and Future Directions
These findings present several implications for the future of AI development. Open-weight LLMs, under the current unified zero-shot ASCII protocol, remain significantly below rodent-level performance, particularly in tasks demanding spatial navigation and state tracking within trials. For developers, this highlights the importance of optimizing model-interaction interfaces rather than focusing solely on model scaling.
Should researchers pivot toward more effective interface designs rather than simply increasing model parameters? The evidence from CheeseBench suggests a resounding yes. In a domain where achieving rodent-level intelligence remains the benchmark, understanding and refining how models parse and respond to environment cues is essential.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Parameter: A value the model learns during training — specifically, the weights and biases in neural network layers.
Prompt: The text input you give to an AI model to direct its behavior.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.