CheeseBench: A Rodent's Maze for AI Models
CheeseBench evaluates large language models on classical behavioral neuroscience tasks. It reveals scaling limits and interface dependencies.
The introduction of CheeseBench offers an intriguing window into the capabilities and limitations of large language models (LLMs). Researchers have developed this benchmark to assess LLMs using nine classic behavioral neuroscience paradigms, including tasks well known in animal research such as the Morris water maze and the radial arm maze. Each task is firmly rooted in peer-reviewed protocols designed for rodents, providing approximate animal baselines.
LLMs in the Maze
Within this unique framework, LLMs face challenges that mirror those encountered by rodents placed in unfamiliar environments. The models receive a uniform system prompt without any task-specific instructions. They must infer goals merely from ASCII text observations and reward signals, replicating a scenario where a rodent must navigate a new apparatus.
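The protocol described above can be sketched as a simple interaction loop. This is a minimal, hypothetical illustration of how such a zero-shot evaluation might work; the names (`MazeEnv`, `query_model`, `run_episode`) are illustrative stand-ins, not CheeseBench's actual API, and the toy one-dimensional maze is far simpler than the real paradigms.

```python
# Hypothetical sketch of the zero-shot protocol: the model receives only a
# generic system prompt, ASCII observations, and scalar rewards. All names
# here are illustrative, not CheeseBench's actual interface.

class MazeEnv:
    """Toy stand-in: the agent starts at cell 0 and must reach the cheese."""
    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.render()

    def render(self):
        cells = ["."] * 4
        cells[3] = "C"               # cheese location
        cells[self.pos] = "R"        # agent ("rodent") location
        return "".join(cells)        # e.g. "R..C"

    def step(self, action):
        if action == "right":
            self.pos = min(self.pos + 1, 3)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        done = self.pos == 3
        reward = 1.0 if done else 0.0
        return self.render(), reward, done


def run_episode(env, query_model, system_prompt, max_steps=20):
    """One trial: the model sees only ASCII frames and reward signals."""
    history = []                     # prior (observation, action, reward) turns
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # The same generic system prompt is used for every task; the model
        # must infer the goal from observations and rewards alone.
        action = query_model(system_prompt, history, observation)
        observation, reward, done = env.step(action)
        history.append((observation, action, reward))
        total_reward += reward
        if done:
            break
    return total_reward
```

In this framing, `query_model` would wrap an actual LLM call; the key point is that no task-specific instructions ever enter the prompt.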
CheeseBench evaluates six open-weight LLMs, ranging from 3 billion to 72 billion parameters, on their ability to interpret ASCII renderings of each apparatus. The models are compared not only against a random baseline but also against a graph-based reinforcement learning agent. Notably, the highest-performing model, Qwen2.5-VL-7B, achieves an average success rate of 52.6% on ASCII inputs, still well short of the 78.9% success rate of the approximate rodent baselines.
Scaling Limits and Interface Effects
While one might expect larger models to outperform their smaller counterparts, the findings here are counterintuitive. Scaling beyond 7 billion parameters results in diminishing returns. A larger context history, surprisingly, also leads to degraded performance. Furthermore, techniques like chain-of-thought prompting, typically used to enhance model reasoning, appear to hinder rather than help in this setting.
Interestingly, a vision-language architecture provides an advantage at the 7 billion parameter level but becomes detrimental at 32 billion parameters. The performance of the same model varies from 20% to 57% purely based on interface parameters, underscoring the complexity of the agent-plus-interface system. It's clear that the interface plays as critical a role as the model architecture itself. Why invest in bigger models if interface tweaks can yield similar gains?
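One way to make that interface sensitivity concrete is a grid search over interface settings for a fixed model. The sketch below is hypothetical: `evaluate` stands in for running the benchmark under one configuration, and the particular knobs (history length, chain-of-thought on/off, rendering style) are assumptions drawn from the factors the article mentions, not CheeseBench's documented options.

```python
# Illustrative sweep over interface parameters for a fixed model.
# evaluate(config) is a hypothetical stand-in that returns an average
# success rate for one interface configuration.

from itertools import product

def sweep_interface(evaluate, history_lengths, cot_options, renderings):
    """Grid-search interface settings and return the best configuration."""
    results = {}
    for hist, cot, render in product(history_lengths, cot_options, renderings):
        config = {"history": hist, "cot": cot, "rendering": render}
        results[(hist, cot, render)] = evaluate(config)
    best = max(results, key=results.get)
    return best, results
```

If the reported 20–57% spread holds generally, a sweep like this is far cheaper than training or serving a larger model, which is the article's core point about where optimization effort should go.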
Implications and Future Directions
These findings present several implications for the future of AI development. Open-weight LLMs, under the current unified zero-shot ASCII protocol, remain significantly below rodent-level performance, particularly in tasks demanding spatial navigation and state tracking within trials. For developers, this highlights the importance of optimizing model-interaction interfaces rather than focusing solely on model scaling.
Should researchers pivot toward more effective interface designs rather than simply increasing model parameters? The evidence from CheeseBench suggests a resounding yes. In a domain where achieving rodent-level intelligence remains the benchmark, understanding and refining how models parse and respond to environment cues is essential.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Parameter: A value the model learns during training — specifically, the weights and biases in neural network layers.
Prompt: The text input you give to an AI model to direct its behavior.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.