Why Language Models Struggle With Teamwork

In the race to develop autonomous agents, language models must learn to coordinate effectively over long horizons. However, current evaluations often focus on short, single-agent tasks or rigid multi-agent settings, leaving a critical gap in understanding their ability to function as a team. This is where the alem benchmark comes into play.

The Alem Benchmark

Built in JAX, alem is a benchmark for open-ended multi-agent coordination. It simulates a world with Craftax-like dynamics that combines exploration, crafting, trading, and combat. Alem puts language models in long-horizon survival scenarios that test their ability to allocate roles, communicate, and execute shared plans.

Testing the Limits

Thirteen modern large language models (LLMs) were evaluated zero-shot in homogeneous teams, with MARL-trained agents serving as baselines. Current LLM agents averaged only 6% normalized return on alem tasks, which points to a significant coordination bottleneck. Performance varied, however: Gemini-3.1-Pro-High approached the MARL agents in the toughest coordination settings, while GPT-5.4-High excelled in base-task rewards but faltered in coordination.
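
How the 6% figure is normalized is not specified here. As a rough illustration only, the sketch below assumes one common convention: each task's raw return is rescaled so that a random-policy score maps to 0 and a reference score (for instance, a MARL-trained team) maps to 1, and the per-task values are then averaged. The function names, the choice of baselines, and the numbers in the usage note are all hypothetical and are not taken from alem.

```python
from typing import Iterable, Tuple


def normalized_return(raw: float, random_baseline: float, reference: float) -> float:
    """Rescale a raw return so random_baseline -> 0.0 and reference -> 1.0.

    Assumed convention for illustration; not necessarily the one alem uses.
    """
    span = reference - random_baseline
    if span == 0:
        raise ValueError("reference and random_baseline must differ")
    return (raw - random_baseline) / span


def mean_normalized_return(tasks: Iterable[Tuple[float, float, float]]) -> float:
    """Average the normalized return over (raw, random_baseline, reference) triples."""
    values = [normalized_return(raw, rnd, ref) for raw, rnd, ref in tasks]
    if not values:
        raise ValueError("at least one task is required")
    return sum(values) / len(values)
```

With made-up scores, `mean_normalized_return([(12.0, 10.0, 50.0), (5.0, 4.0, 14.0)])` averages per-task values of roughly 0.05 and 0.10. Under this convention, a low average means the team barely improves on the random baseline relative to the reference, regardless of how large the raw returns look.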

Coordination vs. Individual Competence

The data points to a key insight: individual task prowess does not translate into coordination competence. Communication emerged as the most significant factor in successful coordination, with memory and reasoning playing supporting roles in maintaining multi-step plans. The findings underscore the need for models that can work together, not just excel alone.

So, why should anyone care? As AI systems are expected to handle increasingly complex tasks, understanding and overcoming coordination challenges becomes critical. A model's ability to communicate and coordinate will define its utility in real-world applications. Are we ready to invest in developing these skills, or will we continue to prioritize individual performance over collective success?

The Road Ahead

Alem provides a controlled environment for addressing these challenges, offering a testbed for advancing agents' capabilities in communication and role allocation. As AI continues to evolve, the ability to coordinate effectively may just be the next frontier for language models. It won't happen overnight, but the groundwork is being laid.