Why LLMs Struggle with Teamwork in Open-Ended Tasks

As language models become more autonomous, their ability to work collaboratively over extended periods becomes key. But here's the rub: most current evaluations don't really test these skills in tandem. Instead, they focus on either short interactions or highly structured settings. Enter a new benchmark called alem, which aims to shake things up by testing multi-agent coordination in a more dynamic and open-ended context.

The Alem Benchmark

Built using JAX, alem creates a world where language models must coordinate in tasks like exploration, crafting, trading, and combat. Think of it this way: alem is like a virtual survival game, but the key players are language models. The benchmark offers procedurally generated tasks with varying coordination difficulties. It’s a fresh playground for testing how well these models can plan, communicate, and execute shared strategies.

When 13 modern language models were evaluated on alem, the results were stark. The models only managed to achieve about 6% of the normalized return on average. This isn't just a number. It highlights a glaring gap in current capabilities. If you've ever trained a model, you know hitting such a low benchmark is like getting an F on a test you thought you'd prepared for.

Coordination vs. Individual Competence

Here's where it gets interesting. In the toughest coordination scenarios, some models like the zero-shot Gemini-3.1-Pro-High performed surprisingly well, approaching the performance of Multi-Agent Reinforcement Learning (MARL) agents trained for a billion steps. But even these promising results didn't translate to across-the-board success. For instance, GPT-5.4-High excelled in base-task rewards but stumbled when it came to coordination rewards. This is a clear signal: being good at individual tasks doesn't guarantee you're a team player.

So, what's the bottleneck? Communication. The analysis shows that effective communication is the biggest hurdle to overcome for better coordination. Memory and reasoning abilities are helpful, but mainly to keep multi-step plans on track.

Why This Matters

The analogy I keep coming back to is this: imagine a soccer team with star players who can't pass the ball. Individually, they're great. But in a game, they'd flop. That's the current state of many language models in coordination tasks. Here's why this matters for everyone, not just researchers. If these models are going to be useful in real-world applications that require teamwork, like disaster response or complex negotiations, they need to get better at playing nice with others.

So, what's next? Alem isn't just a test, it’s a challenge to develop agents that can truly communicate, allocate roles, and execute shared plans effectively. And for developers and researchers, it’s an opportunity to reshape how we think about AI capabilities.

In the end, are language models ready to take on the world together? Not yet, but benchmarks like alem are a step in the right direction.