Why Language Models Still Struggle to Chat and Build
CRAFT, a new benchmark, tests language models' ability to cooperate under limited info. Surprising results show bigger isn't always better in AI communication.
When it comes to getting computers to talk to each other and actually collaborate, we're still hitting bumps in the road. Enter CRAFT, a new benchmark that puts large language models to the test in a multi-agent setting. It's not just about chatting: these models have to work together to build a 3D structure, each with only a piece of the puzzle.
The Challenge of Partial Information
In this setup, no single agent can see the whole picture. They all have to communicate in natural language to achieve a shared goal. It sounds simple, right? Yet, the results from CRAFT suggest otherwise. Even with a range of models, including 8 open-weight and 7 frontier models, the task is far from solved.
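To make the partial-information setup concrete, here is a deliberately tiny sketch (a hypothetical simplification, not the actual CRAFT protocol, which uses LLM agents talking in natural language): two agents each hold half of a target 3D blueprint, and neither can finish the build without pooling what the other knows.

```python
# Toy sketch of a partial-information building task (hypothetical
# simplification; the real benchmark has LLM agents coordinate in
# natural language rather than merging dictionaries directly).

def merge_views(view_a, view_b):
    """Combine two partial views, each mapping (x, y, z) -> block type."""
    merged = dict(view_a)
    for pos, block in view_b.items():
        # In the real task, a disagreement here would surface as a
        # coordination failure; this sketch just requires overlap to agree.
        if pos in merged and merged[pos] != block:
            raise ValueError(f"conflicting info at {pos}")
        merged[pos] = block
    return merged

# Agent A sees part of the base layer; Agent B sees the rest,
# with one block visible to both.
agent_a_view = {(0, 0, 0): "stone", (1, 0, 0): "stone"}
agent_b_view = {(0, 1, 0): "wood", (1, 0, 0): "stone"}

structure = merge_views(agent_a_view, agent_b_view)
print(len(structure))  # three distinct positions once the views are pooled
```

The point of the exercise: neither `agent_a_view` nor `agent_b_view` alone describes the full structure, so success depends entirely on the exchange step, which is exactly where the benchmark finds models falling down.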
Size Isn't Everything
Here's where it gets interesting. The assumption that stronger reasoning equals better performance doesn't hold up. Some of the smaller open-weight models not only held their ground but sometimes outperformed their larger, supposedly smarter cousins. It turns out that just because a model can chat up a storm doesn't mean it knows how to team up effectively.
So, what does this mean for the future of AI collaboration? Are we expecting too much from these models, or is there a fundamental flaw in how they approach tasks that require coordination? It's clear that improving individual communication skills in AI doesn't automatically lead to successful teamwork.
Unsolved Mysteries
The CRAFT benchmark breaks down where things go awry in these interactions. Failures happen in spatial grounding, belief modeling, and even in the pragmatic communication itself. It offers a taxonomy of where and how things fail, but the takeaway is clear: multi-agent coordination is still a tough nut to crack.
Automation isn't neutral. It has winners and losers. And in multi-agent tasks right now, the losers might just be the tech companies betting on AI to flawlessly handle complex interactions.
Instead of just making models bigger and more complex, perhaps it's time to reconsider how these machines are taught to collaborate. After all, if smaller models can sometimes outperform the big guns, maybe there's something to be learned from their simplicity.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Grounding: Connecting an AI model's outputs to verified, factual information sources.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.
Weight: A numerical value in a neural network that determines the strength of the connection between neurons.