Why Language Models Still Struggle to Chat and Build
CRAFT, a new benchmark, tests language models' ability to cooperate under limited info. Surprising results show bigger isn't always better in AI communication.
When it comes to getting computers to talk to each other and actually collaborate, we're still hitting bumps in the road. Enter CRAFT, a new benchmark that puts large language models to the test in a multi-agent setting. It's not just about chatting: these models have to work together to build a 3D structure, each with only a piece of the puzzle.
The Challenge of Partial Information
In this setup, no single agent can see the whole picture. They all have to communicate in natural language to achieve a shared goal. It sounds simple, right? Yet, the results from CRAFT suggest otherwise. Even with a range of models, including 8 open-weight and 7 frontier models, the task is far from solved.
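To make the partial-information setup concrete, here is a deliberately tiny sketch (a hypothetical simplification, not the actual CRAFT protocol, which uses LLM agents talking in natural language): two agents each hold half of a target 3D blueprint, and neither can finish the build without pooling what the other knows.

```python
# Toy sketch of a partial-information building task (hypothetical
# simplification; the real benchmark has LLM agents coordinate in
# natural language rather than merging dictionaries directly).

def merge_views(view_a, view_b):
    """Combine two partial views, each mapping (x, y, z) -> block type."""
    merged = dict(view_a)
    for pos, block in view_b.items():
        # In the real task, a disagreement here would surface as a
        # coordination failure; this sketch just requires overlap to agree.
        if pos in merged and merged[pos] != block:
            raise ValueError(f"conflicting info at {pos}")
        merged[pos] = block
    return merged

# Agent A sees part of the base layer; Agent B sees the rest,
# with one block visible to both.
agent_a_view = {(0, 0, 0): "stone", (1, 0, 0): "stone"}
agent_b_view = {(0, 1, 0): "wood", (1, 0, 0): "stone"}

structure = merge_views(agent_a_view, agent_b_view)
print(len(structure))  # three distinct positions once the views are pooled
```

The point of the exercise: neither `agent_a_view` nor `agent_b_view` alone describes the full structure, so success depends entirely on the exchange step, which is exactly where the benchmark finds models falling down.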
Size Isn't Everything
Here's where it gets interesting. The assumption that stronger reasoning equals better performance doesn't hold up. Some of the smaller open-weight models not only held their ground but sometimes outperformed their larger, supposedly smarter cousins. It turns out that just because a model can chat up a storm doesn't mean it knows how to team up effectively.
So, what does this mean for the future of AI collaboration? Are we expecting too much from these models, or is there a fundamental flaw in how they approach tasks that require coordination? It's clear that improving individual communication skills in AI doesn't automatically lead to successful teamwork.
Unsolved Mysteries
The CRAFT benchmark breaks down where things go awry in these interactions. Failures happen in spatial grounding, belief modeling, and even in the pragmatic communication itself. It offers a taxonomy of where and how things fail, but the takeaway is clear: multi-agent coordination is still a tough nut to crack.
Automation isn't neutral. It has winners and losers. And in multi-agent tasks right now, the losers might just be the tech companies betting on AI to flawlessly handle complex interactions.
Instead of just making models bigger and more complex, perhaps it's time to reconsider how these machines are taught to collaborate. After all, if smaller models can sometimes outperform the big guns, maybe there's something to be learned from their simplicity.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Grounding: Connecting an AI model's outputs to verified, factual information sources.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.
Weight: A numerical value in a neural network that determines the strength of the connection between neurons.