Why Large Language Models Fail at Basic Agreement Tasks

New research highlights the weaknesses of large language models in consensus settings. Even in simple scenarios, their ability to agree is questionable.
Large language models (LLMs) are touted for their ability to synthesize and generate human-like text. However, their performance falters at basic agreement tasks in team settings. A recent study scrutinized how these models perform in a Byzantine consensus game, a scenario in which multiple agents must agree on a single value despite potential adversarial behavior among their peers.
Testing Consensus
In this study, researchers simulated LLM-based agents in a setting where the agents had no stakes or preferences over the final outcome, isolating their raw ability to reach consensus. The results were telling. Even in these apparently benign conditions, the models struggled to agree. As group size increased, performance sharply declined. Introducing just a few Byzantine agents, ones that deliberately act adversarially, further crippled success rates. A rough sketch of this kind of setup appears below.
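The paper's exact protocol isn't described here, so the following is only a minimal Python sketch of a consensus game in this spirit, not the study's actual methodology. Honest agents vote for the majority value in a noisy sample of messages (a stand-in for an LLM agent's imperfect view of the group), Byzantine agents vote arbitrarily, and a trial succeeds only if the honest agents converge within a round budget. All names and parameters here are illustrative assumptions.

```python
import random

VALUES = ["A", "B"]

def honest_vote(all_votes, sample_size):
    # Majority over a random sample of the messages this agent "heard";
    # the noisy sample stands in for an LLM agent's imperfect group view.
    seen = random.sample(all_votes, min(sample_size, len(all_votes)))
    return max(set(seen), key=seen.count)

def byzantine_vote():
    # Adversarial agent: votes arbitrarily to disrupt convergence.
    return random.choice(VALUES)

def run_trial(n_honest, n_byzantine, max_rounds=10, sample_size=5):
    # One consensus game: success means all honest agents converge on a
    # single value within max_rounds (a liveness bound).
    votes = [random.choice(VALUES) for _ in range(n_honest + n_byzantine)]
    for _ in range(max_rounds):
        honest = [honest_vote(votes, sample_size) for _ in range(n_honest)]
        if len(set(honest)) == 1:  # every honest agent cast the same vote
            return True
        votes = honest + [byzantine_vote() for _ in range(n_byzantine)]
    return False  # timed out: a liveness failure, not a wrong value

trials = 1_000
for n_byz in (0, 2, 4):
    wins = sum(run_trial(n_honest=7, n_byzantine=n_byz) for _ in range(trials))
    print(f"{n_byz} Byzantine agents -> consensus rate {wins / trials:.1%}")
```

Even in a toy model like this, raising the Byzantine count tends to drag the consensus rate down, mirroring the qualitative pattern the study reports.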
Why Agreement Matters
Why should we care about LLMs failing at consensus tasks? These models are increasingly deployed as components of larger systems that require coordination, from automated customer service to multi-agent systems in logistics. If they can't reliably agree under controlled conditions, they can hardly be trusted in messier real-world applications, and on this benchmark, they couldn't.
Failures and Their Implications
The study found that failures were mainly losses of liveness, such as timeouts and stalled convergence, rather than safety violations in which agents settle on conflicting or incorrect values. This suggests the models lack the emergent capability to coordinate effectively, not merely the ability to pick the right answer. The paper, published in Japanese, concludes that reliable agreement is not yet a dependable capability of current LLM-agent groups, even in scenarios where one might expect them to excel. The distinction between the two failure modes matters, as sketched below.
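The liveness-versus-safety split is standard in the consensus literature: a safety violation means agents decide on conflicting values, while a liveness failure means some agent never decides at all. As a hypothetical illustration (the trial record format below is invented, not the paper's), one might bucket outcomes like this:

```python
def classify(trial):
    # Bucket one trial the way the study frames failures:
    #   liveness -> some agent never decided before the timeout
    #   safety   -> agents decided, but on conflicting values
    decisions = trial["decisions"]  # hypothetical record format
    if any(v is None for v in decisions):
        return "liveness"
    if len(set(decisions)) > 1:
        return "safety"
    return "success"

trials = [
    {"decisions": ["A", "A", "A"]},   # everyone agreed
    {"decisions": ["A", None, "A"]},  # one agent stalled past the timeout
    {"decisions": ["A", "B", "A"]},   # conflicting final values
]
print([classify(t) for t in trials])  # ['success', 'liveness', 'safety']
```

By this framing, the study's finding is that most failed trials would land in the "liveness" bucket: the agents didn't agree on something wrong so much as fail to agree at all.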
Rethink Deployments?
Western coverage has largely overlooked this essential finding. As the deployment of LLMs scales across industries, relying on them for robust coordination looks premature. The open question is how long it will take these models to reach the reliability such systems need. Until then, caution and skepticism should guide their use anywhere consensus is key.