QUACKing the Code: How AI is Tackling Social Deduction Games

Social deduction games are the perfect playground for AI. They involve reasoning, deception, and coordination, often pushing Large Language Models (LLMs) to their limits. Yet despite the popularity of these games as benchmarks, evaluations usually focus on outcomes like win rates, leaving a huge gap in understanding how well an AI's language reflects its actions. Enter QUACK, a new open-source environment designed to bridge this gap.

QUACK Explains It All

QUACK doesn't just look at who wins. It evaluates AI agents on three things: game outcomes, behavioral paths, and how consistent their utterances are with what they actually did. The Statement Verification Pipeline is the real hero here. It reconstructs each agent's actions from engine logs and checks the truth of every claim the agent makes during a game. By doing so, it flags issues like spatial hallucinations, baseless accusations, and deceptive language.
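
To make the idea concrete, here is a minimal Python sketch of how a spatial claim could be checked against engine logs. It is not QUACK's actual code: the LogEvent and Claim types, their fields, and the player and room names are all hypothetical, and it assumes claims have already been parsed out of free-form utterances, which is a hard problem in its own right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical engine-log entry: where a player was at a given tick."""
    tick: int
    player: str
    room: str


@dataclass(frozen=True)
class Claim:
    """Hypothetical parsed claim: speaker says subject was in room at tick."""
    speaker: str
    subject: str
    room: str
    tick: int


def build_timeline(events):
    """Reconstruct (player, tick) -> room from the engine log."""
    return {(e.player, e.tick): e.room for e in events}


def verify(claim, timeline):
    """Label a spatial claim as 'true', 'false', or 'unverifiable'."""
    actual = timeline.get((claim.subject, claim.tick))
    if actual is None:
        # The log has no record for this player at this tick.
        return "unverifiable"
    return "true" if actual == claim.room else "false"


# Made-up log and claims, purely for illustration.
events = [
    LogEvent(1, "red", "kitchen"),
    LogEvent(1, "blue", "hall"),
    LogEvent(2, "red", "hall"),
]
claims = [
    Claim("blue", "red", "kitchen", 1),  # matches the log
    Claim("blue", "red", "kitchen", 2),  # contradicts the log
    Claim("red", "green", "hall", 1),    # no log entry for "green"
]

timeline = build_timeline(events)
labels = [verify(c, timeline) for c in claims]
print(labels)
```

A false label on its own does not say whether the speaker was hallucinating or lying on purpose. Telling those apart would need more context, such as the speaker's role and what the speaker could have observed.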

The findings are eye-opening. For instance, even the most advanced AI models hallucinate 15.1% of their spatial claims. And more than half of their accusations? Unsupported by evidence. That's a pretty big deal when you're talking about AI that promises to understand and interact with human players.

Why Should We Care?

So why does this matter? Well, if AI can't ground its language in reality during a game, how reliable is it in real-world applications? Developers are clearly still working out the kinks. But the need for grounded, reliable AI is becoming more critical as we integrate these systems into everyday life.

QUACK offers a toolkit, complete with logs and an evaluation framework, available on GitHub. This opens the door for developers and researchers to really dig in and refine these AIs to make them smarter and more reliable.

Looking Ahead

Here’s the kicker: if AI can’t even get it right in a controlled game environment, what are the odds it will function effectively in the chaos of real-world scenarios? QUACK may not be a final solution, but it’s a significant step forward. It’s making us look beyond mere win rates to question the depth and reliability of AI reasoning.

Gaming is AI's best Trojan horse. It’s the ultimate testing ground for AI capabilities, and QUACK is pushing the envelope. But will it be enough to curb the hallucinations and half-baked accusations? That’s still an open question, but at least now we have the tools to find out.
