LudoBench: A New Playground for Testing AI's Game Smarts

AI can bluff in poker and win at chess, but how does it fare in the chaotic world of Ludo? Enter LudoBench, a new benchmark for testing the strategic reasoning of large language models (LLMs) in this unpredictable multi-agent board game. It poses 480 unique scenarios across 12 decision categories, probing how well models handle the game's random dice rolls, tactical piece captures, safe squares, and home-path rules.

Why Ludo?

Ludo may seem like child's play, but its blend of chance and strategy makes it a formidable test for AI. It's not just about moving pieces; it's about making the smartest move under uncertainty. LudoBench ups the ante with a fully functional 4-player simulator that pits Random, Heuristic, Game-Theory, and LLM agents against each other. The game-theory agent, armed with Expectiminimax search, sets a high strategic bar.
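For readers who haven't met Expectiminimax, here is a minimal sketch of the idea: average over the die rolls at chance nodes, then take the best move for us and the worst for the opponent. Everything in it is an assumption made for illustration, not LudoBench's code: the two-player toy race, the 20-square track, the capture rule, and the names legal_moves, apply_move, evaluate and best_move are all hypothetical. How the benchmark's agent handles four players is not covered here.

```python
DICE = range(1, 7)  # fair six-sided die
GOAL = 20           # toy track length (assumption, not the real Ludo board)

# A state is (our_positions, opponent_positions), each a tuple of ints.

def legal_moves(state, roll, maximizing):
    """Indices of the mover's pieces that have not reached GOAL yet."""
    mine = state[0] if maximizing else state[1]
    return [i for i, pos in enumerate(mine) if pos < GOAL]

def apply_move(state, piece, roll, maximizing):
    """Advance one piece; landing on an enemy piece sends it back to 0."""
    mine, theirs = (state[0], state[1]) if maximizing else (state[1], state[0])
    target = min(mine[piece] + roll, GOAL)
    mine = tuple(target if i == piece else p for i, p in enumerate(mine))
    if target < GOAL:  # no captures on the goal square in this toy
        theirs = tuple(0 if p == target else p for p in theirs)
    return (mine, theirs) if maximizing else (theirs, mine)

def evaluate(state):
    """Toy heuristic: our total progress minus the opponent's."""
    return sum(state[0]) - sum(state[1])

def expectiminimax(state, depth, maximizing):
    """Expected value of `state` just before the side to move rolls."""
    if depth == 0:
        return evaluate(state)
    per_roll = []
    for roll in DICE:  # chance node: every roll is equally likely
        moves = legal_moves(state, roll, maximizing)
        if not moves:  # nothing to move, so the turn passes
            per_roll.append(expectiminimax(state, depth - 1, not maximizing))
            continue
        children = [
            expectiminimax(apply_move(state, m, roll, maximizing),
                           depth - 1, not maximizing)
            for m in moves
        ]
        # We pick our best outcome; the opponent picks our worst.
        per_roll.append(max(children) if maximizing else min(children))
    return sum(per_roll) / len(per_roll)

def best_move(state, roll, depth):
    """Best piece for us to move once our roll is known."""
    return max(
        legal_moves(state, roll, True),
        key=lambda m: expectiminimax(
            apply_move(state, m, roll, True), depth, False),
    )
```

In this toy, best_move(((3, 10), (7, 1)), 4, 1) prefers the capturing move with piece 0 over simply advancing the lead piece, even after averaging over the opponent's possible replies.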

AI's Performance Under the Microscope

So, how do our AI contenders fare? Not so well, it seems. Measured against the game-theory baseline, the models reach an agreement rate of only 40-46%. They reveal themselves as either 'finishers', good at completing pieces but poor at overall strategy, or 'builders', who show the reverse pattern. Neither archetype captures the full game-theory approach. It's like watching a chess player who can only play half the board.
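To make the agreement figure concrete, here is a rough sketch of how such a rate could be computed: the share of scenarios in which a model picks the same move as the baseline, overall and per decision category. The function names, the record layout and the sample labels are assumptions for illustration; the benchmark's own scoring code may differ.

```python
from collections import defaultdict

def agreement_rate(model_moves, baseline_moves):
    """Fraction of scenarios where the model's move equals the baseline's."""
    if len(model_moves) != len(baseline_moves) or not model_moves:
        raise ValueError("need two non-empty move lists of equal length")
    matches = sum(m == b for m, b in zip(model_moves, baseline_moves))
    return matches / len(model_moves)

def agreement_by_category(records):
    """Per-category rate from (category, model_move, baseline_move) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, model_move, baseline_move in records:
        totals[category] += 1
        hits[category] += model_move == baseline_move  # True counts as 1
    return {category: hits[category] / totals[category] for category in totals}
```

A per-category breakdown like this is one way a 'finisher' or 'builder' profile could show up: high agreement in some categories and low agreement in others.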

Prompt-Sensitivity: A Glaring Weakness

What's more, these models are surprisingly sensitive to how the prompt is framed. Introduce a grudge scenario with the same board state, and their behavior shifts. How can we trust AI in real-world applications if a simple change in context throws it off? If it can't handle a board game, how ready is it for the complexities of the real world?

LudoBench is a lightweight and interpretable framework, and what it exposes is AI's current limits. Models still fumble when a decision involves uncertainty and complex planning. It's a stark reminder that AI's strategic prowess has a long way to go. Until then, show me the product that actually works.

All 480 scenarios and their outputs are open for the world to see at https://anonymous.4open.science/r/LudoBench-5CBF/. So, are we ready to admit AI isn't quite the strategic genius we hoped for?
