LLM Judge Panels: More Models, Less Independence
A recent study finds that LLM-judge panels fall well short of ideal independence because their member models share the same mistakes. A single model often performs just as well.
The practice of assembling large language models (LLMs) into judge panels and aggregating their votes is under scrutiny. A new study highlights a critical flaw: these panels don't provide the independence one might expect. Instead of offering diverse evaluations, they often echo the same errors.
What's the Problem?
The study evaluated nine frontier LLMs from seven model families on three natural language inference datasets, each with 100 human annotations per item. The nine-judge panels carried only about two independent votes' worth of information, far fewer than the nine votes they nominally cast. Why? Because the models consistently make the same mistakes.
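To give a feel for what "two independent votes' worth" means, here is a minimal sketch using the textbook Kish design-effect form of effective sample size for equally correlated observations, n_eff = n / (1 + (n - 1) * rho). The article does not state the study's exact estimator, so the equicorrelation assumption, the function names, and the back-calculated correlation below are all illustrative assumptions rather than figures from the paper.

```python
def effective_votes(n_judges: int, mean_corr: float) -> float:
    """Kish-style effective sample size for n equally correlated votes.

    Assumes every pair of judges shares the same error correlation
    `mean_corr` (a simplification; not confirmed as the study's estimator).
    """
    if n_judges < 1:
        raise ValueError("n_judges must be at least 1")
    design_effect = 1.0 + (n_judges - 1) * mean_corr
    if design_effect <= 0:
        raise ValueError("mean_corr is too negative for this panel size")
    return n_judges / design_effect


def implied_correlation(n_judges: int, n_eff: float) -> float:
    """Invert the formula: which mean correlation yields n_eff effective votes?"""
    if n_judges < 2:
        raise ValueError("need at least two judges to define a correlation")
    return (n_judges / n_eff - 1.0) / (n_judges - 1)


# Illustrative back-calculation: nine judges worth about two independent votes.
rho = implied_correlation(9, 2.0)
print(f"implied mean correlation: {rho:.4f}")
print(f"effective votes at that correlation: {effective_votes(9, rho):.2f}")
print(f"effective votes if fully independent: {effective_votes(9, 0.0):.2f}")
```

Under this simplified model, a nine-member panel collapses to about two effective votes once the average pairwise error correlation is a little above 0.4, which shows how quickly shared mistakes erode the value of extra judges.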
Crucially, this lack of independence has measurable consequences. The panel's actual accuracy lags 8 to 22 percentage points behind what truly independent voting would achieve. The kicker? A single well-performing model often rivals or surpasses the panel's performance.
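The "truly independent voting" baseline can be sketched with the classic Condorcet jury calculation: if each judge is right with the same probability and errs independently, majority-vote accuracy follows a binomial tail. The sketch below assumes a binary right/wrong outcome per judge and a hypothetical per-judge accuracy of 0.7; neither the value nor the function name comes from the study, and the paper's own null model may differ in detail.

```python
from math import comb


def majority_accuracy(n_judges: int, p_correct: float) -> float:
    """Probability that a strict majority of independent judges is correct.

    Assumes each judge is correct with probability `p_correct`, independently
    of the others, and that the panel size is odd so ties cannot occur.
    """
    if n_judges < 1 or n_judges % 2 == 0:
        raise ValueError("use a positive, odd panel size to avoid ties")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability")
    k_min = n_judges // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_judges, k) * p_correct**k * (1.0 - p_correct) ** (n_judges - k)
        for k in range(k_min, n_judges + 1)
    )


# Hypothetical per-judge accuracy of 0.7, chosen only for illustration.
for n in (1, 3, 9):
    print(f"{n} independent judges: {majority_accuracy(n, 0.7):.3f}")
```

With these illustrative numbers, independent voting lifts accuracy from 0.70 for one judge to roughly 0.78 for three and roughly 0.90 for nine. A panel whose members share errors gets far less of that lift, which is the kind of gap the study reports.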
Why It Matters
Does adding more judges help? Not really. Even smarter aggregation techniques close at most 11% of the gap, and that is when the aggregation method is given access to the correct answers. The study uses the Kish effective sample size and a Condorcet null model to show that the deficit persists across different prompts, temperatures, and reasoning chains, and on benchmarks such as RewardBench.
The paper's key finding is that the problem lies with correlated judges, not with the aggregation algorithms. Scaling up these panels is therefore no substitute for genuine independence in evaluation.
What's Next?
What does this mean for the future of LLM judge panels? The findings suggest that current practices need reconsidering. If more models don't equate to better results, what is the path forward? Should the focus shift to improving individual model accuracy instead?
In a field that often equates more with better, this study challenges a core assumption. Relying on sheer numbers without checking whether the judges are actually independent could be leading researchers astray. The study's results point to a stark reality: sometimes less truly is more.