Why AI Judges Still Can't Judge Like Us
AI models acting as judges for language tasks often fail to align with human judgment, showing significant variance in scoring and perspective. This reveals a broader challenge in AI-human alignment.
Language models are increasingly used as judges, but there's a catch. Sure, these AI judges tend to agree with each other. Yet their agreement with human raters falls short. So, what's going on here?
The Numbers Game
Researchers tested 41 different language models on four Indic datasets spanning eight languages and compared the models' scores with those of human raters. Spoiler: the AI judges are missing the mark. Their evaluation axis is nearly perpendicular to the humans'. The angle between model and human scoring directions is a staggering 87 to 89 degrees, versus a more reasonable 78 to 81 degrees among the humans themselves.
To put it another way, these AI judges are seeing the world through a very different lens. They're not even using half of the score range humans typically use. If you’ve ever graded a test, you know that could lead to some major misinterpretations.
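For readers who want to see what those two measurements could look like in practice, here is a minimal sketch. It assumes the "angle" is taken between mean-centered score vectors and that range usage is the judge's score span divided by the humans' span. The study may define both differently, and the function names angle_degrees and range_usage are made up for illustration.

```python
import math


def angle_degrees(u, v):
    """Angle in degrees between two mean-centered score vectors.

    Assumption: this is one plausible reading of "angle between
    evaluation axes", not necessarily the study's definition.
    """
    if len(u) != len(v):
        raise ValueError("score vectors must have the same length")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cu = [x - mean_u for x in u]
    cv = [y - mean_v for y in v]
    dot = sum(a * b for a, b in zip(cu, cv))
    sq_u = sum(a * a for a in cu)
    sq_v = sum(b * b for b in cv)
    if sq_u == 0 or sq_v == 0:
        raise ValueError("a constant score vector has no direction")
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos = max(-1.0, min(1.0, dot / math.sqrt(sq_u * sq_v)))
    return math.degrees(math.acos(cos))


def range_usage(judge_scores, human_scores):
    """Fraction of the humans' observed score span that the judge uses."""
    human_span = max(human_scores) - min(human_scores)
    if human_span == 0:
        raise ValueError("human scores have no spread")
    return (max(judge_scores) - min(judge_scores)) / human_span
```

Under this reading, an angle near 0 degrees means the judge ranks items the way humans do, and an angle near 90 degrees means its scores carry almost no information about the human ranking.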
Why This Matters
Here's why this matters for everyone, not just researchers. AI is increasingly stepping into roles that require nuanced judgment, such as content moderation or even legal rulings. If our AI judges aren't aligning with human understanding, it raises concerns about fairness and accuracy. Think of it this way: would you want a judge who sees things from a completely skewed angle?
Fine-Tuning and the Human Gap
Fine-tuning and preference optimization do help spread the scores more evenly: the score spread improved from 0.32 to 1.08. But honestly, that doesn't solve the angle problem. The models remain stuck in that 87 to 88-degree range.
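Why can the spread improve while the angle stays put? A small sketch makes the point, under two assumptions: spread is measured as a standard deviation, and the angle is taken between mean-centered score vectors. The scores below are invented, and stretching scores around their mean is only a stand-in for whatever fine-tuning actually does.

```python
import math
import statistics


def angle_degrees(u, v):
    """Angle in degrees between two mean-centered score vectors."""
    mean_u = statistics.fmean(u)
    mean_v = statistics.fmean(v)
    cu = [x - mean_u for x in u]
    cv = [y - mean_v for y in v]
    dot = sum(a * b for a, b in zip(cu, cv))
    norm = math.sqrt(sum(a * a for a in cu) * sum(b * b for b in cv))
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


# Hypothetical scores for five items on a 1-5 scale.
human = [1.0, 4.0, 2.0, 5.0, 3.0]
judge = [3.1, 3.0, 3.4, 3.2, 2.9]

# Stretch the judge's scores around their own mean by a factor of 3.
judge_mean = statistics.fmean(judge)
stretched = [judge_mean + 3.0 * (s - judge_mean) for s in judge]

spread_before = statistics.pstdev(judge)
spread_after = statistics.pstdev(stretched)
angle_before = angle_degrees(judge, human)
angle_after = angle_degrees(stretched, human)

print("spread:", spread_before, "->", spread_after)
print("angle: ", angle_before, "->", angle_after)
```

The spread triples, yet the angle to the human scores is unchanged, because stretching a vector changes its length and not its direction.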
Here's the thing: only when the researchers calibrated the judges against a small human-anchored set did they see improvements across all rubrics. A 24-billion-parameter Indic judge reached a correlation of 0.184, ahead of GPT-5.5's 0.123. But even then, the models are miles away from human reliability, which sits at 0.474 on verifiable rubrics.
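The article doesn't say how the calibration was done, so the following is only a sketch of the general idea: fit a per-rubric linear map from judge scores to human scores on a small anchor set, then apply it to new judge scores. The helper names fit_linear_calibration and calibrate are hypothetical, and ordinary least squares is my assumption rather than the study's method.

```python
import statistics


def fit_linear_calibration(judge_anchor, human_anchor):
    """Least-squares slope and intercept mapping judge scores to human scores.

    Both lists hold scores for the same anchor items on one rubric.
    """
    if len(judge_anchor) != len(human_anchor):
        raise ValueError("anchor lists must have the same length")
    mean_j = statistics.fmean(judge_anchor)
    mean_h = statistics.fmean(human_anchor)
    var_j = sum((j - mean_j) ** 2 for j in judge_anchor)
    if var_j == 0:
        raise ValueError("judge anchor scores are constant")
    cov = sum((j - mean_j) * (h - mean_h)
              for j, h in zip(judge_anchor, human_anchor))
    slope = cov / var_j
    intercept = mean_h - slope * mean_j
    return slope, intercept


def calibrate(scores, slope, intercept):
    """Apply the fitted linear map to new judge scores."""
    return [slope * s + intercept for s in scores]


# Hypothetical anchor set: the judge scores low and compresses the scale.
slope, intercept = fit_linear_calibration([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
calibrated = calibrate([4.0], slope, intercept)
```

A linear map like this corrects offset and scale only. It leaves the ranking of items, and therefore the correlation with humans, unchanged, so whatever produced the reported gains is presumably richer than this.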
The Real Test
So, should we consider inter-LLM agreement as a sign of alignment with human judgment? Not unless these models can pass a geometric sanity check on their scoring subspace. Otherwise, all we're seeing is a consensus in a confined subspace, not true alignment.
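What might such a sanity check look like? One rough sketch, assuming the same angle-between-centered-scores reading and a made-up threshold of 80 degrees: compare how closely the judges agree with each other against how closely each one agrees with humans. The function sanity_check and its max_human_angle parameter are illustrative, not taken from the study.

```python
import math
from itertools import combinations


def angle_degrees(u, v):
    """Angle in degrees between two mean-centered score vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cu = [x - mean_u for x in u]
    cv = [y - mean_v for y in v]
    dot = sum(a * b for a, b in zip(cu, cv))
    norm = math.sqrt(sum(a * a for a in cu) * sum(b * b for b in cv))
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


def sanity_check(judges, human, max_human_angle=80.0):
    """Compare inter-judge agreement with judge-human agreement.

    judges: dict mapping judge name to a list of scores (at least two judges).
    human: list of human scores for the same items.
    max_human_angle: hypothetical pass threshold in degrees.
    """
    if len(judges) < 2:
        raise ValueError("need at least two judges")
    pair_angles = [angle_degrees(a, b)
                   for a, b in combinations(judges.values(), 2)]
    human_angles = [angle_degrees(s, human) for s in judges.values()]
    return {
        "mean_inter_judge_angle": sum(pair_angles) / len(pair_angles),
        "mean_judge_human_angle": sum(human_angles) / len(human_angles),
        "passes": all(a <= max_human_angle for a in human_angles),
    }


# Two hypothetical judges that agree perfectly with each other
# while scoring at a right angle to the human rater.
report = sanity_check(
    {"judge_a": [1.0, -2.0, 1.0], "judge_b": [2.0, -4.0, 2.0]},
    human=[1.0, 2.0, 3.0],
)
print(report)
```

In this toy case the judges are in perfect agreement with each other and still fail the check, which is exactly the trap of reading inter-LLM consensus as human alignment.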
In a world increasingly reliant on AI for decision-making, this is a wake-up call. If AI can't view scenarios from a human perspective, its role in critical decision-making processes should be limited until we figure this out. The analogy I keep coming back to is trying to fit a square peg into a round hole. It's not just misaligned; it's fundamentally different.
Key Terms Explained
Evaluation: The process of measuring how well an AI model performs on its intended task.
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
GPT: Generative Pre-trained Transformer.
LLM: Large Language Model.