Why AI Judges Still Can't Judge Like Us

Using language models as judges is becoming more common, but there's a catch. These AI judges tend to agree with each other, yet their agreement with human raters falls short. So, what's going on here?

The Numbers Game

Researchers tested 41 different language models on four Indic datasets spanning eight languages and compared the models' scores with human scores. Spoiler: the AI judges are missing the mark. Their evaluation axis is nearly perpendicular to the humans'. The models sit at an angle of 87 to 89 degrees from the human axis, while the angle among humans is a somewhat narrower 78 to 81 degrees.
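One way to make "nearly perpendicular" concrete is to treat each rater's scores over the same items as a vector and measure the angle between two such vectors. The sketch below is an illustration only: the article does not say how the study computed its angles, so the mean-centering step, the function name `angle_degrees`, and the sample scores are all assumptions. With mean-centered vectors, the cosine of the angle equals the Pearson correlation, so 90 degrees means no linear relationship and 0 degrees means perfect agreement in direction.

```python
import math


def angle_degrees(scores_a, scores_b):
    """Angle in degrees between two mean-centered score vectors.

    Assumption: centering each rater's scores first; the study's own
    construction of the evaluation axis may differ.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both raters must score the same items")
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    centered_a = [s - mean_a for s in scores_a]
    centered_b = [s - mean_b for s in scores_b]
    dot = sum(a * b for a, b in zip(centered_a, centered_b))
    norm_a = math.sqrt(sum(a * a for a in centered_a))
    norm_b = math.sqrt(sum(b * b for b in centered_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a rater who gives every item the same score has no direction")
    # Clamp to guard against tiny floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


# Made-up scores on a 1-5 scale, for illustration only.
human = [1, 5, 2, 4, 3]
judge = [3, 3, 4, 4, 2]
print(round(angle_degrees(human, judge), 1))  # 90.0: no linear relationship
```

In this toy case the judge's scores carry no linear information about the human's, which is roughly the situation an angle of 87 to 89 degrees describes.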

To put it another way, these AI judges are seeing the world through a very different lens. They're not even using half of the score range humans typically use. If you’ve ever graded a test, you know that could lead to some major misinterpretations.

Why This Matters

Here's why this matters for everyone, not just researchers. AI is increasingly stepping into roles that require nuanced judgment, such as content moderation or even legal decisions. If our AI judges aren't aligning with human understanding, it raises concerns about fairness and accuracy. Think of it this way: would you want a judge who sees things from a completely skewed angle?

Fine-Tuning and the Human Gap

Fine-tuning and preference optimization do help spread the scores more widely: the score spread improved from 0.32 to 1.08. But that doesn't solve the angle problem. The models remain stuck in the 87 to 88 degree range.
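A small sketch can show why a wider spread and an unchanged angle are compatible. It assumes "score spread" means the standard deviation of a judge's scores, which the article does not confirm, and it models the widening as a uniform stretch around the mean, a deliberate simplification of whatever fine-tuning actually does. The helper names and the sample scores are hypothetical.

```python
import math
import statistics


def spread(scores):
    """Population standard deviation, used here as a stand-in for 'score spread'."""
    return statistics.pstdev(scores)


def centered_cosine(scores_a, scores_b):
    """Cosine of the angle between two mean-centered score vectors."""
    mean_a = statistics.fmean(scores_a)
    mean_b = statistics.fmean(scores_b)
    centered_a = [s - mean_a for s in scores_a]
    centered_b = [s - mean_b for s in scores_b]
    dot = sum(a * b for a, b in zip(centered_a, centered_b))
    sq_a = sum(a * a for a in centered_a)
    sq_b = sum(b * b for b in centered_b)
    return dot / math.sqrt(sq_a * sq_b)


def stretch(scores, factor):
    """Widen scores around their mean by a positive factor."""
    mean = statistics.fmean(scores)
    return [mean + factor * (s - mean) for s in scores]


# Made-up scores: the judge bunches everything near 3.
human = [1, 5, 2, 4, 3]
judge = [3.0, 3.2, 2.9, 3.1, 3.3]
wider = stretch(judge, 3.0)

print(spread(judge), spread(wider))      # the spread triples
print(centered_cosine(human, judge))     # direction relative to the human...
print(centered_cosine(human, wider))     # ...is the same after stretching
```

Stretching changes how much of the scale the judge uses, not which items it ranks above or below its own average, so the angle to the human scores stays where it was.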

Here's the thing: only by calibrating with a small human-anchored set did the researchers see improvements across all rubrics. A 24-billion-parameter Indic judge showed a correlation score of 0.184, ahead of GPT-5.5's 0.123. But even then, both are well short of human reliability, which sits at 0.474 on verifiable rubrics.

The Real Test

So, should we consider inter-LLM agreement as a sign of alignment with human judgment? Not unless these models can pass a geometric sanity check on their scoring subspace. Otherwise, all we're seeing is a consensus in a confined subspace, not true alignment.
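As a rough illustration of such a check, the sketch below compares two numbers: the average angle between pairs of judges and the average angle between each judge and the human scores. It is not the study's procedure. The function names, the sample scores, and the 80-degree cutoff are placeholders chosen for the example, and pairwise angles between score vectors are a simpler stand-in for a full subspace comparison.

```python
import math
from itertools import combinations
from statistics import fmean


def angle_degrees(scores_a, scores_b):
    """Angle in degrees between two mean-centered score vectors."""
    mean_a, mean_b = fmean(scores_a), fmean(scores_b)
    centered_a = [s - mean_a for s in scores_a]
    centered_b = [s - mean_b for s in scores_b]
    dot = sum(a * b for a, b in zip(centered_a, centered_b))
    norm = math.sqrt(sum(a * a for a in centered_a) * sum(b * b for b in centered_b))
    if norm == 0:
        raise ValueError("constant scores have no direction")
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def sanity_check(judge_scores, human_scores, max_human_angle=80.0):
    """Report judge-judge and judge-human angles.

    max_human_angle is an arbitrary placeholder threshold, not a value
    taken from the study.
    """
    if len(judge_scores) < 2:
        raise ValueError("need at least two judges to measure agreement")
    judge_judge = fmean(
        [angle_degrees(a, b) for a, b in combinations(judge_scores.values(), 2)]
    )
    judge_human = fmean(
        [angle_degrees(s, human_scores) for s in judge_scores.values()]
    )
    return {
        "judge_judge_angle": judge_judge,
        "judge_human_angle": judge_human,
        "passes": judge_human <= max_human_angle,
    }


# Made-up scores: the two judges track each other but not the human.
judges = {"judge_a": [1, 2, 3, 4], "judge_b": [1, 2, 3, 4.5]}
human = [2, 0, 0, 2]
print(sanity_check(judges, human))
```

Here the judges sit only a few degrees apart from each other while remaining close to perpendicular to the human scores, so strong agreement among judges coexists with a failed check.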

In a world increasingly reliant on AI for decision-making, this is a wake-up call. If AI can't view scenarios from a human perspective, its role in critical decision-making processes should be limited until we figure this out. The analogy I keep coming back to is trying to fit a square peg into a round hole. It's not just misaligned; it's fundamentally different.