LLM Evaluation: When Machines Judge Machines
A new study reveals inconsistencies in LLM-based NLG evaluation frameworks, pinpointing where machine judges falter and highlighting the need for more reliable metrics.
Large Language Models (LLMs) are increasingly being deployed as judges for automatic Natural Language Generation (NLG) evaluation. However, the reliability of these models at a granular, per-example level remains under scrutiny. A recent diagnostic analysis applied to the SummEval dataset has shed light on the inconsistencies plaguing these systems.
Transitivity Analysis Reveals Inconsistencies
The study takes a dual-pronged approach to diagnosing these issues. First, a transitivity analysis exposes per-input inconsistencies that are masked by low aggregate violation rates (0.8% to 4.1%): 33% to 67% of documents exhibit at least one directed 3-cycle, in which the judge prefers summary A over B and B over C, yet prefers C over A. While the overall statistics look stable, the per-document results tell a different story.
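To make the transitivity check concrete, here is a minimal sketch of how directed 3-cycles could be detected among a judge's pairwise preferences for a single document. The `prefers` mapping and the `has_directed_3cycle` helper are illustrative assumptions, not the paper's actual code.

```python
from itertools import combinations

def has_directed_3cycle(prefers: dict[tuple[str, str], str]) -> bool:
    """Check whether any triple of systems forms a directed 3-cycle.

    `prefers[(a, b)]` names the winner of the pairwise comparison
    between summaries a and b for a single document (ties omitted).
    """
    systems = {s for pair in prefers for s in pair}
    for a, b, c in combinations(systems, 3):
        # Collect the winner of each of the three pairwise match-ups.
        wins = []
        for x, y in ((a, b), (b, c), (a, c)):
            w = prefers.get((x, y)) or prefers.get((y, x))
            if w is None:          # missing comparison: skip this triple
                break
            wins.append(w)
        else:
            # A 3-cycle exists iff each system wins exactly one match-up,
            # e.g. a > b, b > c, c > a.
            if len(set(wins)) == 3:
                return True
    return False

# Toy example: three summaries judged pairwise for one document.
judgments = {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "C"}
print(has_directed_3cycle(judgments))  # True: A > B > C > A
```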
Prediction Sets and Reliability
Second, the research introduces split conformal prediction sets over the 1-5 Likert scores, which come with a theoretical coverage guarantee. The width of these sets serves as a per-instance reliability indicator, with a correlation coefficient of +0.576 across 1,918 instances, and the widths observed point to significant reliability issues. Crucially, the set widths agree across judges, with average cross-judge correlations between 0.32 and 0.38. This consistency suggests that the width captures document-level difficulty rather than noise from individual judges.
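The paper's exact conformal procedure is not reproduced here, but a minimal sketch of split conformal prediction over 1-5 Likert scores, assuming an absolute-error nonconformity score, looks roughly like this; the function name and toy data are illustrative.

```python
import numpy as np

def conformal_sets(cal_pred, cal_true, test_pred, alpha=0.2):
    """Split conformal prediction sets over 1-5 Likert scores.

    Nonconformity = absolute gap between the judge's score and the
    human score on a held-out calibration split.  A test-time set
    then contains every Likert value within the calibrated gap of
    the judge's score, giving 1 - alpha marginal coverage.
    """
    gaps = np.abs(np.asarray(cal_pred) - np.asarray(cal_true))
    n = len(gaps)
    level = np.ceil((n + 1) * (1 - alpha)) / n   # finite-sample correction
    if level > 1:                # too few calibration points: full label set
        q = np.inf
    else:
        q = np.quantile(gaps, level, method="higher")
    likert = np.arange(1, 6)
    return [likert[np.abs(likert - p) <= q] for p in test_pred]

# Toy usage: judge scores vs. human scores on a calibration split.
cal_pred = [4, 3, 5, 2, 3, 3, 5, 1, 4, 2]
cal_true = [5, 3, 4, 2, 2, 3, 5, 1, 4, 3]
sets = conformal_sets(cal_pred, cal_true, test_pred=[4, 1])
widths = [len(s) for s in sets]   # wider set = less reliable judgment
```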
Criterion Over Judge: A Clear Verdict
The diagnostics converge on a critical finding: the evaluation criterion matters more than the choice of judge. For relevance, judgments are the most reliable, with an average set size of approximately 3.0. Coherence is judged moderately reliably, with an average set size of about 3.9. Fluency and consistency, however, remain disappointingly unreliable, averaging a set size of around 4.9. The obvious question follows: is it time to rethink how we evaluate these systems?
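As a rough illustration of how such per-criterion averages could be recomputed from cached results, a hypothetical pandas aggregation might look like the following; the column names and values are invented for the sketch, not taken from the paper's release.

```python
import pandas as pd

# Hypothetical cached results: one row per (document, judge, criterion)
# with the conformal set size computed as in the sketch above.
df = pd.DataFrame({
    "criterion": ["relevance", "coherence", "fluency", "consistency"] * 2,
    "judge":     ["judge_a"] * 4 + ["judge_b"] * 4,
    "set_size":  [3, 4, 5, 5, 3, 4, 5, 5],
})

# Grouping by criterion separates the averages far more than grouping by judge.
print(df.groupby("criterion")["set_size"].mean())
print(df.groupby("judge")["set_size"].mean())
```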
The paper's key contribution is the spotlight it shines on the varied reliability of different criteria. It poses a fundamental challenge to the current practices in LLM evaluations. With the release of all code, prompts, and cached results, researchers have the tools they need to probe deeper into these issues.
Ultimately, this research highlights a pressing need for more reliable evaluation frameworks. As LLMs continue to play a vital role in generating and assessing text, the stakes are high. Can we afford to rely on unreliable metrics when machines are increasingly tasked with judging their own kind?