LLM Benchmarks: A Flawed Measure of True Capability?

The obsession with benchmarks in evaluating large language models (LLMs) often masks deeper issues in understanding true model capabilities. Recent findings suggest that the traditional approach to benchmarking may be more of a mirage than a reliable guide.

Benchmark Blind Spots

Evaluating LLMs through benchmark suites is less straightforward than it seems. The suites examined have an effective dimensionality of only 2.86 to 4.80, so a handful of independent axes account for most of the variation in scores. The blind spots in structural evaluation dwarf the visible gaps by a factor of ten or more. Measured by the Hausdorff distance, which captures the worst-case mismatch between two sets of points, the difference between two models' apparent capabilities can be badly underestimated.
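To make the two quantities above concrete, here is a minimal sketch in Python. It assumes effective dimensionality means the participation ratio of the score covariance eigenvalues, which is one common definition; the article does not say which definition the underlying work uses. The function names and the tiny score tables are hypothetical.

```python
from math import dist


def covariance(scores):
    """Sample covariance between benchmark columns; each row is one model."""
    n = len(scores)
    k = len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in scores]
    return [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(k)]
        for a in range(k)
    ]


def effective_dimensionality(scores):
    """Participation ratio (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Assumed definition. For a symmetric covariance matrix C this equals
    trace(C)^2 / trace(C^2), so no eigendecomposition is needed.
    """
    c = covariance(scores)
    trace = sum(c[i][i] for i in range(len(c)))
    trace_of_square = sum(v * v for row in c for v in row)
    return trace * trace / trace_of_square


def hausdorff(a, b):
    """Hausdorff distance between two finite point sets (Euclidean metric)."""
    def directed(p, q):
        # Farthest that any point of p lies from its nearest neighbour in q.
        return max(min(dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))
```

Two benchmarks that move in lockstep, as in `effective_dimensionality([[1, 2], [2, 4], [3, 6]])`, give a value of 1: the second column adds no independent axis. For the Hausdorff distance, `hausdorff([(0, 0), (1, 0)], [(0, 0), (4, 0)])` is 3, driven entirely by the one point that has no close counterpart in the other set.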

What often goes unsaid is that these benchmarks may overstate a model's competence. When blind spots exceed statistical noise by a factor of up to 127, the uncertainty in our assessment tools is far larger than the noise alone would suggest.

Shuffling the Leaderboard

In practice, leaderboards may be less stable than they appear. Under a chi-squared projection model, the top positions swap with high probability: the top-performing model changes place in a staggering 92% of trials, and nearly three out of five top contenders change rank. This should make us question the reliability of these rankings. Are they truly indicative of a model's performance?
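A rough way to see how easily a top spot can flip is to perturb the scores and count how often the leader changes. The sketch below is not the chi-squared projection model mentioned above, whose details the article does not give; it swaps in plain Gaussian noise on each score as a stand-in. The function name, model names, scores, and noise level are all hypothetical.

```python
import random


def top_flip_rate(scores, noise_sd, trials=2000, seed=0):
    """Fraction of noisy re-evaluations in which the leader changes.

    scores   -- dict mapping model name to its observed score
    noise_sd -- standard deviation of the Gaussian noise (an assumption)
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    names = list(scores)
    baseline_top = max(names, key=lambda m: scores[m])
    flips = 0
    for _ in range(trials):
        noisy = {m: scores[m] + rng.gauss(0.0, noise_sd) for m in names}
        if max(names, key=lambda m: noisy[m]) != baseline_top:
            flips += 1
    return flips / trials


if __name__ == "__main__":
    # Hypothetical leaderboard with closely spaced leaders.
    board = {"model_a": 0.712, "model_b": 0.709, "model_c": 0.701}
    print(top_flip_rate(board, noise_sd=0.01))
```

When the gap between the leaders is smaller than the noise, the flip rate climbs quickly; with no noise it is exactly zero.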

Let's apply some rigor here. A greedy selection algorithm yields a stable core of four benchmarks, and seven of the twelve are enough to secure 90% of coverage. Even with this apparent robustness, retention when transferring between quarters is not perfect, ranging from 93% to 97%. That raises a pressing question: does this residual volatility point to deeper flaws in our evaluation methodology?
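The greedy idea can be sketched as a standard greedy set-cover loop: keep adding the benchmark that covers the most not-yet-covered items until a target fraction is reached. This is an illustration under assumptions, not the algorithm from the underlying work: the article does not define coverage, so here each benchmark is simply mapped to a hypothetical set of capability labels.

```python
def greedy_select(coverage, target=0.9):
    """Pick benchmarks greedily until `target` of all items is covered.

    coverage -- dict mapping benchmark name to the set of items it covers
    Returns (chosen benchmark names in pick order, fraction covered).
    """
    universe = set().union(*coverage.values()) if coverage else set()
    if not universe:
        return [], 0.0
    covered, chosen = set(), []
    while len(covered) < target * len(universe):
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(coverage), key=lambda b: len(coverage[b] - covered))
        gain = coverage[best] - covered
        if not gain:  # nothing left that any benchmark can add
            break
        chosen.append(best)
        covered |= gain
    return chosen, len(covered) / len(universe)


if __name__ == "__main__":
    # Hypothetical benchmarks and capability labels.
    suite = {
        "bench_a": {"math", "code", "logic", "recall"},
        "bench_b": {"logic", "recall", "dialogue"},
        "bench_c": {"dialogue", "safety"},
        "bench_d": {"safety"},
    }
    print(greedy_select(suite, target=0.9))
```

In the hypothetical suite, two of the four benchmarks already cover every label, which mirrors the point that a small core can account for most of the coverage.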

Evaluating Evaluations

A striking result is that the eigenstructure predicts which evaluations are indispensable and which are not. A removal-disruption correlation of -0.69 suggests that some tests are indeed irreplaceable, while adding fresh external evaluations contributes new information, with a correlation of +0.38.

On a separate, theoretical front, a longstanding problem posed by Gardner in 1995 has been resolved. The minimax rate, Theta(R / (kappa * m^(2/(D-1)))), holds in general dimension D and gives an optimal recovery theory. Still, color me skeptical: are these theoretical advances genuinely translating into practical improvements in LLM assessments?
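To get a feel for how the stated rate scales, the expression inside the Theta can be evaluated directly. The sketch below does only that: Theta hides constant factors, so the numbers are meaningful as ratios, not as absolute error values. The article does not define R, kappa, m, or D, so the function treats them as positive numbers with D greater than 1; the function name is hypothetical.

```python
def rate_scale(R, kappa, m, D):
    """Evaluate R / (kappa * m^(2/(D-1))), the term inside the Theta.

    Constant factors are ignored, so compare outputs only as ratios.
    """
    if D <= 1:
        raise ValueError("D must be greater than 1")
    if kappa <= 0 or m <= 0:
        raise ValueError("kappa and m must be positive")
    return R / (kappa * m ** (2.0 / (D - 1)))


if __name__ == "__main__":
    # Effect of quadrupling m at two different values of D.
    for D in (3, 5):
        ratio = rate_scale(1.0, 1.0, 4, D) / rate_scale(1.0, 1.0, 16, D)
        print(D, ratio)
```

The exponent 2/(D-1) shrinks as D grows, so the same increase in m buys less improvement in higher dimensions: quadrupling m cuts the term by a factor of 4 when D is 3, but only by a factor of 2 when D is 5.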

Ultimately, this analysis highlights an unsettling reality. While benchmarks provide a semblance of order, their underlying assumptions and variability call their credibility into question. Are we ready to accept that our current evaluation systems might be fundamentally flawed?
