Why Median Beats Mean: A New Look at Language Model Evaluation
Mean cross-entropy can mislead. In recent experiments, median CE tracks real task performance more closely. Time to rethink how we measure success.
Mean cross-entropy (CE) has long been the go-to metric for validating language models. But guess what? It's not always an accurate measure of quality during training. Shocking, right?
The Qwen2.5 Surprise
Take supervised fine-tuning (SFT) of Qwen2.5-1.5B. In the synthetic fact-learning phase, mean CE climbs sharply after the initial learning period. However, held-out fact-recall accuracy stays steady at its peak. So, what's going on? The rise in mean CE doesn't reflect actual model performance. It's misleading.
In this setup, mean CE is dropping the ball. Median CE seems to be more in tune with real task performance.
Top-K Distillation Drama
Then there's top-K distillation on TinyStories. As K decreases, median CE improves (gets lower), while mean CE gets worse. The Top-5 student model even surpasses its teacher on median CE and earns the highest LLM-judge score, despite having the worst mean CE. And just like that, the leaderboard shifts.
This is wild. How can a supposedly inferior student outperform its teacher? Simple: it is only inferior by mean CE. By median CE, which lines up better with judged quality here, it comes out ahead.
What's the Real Story?
Analyzing the training runs reveals that the empirical per-token CE distribution is being reshaped. In top-K distillation, a smaller K produces a distribution with more mass at the extremes: more tokens with very low CE and more with very high CE. That lowers the median and raises the mean. Meanwhile, in Qwen SFT, the bulk of the distribution saturates quickly, and the tail grows longer in the second half of training.
The verdict is clear. Task-evaluation metrics are more sensitive to the bulk of the per-token CE distribution than to its tail. This changes how we should be validating models.
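To see how reshaping can push the two statistics in opposite directions, here is a minimal sketch in Python. The per-token CE values are made up for illustration and are not taken from the experiments described here; the function name `summarize` is likewise hypothetical.

```python
import statistics

def summarize(per_token_ce):
    """Return (mean, median) of a list of per-token cross-entropy values."""
    return statistics.fmean(per_token_ce), statistics.median(per_token_ce)

# Hypothetical per-token CE values (in nats), invented for this example.
# "before": losses clustered in a moderate range.
before = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
# "after": most tokens get easier, but two tokens become very costly,
# so mass moves toward both extremes.
after = [0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 6.0, 8.0]

mean_before, median_before = summarize(before)
mean_after, median_after = summarize(after)

print(f"before: mean={mean_before:.2f} median={median_before:.2f}")
print(f"after:  mean={mean_after:.2f} median={median_after:.2f}")
```

In this toy case the median drops because most tokens improved, while the mean rises because a couple of tail tokens dominate the sum. That is the same qualitative pattern described above, though real distributions will of course look different.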
The Way Forward
So, what do we do? It's time to report percentile CE summaries alongside the mean. Use concordance among them as a tool to track distribution reshaping. It's a low-cost diagnostic, especially when mean and median disagree on model selection.
In the end, if you're still clinging to mean CE as your only validation guide, you might want to rethink that position. Median CE could be the new sheriff in town.
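One way this reporting could look in practice is sketched below. It is an assumption-laden illustration, not the method used in the experiments: the helper names (`percentile`, `ce_summary`, `best_by`, `concordance`), the choice of percentiles, and the checkpoint data are all hypothetical. Concordance is read here in a simple sense: the fraction of checkpoint pairs that two statistics rank in the same order.

```python
import statistics
from itertools import combinations

def percentile(values, q):
    """Linear-interpolation percentile of values, for q in [0, 100]."""
    xs = sorted(values)
    if not xs:
        raise ValueError("values must be non-empty")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac

def ce_summary(per_token_ce, qs=(25, 50, 75, 95)):
    """Mean plus selected percentiles of per-token CE (qs is an assumed choice)."""
    summary = {"mean": statistics.fmean(per_token_ce)}
    for q in qs:
        summary[f"p{q}"] = percentile(per_token_ce, q)
    return summary

def best_by(summaries, key):
    """Name of the checkpoint with the lowest value of one summary statistic."""
    return min(summaries, key=lambda name: summaries[name][key])

def concordance(summaries, key_a="mean", key_b="p50"):
    """Fraction of checkpoint pairs that key_a and key_b order the same way."""
    pairs = list(combinations(summaries, 2))
    if not pairs:
        raise ValueError("need at least two checkpoints")
    agree = 0
    for x, y in pairs:
        diff_a = summaries[x][key_a] - summaries[y][key_a]
        diff_b = summaries[x][key_b] - summaries[y][key_b]
        if diff_a * diff_b > 0:
            agree += 1
    return agree / len(pairs)

# Hypothetical per-token CE values for three checkpoints (invented data).
checkpoints = {
    "ckpt_a": [1.0, 1.1, 1.2, 1.3, 1.4],
    "ckpt_b": [0.3, 0.4, 0.5, 0.6, 7.0],  # strong bulk, heavy tail
    "ckpt_c": [0.8, 0.9, 1.0, 1.1, 1.2],
}
summaries = {name: ce_summary(ce) for name, ce in checkpoints.items()}

pick_by_mean = best_by(summaries, "mean")
pick_by_median = best_by(summaries, "p50")
mean_median_concordance = concordance(summaries, "mean", "p50")

print("best by mean:  ", pick_by_mean)
print("best by median:", pick_by_median)
print("concordance:   ", round(mean_median_concordance, 2))
```

When the two picks differ, or the concordance drops over the course of training, that is a cue to look at the full per-token distribution before trusting either number for model selection.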