MathCheck Gets Schooled: Paraphrase Errors Shake Up AI Model Rankings
A paraphrase audit knocks GPT-4o off its pedestal, revealing a hidden flaw in AI evaluation. Is it time to rethink how we rank AI models?
AI model rankings just took a hit. MathCheck's latest paraphrase-quality audit peeled back the curtain on how easily rankings can shift when errors surface. Four semantically incorrect paraphrases in 129 groups don't sound like a big deal, right? But removing them toppled GPT-4o from rank 2 to rank 4, letting Claude Haiku and DeepSeek V3 shuffle ahead. It's a shakeup you wouldn't catch with single-model evaluation.
Ranking Games
This isn't just nitpicking. The audit uncovered a deeper flaw in model evaluations. Cross-model unanimity, defined here as at least 3 out of 4 models catching an error, flagged these errors automatically for less than $10. In a wider dataset, 47% of auto-generated paraphrases missed the mark, a glaring issue for AI that aims to understand context.
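To make the 3-out-of-4 rule concrete, here is a minimal sketch of how such a vote could be tallied. It is not the audit's actual code: the function name, the paraphrase IDs, and the vote format (one boolean per judge model, True meaning "this paraphrase changes the meaning") are all assumptions made for illustration.

```python
from typing import Dict, List


def flag_suspect_paraphrases(
    votes: Dict[str, List[bool]], min_agree: int = 3
) -> List[str]:
    """Return IDs of paraphrases that at least `min_agree` judges marked as wrong.

    `votes` maps a (hypothetical) paraphrase ID to one boolean per judge model,
    where True means the judge thinks the paraphrase changes the meaning.
    """
    return [pid for pid, ballot in votes.items() if sum(ballot) >= min_agree]


# Made-up ballots from four judge models, purely for illustration.
votes = {
    "thm_017/p2": [True, True, True, False],    # 3 of 4 object: flagged
    "thm_017/p3": [False, False, True, False],  # 1 of 4 objects: kept
    "thm_042/p1": [True, True, True, True],     # 4 of 4 object: flagged
    "thm_101/p4": [True, False, True, False],   # 2 of 4 object: kept
}

flagged = flag_suspect_paraphrases(votes)
print(flagged)
```

The threshold is the design choice that matters: a stricter rule flags fewer paraphrases but gives more confidence in each flag, while a looser one sends more borderline cases to a human.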
Why does this matter? Because Claude Haiku, boasting 86% accuracy, shows a semantic consistency rate (SCR) of only 50%. That means half its theorems get different answers with paraphrase variations. The gap between accuracy and SCR across nine models is a whopping 32 points. So, standard benchmarks might look good on paper but hide significant weaknesses.
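How can a model score well on accuracy and still post a weak SCR? A rough sketch helps. The definitions below are assumptions based on the description above, not the audit's published formulas: accuracy counts every paraphrase on its own, while SCR counts a theorem as consistent only if all of its paraphrases get the same outcome. The data is invented.

```python
from typing import Dict, List


def accuracy(results: Dict[str, List[bool]]) -> float:
    """Share of individual paraphrases answered correctly."""
    flat = [ok for group in results.values() for ok in group]
    return sum(flat) / len(flat)


def semantic_consistency_rate(results: Dict[str, List[bool]]) -> float:
    """Share of theorems whose paraphrases all received the same outcome.

    Assumed definition: one differing paraphrase makes the whole theorem
    count as inconsistent.
    """
    consistent = sum(1 for group in results.values() if len(set(group)) == 1)
    return consistent / len(results)


# Toy results: 4 theorems x 4 paraphrases, True = correct answer.
results = {
    "thm_a": [True, True, True, True],
    "thm_b": [True, True, True, False],
    "thm_c": [True, True, True, True],
    "thm_d": [True, True, False, True],
}

print(f"accuracy = {accuracy(results):.3f}")
print(f"SCR      = {semantic_consistency_rate(results):.3f}")
```

In this toy set, two misses out of sixteen answers leave accuracy at 87.5%, but because the misses land in different theorems, SCR drops to 50%. A few scattered errors barely dent accuracy and can still halve consistency.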
Model Bias and Benchmarks
Here's the kicker. For any ranking target over nine frontier models, there's a paraphrase weighting that can make it happen. No model Pareto-dominates the others across all paraphrase families. So, when benchmark designers choose paraphrase families, they're picking winners, perhaps without realizing it. It's the No-Free-Benchmark corollary in action.
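A toy example shows the mechanism. The three model names, the three paraphrase families, and every score below are hypothetical. The sketch only demonstrates that when no model Pareto-dominates, changing the family weights changes the winner; it does not prove the full claim about arbitrary rankings.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]


def rank_models(scores: Scores, weights: Dict[str, float]) -> List[str]:
    """Rank models by weighted mean accuracy over paraphrase families."""
    total = sum(weights.values())
    weighted = {
        model: sum(weights[fam] * per_family[fam] for fam in weights) / total
        for model, per_family in scores.items()
    }
    return sorted(weighted, key=weighted.get, reverse=True)


def pareto_dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(a[f] >= b[f] for f in a) and any(a[f] > b[f] for f in a)


# Hypothetical per-family accuracies; each model wins exactly one family.
scores: Scores = {
    "model_a": {"lexical": 0.90, "structural": 0.60, "notational": 0.70},
    "model_b": {"lexical": 0.70, "structural": 0.85, "notational": 0.65},
    "model_c": {"lexical": 0.65, "structural": 0.70, "notational": 0.90},
}

lexical_heavy = {"lexical": 0.8, "structural": 0.1, "notational": 0.1}
structural_heavy = {"lexical": 0.1, "structural": 0.8, "notational": 0.1}
notational_heavy = {"lexical": 0.1, "structural": 0.1, "notational": 0.8}

for name, w in [("lexical", lexical_heavy),
                ("structural", structural_heavy),
                ("notational", notational_heavy)]:
    print(name, "->", rank_models(scores, w))
```

Same models, same answers, three different leaderboards. The only thing that changed is how much each paraphrase family counts.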
FormInv offers a protocol to address this, claiming 100% recall in external benchmarks. Using measures like SCR and Cochran's Q test, it evaluates nine models on hundreds of Lean4-verified theorems. The question is, are we ready to trust benchmarks that can be gamed so easily?
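For readers who haven't met Cochran's Q, it tests whether several matched binary outcomes share the same success rate. The sketch below computes the standard Q statistic in plain Python. How FormInv lays out its data is not described here, so the layout is an assumption: rows are theorems, columns are paraphrase variants, and each cell is 1 if the model answered correctly. The table itself is invented.

```python
from typing import List, Tuple


def cochrans_q(table: List[List[int]]) -> Tuple[float, int]:
    """Cochran's Q statistic and its degrees of freedom for a 0/1 table.

    Q = (k - 1) * (k * sum(C_j^2) - N^2) / (k * N - sum(R_i^2))
    where k is the number of columns, C_j are column totals,
    R_i are row totals, and N is the grand total.
    """
    k = len(table[0])
    if any(len(row) != k for row in table):
        raise ValueError("all rows must have the same number of columns")

    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    grand_total = sum(row_totals)

    denom = k * grand_total - sum(r * r for r in row_totals)
    if denom == 0:
        # Happens when every row is all 0s or all 1s: nothing to compare.
        raise ValueError("Q is undefined when no row varies across columns")

    numer = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    return numer / denom, k - 1


# Assumed layout: 6 theorems (rows) x 3 paraphrase variants (columns).
table = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
]

q, dof = cochrans_q(table)
print(f"Q = {q:.2f}, degrees of freedom = {dof}")
```

Under the null hypothesis, Q is approximately chi-square distributed with k - 1 degrees of freedom. Here Q comes out at 6.5 with 2 degrees of freedom, above the 5% critical value of about 5.99, so this toy model's success rate would be judged to depend on the paraphrase variant.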
Rethinking AI Rankings
Let's be blunt. If a model can't handle nuances in paraphrasing, it's not as smart as we think. AI models need to understand language at a deeper level to be truly effective, and this audit is a stark reminder that the AI industry must improve its evaluation standards.
As AI continues to evolve, we must ask ourselves: are we rewarding true capabilities or getting dazzled by flashy stats? It's time to look beyond raw accuracy and embrace metrics that reflect real-world performance.