What They Are
A benchmark is a test — a standardized set of questions or tasks with known correct answers. You run an AI model through the test, measure how many it gets right, and use the score to compare against other models. It's like a standardized exam for AI.
When OpenAI releases GPT-5 and says it scores 92% on MMLU, they're saying the model answered 92% of a specific set of multiple-choice questions correctly. This gives you a concrete, comparable number instead of vague claims about being "better."
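At its core, a benchmark score is just an accuracy computation: compare each model answer against the answer key and take the fraction correct. A minimal sketch (the function name and letter answers are illustrative, not from any specific benchmark harness):

```python
def score(predictions, answer_key):
    """Fraction of benchmark questions the model answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Four multiple-choice questions, one wrong answer -> 0.75
print(score(["B", "C", "A", "D"], ["B", "C", "A", "A"]))
```

Real evaluation harnesses add complexity (prompt templates, answer extraction, few-shot examples), but the reported number is ultimately this ratio.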
Why They Matter (and Why They Don't)
Benchmarks matter because they provide objective, reproducible measurements. Without them, model comparisons would be nothing but marketing claims. They give researchers and developers a shared language for discussing progress.
But they're also deeply flawed. A model can score well on benchmarks while being terrible at real-world tasks. Benchmark scores don't capture creativity, nuance, reliability, or user experience. Companies optimize for benchmark performance, sometimes at the expense of actual usefulness. And some benchmarks have been "saturated" — models score so high that the benchmark no longer distinguishes between them.
Key Benchmarks
MMLU (Massive Multitask Language Understanding): 14,000 multiple-choice questions across 57 subjects — from astronomy to law to medicine. Tests breadth of knowledge. Top models score 85-90%+.
HumanEval: Tests coding ability with 164 hand-written Python problems. The model receives a function signature and description and must write working code, which is verified by running unit tests. Measures practical programming skill; recent top models score around 85-90%.
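Because code generation is stochastic, HumanEval results are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The original HumanEval paper gives an unbiased estimator for this from n samples of which c pass; a small sketch of that formula:

```python
from math import prod

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c passed the
    unit tests, k is the sampling budget. Returns the probability
    that at least one of k randomly chosen samples passes."""
    if n - c < k:
        # Too few failures to fill k slots without a passing sample.
        return 1.0
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# One of two samples passes, budget of one -> 0.5
print(pass_at_k(n=2, c=1, k=1))
```

Reporting pass@1 rewards getting it right on the first try; pass@100 rewards breadth of plausible attempts, so the two can rank models differently.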
GPQA (Graduate-Level Google-Proof Q&A): Very hard questions written by PhD-level experts. Tests deep reasoning in specific domains. Even PhD experts score only ~65% within their own field, and skilled non-experts with web access score around 34%.
ARC (AI2 Reasoning Challenge): Science exam questions requiring reasoning, not just recall. The Challenge set is built from questions that simple retrieval and word-association methods get wrong, so answers can't be found by pattern matching alone.
MT-Bench and Chatbot Arena: Human preference-based evaluations. Real users compare model outputs side by side and pick the better one. Chatbot Arena's Elo-style rankings are widely considered the most meaningful measure of real-world quality.
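The ranking idea behind Chatbot Arena can be sketched with a classic Elo update: each pairwise vote nudges the winner's rating up and the loser's down, scaled by how surprising the result was. (The Arena leaderboard has since moved to fitting a related Bradley-Terry model over all votes; this is a simplified illustration, and the function name and K-factor are assumptions.)

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """One pairwise comparison between models A and B.
    The logistic curve gives A's expected score; each rating then
    shifts toward the observed outcome, scaled by K."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Two equally rated models: the winner gains 16 points, the loser drops 16
print(elo_update(1000.0, 1000.0, a_wins=True))
```

Upsetting a higher-rated model moves the ratings more than beating a lower-rated one, which is what lets thousands of noisy individual votes converge into a stable leaderboard.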
SWE-bench: Tests ability to solve real GitHub issues — reading codebases, understanding bugs, and writing patches. A harder, more realistic coding benchmark than HumanEval.
The Problems with Benchmarks
Contamination: If benchmark questions leak into training data, the model has effectively memorized the test. Scores go up, but capability doesn't. This is a real and growing problem.
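One common way to hunt for contamination is n-gram overlap: if long word sequences from a benchmark question appear verbatim in the training corpus, the question may have leaked. A toy sketch of that check (function names, the n-gram length, and the threshold are illustrative choices, not a standard from any particular lab):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of all n-word sequences in a text, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(benchmark_item: str, training_doc: str,
                 n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a benchmark item whose n-grams heavily overlap a training doc."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return False
    overlap = len(bench & ngrams(training_doc, n)) / len(bench)
    return overlap >= threshold

question = "the quick brown fox jumps over the lazy dog"
print(contaminated(question, question, n=3))  # exact copy is flagged
```

Production pipelines run checks like this at corpus scale and also look for paraphrased leaks, which simple exact-match n-grams miss entirely.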
Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Labs optimize for benchmark scores, sometimes through tricks that don't improve actual usefulness.
Narrow measurement: Most benchmarks test specific, isolated skills. Real-world tasks require combining many skills, handling ambiguity, and recovering from mistakes — things benchmarks don't capture well.
Saturation: When multiple models score 90%+ on a benchmark, it stops being informative. The field continually needs harder, more discriminating tests.
Where to Go Next
- → Large Language Models — the models being benchmarked
- → How AI Models Are Trained — what benchmarks measure
- → AI Safety — safety evaluations
- → Open Source AI — comparing open models