A standardized test used to measure and compare AI model performance. Examples include MMLU for general knowledge, HumanEval for coding, and ARC for reasoning. Benchmarks are important for tracking progress, but models can be tuned specifically for them, so real-world performance is the better measure.
Benchmarks are standardized tests that let us compare different AI models on the same tasks. They're the SATs of the AI world — imperfect, sometimes gamed, but still the best common yardstick we've got. Popular ones include MMLU (testing knowledge across 57 subjects), HumanEval (coding ability), and HellaSwag (common-sense reasoning).
The problem with benchmarks is that they can become targets rather than measures. When labs optimize specifically for benchmark performance, scores go up but real-world usefulness doesn't always follow. This is an instance of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Some models score brilliantly on MMLU but stumble on basic tasks the benchmark doesn't cover.
The AI community constantly creates new benchmarks as models get better at old ones. When GPT-4 scored above most humans on the bar exam, the goalpost moved to harder tests. There's a growing push for more holistic evaluation — testing not just accuracy but also robustness, fairness, and how well models handle edge cases they've never seen before.
"The new model scores 89% on MMLU, but benchmarks only tell part of the story — you need to test it on your actual use case."
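The advice in that example can be made concrete. Below is a minimal, hypothetical evaluation harness: `model_answer` is a placeholder for a call to whatever model you are testing, and the questions stand in for your actual use case rather than a public benchmark's.

```python
# Minimal sketch of evaluating a model on your own use case instead of
# relying on benchmark scores alone. `model_answer` is a hypothetical
# stand-in for a call to whatever model you are testing.

def model_answer(question: str) -> str:
    # Placeholder: in practice this would call your model's API.
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
        "Opposite of 'hot'?": "cold",
    }
    return canned.get(question, "unknown")

def evaluate(test_cases: list) -> float:
    """Return the fraction of test cases the model answers correctly."""
    correct = sum(model_answer(q) == expected for q, expected in test_cases)
    return correct / len(test_cases)

# Questions drawn from your real workload, with known-good answers.
cases = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of 'hot'?", "warm"),  # deliberately failing case
]
print(f"accuracy: {evaluate(cases):.2f}")  # prints "accuracy: 0.67"
```

The point is that the test set reflects what you actually need the model to do, which a public benchmark score cannot guarantee.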
MMLU: Massive Multitask Language Understanding.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Activation function: A mathematical function applied to a neuron's output that introduces non-linearity into the network.
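As a concrete illustration, here is a minimal sketch of two widely used activation functions, ReLU and sigmoid. The names are standard, but these are illustrative implementations, not any particular library's API.

```python
import math

# Two common activation functions. Applying one to a neuron's weighted
# sum introduces non-linearity, letting a stack of layers model curved
# decision boundaries that purely linear layers cannot.

def relu(x: float) -> float:
    """Rectified Linear Unit: passes positives through, zeroes negatives."""
    return max(0.0, x)

def sigmoid(x: float) -> float:
    """Squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-2.0), relu(3.0))   # prints "0.0 3.0"
print(round(sigmoid(0.0), 2))  # prints "0.5"
```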
Adam: An optimization algorithm that combines the best parts of two other methods, AdaGrad and RMSProp.
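To make that concrete, here is a hedged sketch of a single Adam update for one scalar parameter, using the default hyperparameters from the original paper. `adam_step` is an illustrative name, not a real library call.

```python
import math

# One Adam update step for a single parameter. beta1 smooths the
# gradient (first moment); beta2 smooths the squared gradient (second
# moment), giving the per-parameter adaptive scaling inherited from
# AdaGrad/RMSProp. The bias corrections fix the zero-initialized
# moments during early steps.

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(p) = p^2, whose gradient is 2 * p.
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    p, m, v = adam_step(p, 2 * p, m, v, t)
print(round(p, 4))  # p has moved slightly below 1.0, toward the minimum at 0
```

Note how the step size is roughly `lr` regardless of the gradient's scale, which is a key practical property of Adam.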
AGI: Artificial General Intelligence.
Agent: An autonomous AI system that can perceive its environment, make decisions, and take actions to achieve goals.
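The perceive-decide-act loop in that definition can be sketched with a toy example. The thermostat agent and every name below are hypothetical, chosen only to illustrate the loop, not a real agent framework.

```python
# Minimal sketch of a perceive -> decide -> act loop, using a toy
# thermostat "agent" whose goal is to hold a target temperature.

class ThermostatAgent:
    def __init__(self, target: float):
        self.target = target

    def perceive(self, environment: dict) -> float:
        """Read the relevant state from the environment."""
        return environment["temperature"]

    def decide(self, temperature: float) -> str:
        """Choose an action that moves toward the goal."""
        if temperature < self.target - 1:
            return "heat"
        if temperature > self.target + 1:
            return "cool"
        return "idle"

    def act(self, action: str, environment: dict) -> None:
        """Apply the chosen action back to the environment."""
        delta = {"heat": 1.0, "cool": -1.0, "idle": 0.0}[action]
        environment["temperature"] += delta

env = {"temperature": 16.0}
agent = ThermostatAgent(target=20.0)
for _ in range(5):  # repeat the loop until the goal band is reached
    action = agent.decide(agent.perceive(env))
    agent.act(action, env)
print(env["temperature"])  # prints "19.0" -- within 1 degree of the target
```

Real AI agents replace the hand-written `decide` step with a model, but the surrounding loop has the same shape.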