Cracking the Code: AI's Struggle with Math Reasoning Benchmarks
Korean college math test data reveals AI's alignment failures. Can AI match human reasoning? The results show surprising patterns.
Math reasoning benchmarks have multiplied, but they still miss one essential element: difficulty signals based on real human performance. Enter KCSAT-ML, a benchmark spanning a decade of the Korean College Scholastic Ability Test (KCSAT) with 664 math problems, including a core set of 339 items that carry official error rates drawn from thousands of students. The benchmark is paired with Difficulty-aligned Reasoning Gain (DRG), a metric that reveals whether a model's mistakes align with human difficulty.
AI's Patterns of Failure
The results across a range of vision-language models (VLMs) and large language models (LLMs) that receive the problems through OCR aren't flattering. First, accuracy at low inference budgets nosedives on the problems where human error peaks, regardless of model size. Second, under test-time scaling (TTS), token usage rises linearly with human error rates, yet the accuracy gains are anything but linear. Finally, within a single model family, TTS can show a dual nature: struggling with the toughest items while overcomplicating simpler ones. This exposes a core alignment flaw.
Why It Matters
Here's the kicker: models with similar accuracy can perform entirely differently on the DRG scale. One might falter on complex problems humans also find tough, while another excels at the complex but stumbles over what humans find simple. Accuracy alone masks these vital distinctions. In a world where AI is expected to replicate or even surpass human reasoning, these findings suggest we're not there yet. Slapping a model on a GPU rental isn't a convergence thesis. If AI struggles with what we find easy, how can it be truly agentic?
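To make the "same accuracy, different failure pattern" point concrete, here is a minimal sketch. It does not implement DRG, whose formula isn't given here. As a stand-in it uses a plain Pearson correlation between each item's human error rate and whether the model got it wrong. The function names, the six-item dataset, and both models' answer patterns are hypothetical and for illustration only.

```python
from math import sqrt


def accuracy(correct):
    """Fraction of items the model answered correctly."""
    return sum(correct) / len(correct)


def failure_alignment(human_error_rates, correct):
    """Pearson correlation between human error rate and model failure.

    Illustrative stand-in, NOT the DRG metric. A positive value means the
    model tends to miss the items humans miss; a negative value means it
    tends to miss the items humans find easy.
    """
    failures = [0.0 if c else 1.0 for c in correct]
    n = len(failures)
    mean_x = sum(human_error_rates) / n
    mean_y = sum(failures) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(human_error_rates, failures))
    var_x = sum((x - mean_x) ** 2 for x in human_error_rates)
    var_y = sum((y - mean_y) ** 2 for y in failures)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: six items, sorted by human error rate.
human_error = [0.10, 0.20, 0.30, 0.60, 0.80, 0.90]
model_a = [True, True, True, True, False, False]  # misses the two hardest items
model_b = [False, False, True, True, True, True]  # misses the two easiest items

for name, answers in (("model_a", model_a), ("model_b", model_b)):
    print(name, accuracy(answers), failure_alignment(human_error, answers))
```

Both toy models score the same accuracy, yet the alignment value is positive for the first and negative for the second. That is the kind of distinction a difficulty-aware metric is meant to surface.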
The Path Forward
The open sourcing of this dataset could drive home a critical message: raw accuracy isn't the only game in town. Understanding where and why models fail compared to humans can open up new pathways for model training and alignment. Are we ready to confront the real limitations of AI? If the AI can hold a wallet, who writes the risk model? It's time we ask these questions and push for genuine improvements.