Can AI Really Master Mathematics? New Benchmark Says Not Yet
LiveMathematicianBench evaluates AI's ability in research-level math. Current models fall short, revealing the gap between memorization and true understanding.
Mathematics might be the ultimate test of intelligence. Yet, despite the power of large language models (LLMs), their ability to perform research-level math is questionable. Enter LiveMathematicianBench, a new benchmark aimed at evaluating these capabilities in a rigorous and realistic manner.
A New Era for Math Benchmarks
LiveMathematicianBench sidesteps synthetic data and the contamination problems that plague older benchmarks. Instead, it draws on recent theorems from arXiv papers published after the models' training cutoffs, so models are evaluated on problems they cannot have memorized, pushing beyond pattern recognition.
The benchmark introduces a taxonomy of 13 logical theorem types, ranging from implications to uniqueness statements, allowing a nuanced look at how well models handle different forms of reasoning. The paper's key contribution is a proof-sketch-guided distractor pipeline that generates plausible but invalid answer choices, probing the depth of a model's understanding rather than its recall.
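The paper does not spell out its pipeline here, but one plausible reading of "proof-sketch-guided distractors" is to perturb the true statement at a logically load-bearing point (a quantifier, a bound, a uniqueness claim) so the result stays superficially similar but becomes invalid. A minimal toy sketch of that idea, with all swap rules invented for illustration:

```python
# Hypothetical sketch, NOT the paper's actual pipeline: generate distractor
# statements by flipping one logically crucial element of the true theorem.
# The swap rules below are illustrative assumptions, not from the paper.
SWAPS = [
    ("for all", "there exists"),   # weaken a universal quantifier
    ("unique", "at least one"),    # drop a uniqueness claim
    ("<=", ">"),                   # invert an inequality bound
]

def perturb(statement: str) -> list[str]:
    """Return candidate distractors, each differing by one logical flip."""
    distractors = []
    for original, replacement in SWAPS:
        if original in statement:
            distractors.append(statement.replace(original, replacement, 1))
    return distractors

theorem = "for all n >= 1, f(n) <= n"
for d in perturb(theorem):
    print(d)
```

A real pipeline would of course operate on parsed logical structure guided by the proof sketch, not raw strings; this only illustrates the shape of the idea.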
Current Results: Subpar Performance
How do the models fare? Not well. Gemini-3.1-pro-preview, the top performer, scores just 43.5%. Under a stricter substitution-resistant evaluation, results plummet: GPT-5.4 leads with a mere 30.6%, while Gemini-3.1-pro-preview drops to 17.6%, below random guessing.
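The article doesn't define the substitution-resistant protocol, but one natural interpretation is to score each problem across several substituted variants and credit the model only when it gets all of them right, so surface-level pattern matching stops paying off. A small sketch of how lenient versus strict scoring could diverge (the data is made up for illustration):

```python
# Illustrative sketch, assuming a "substitution-resistant" score means:
# credit a problem only if every substituted variant is answered correctly.
def lenient_accuracy(results: list[list[bool]]) -> float:
    """Problem counts if ANY variant was answered correctly."""
    return sum(any(variants) for variants in results) / len(results)

def strict_accuracy(results: list[list[bool]]) -> float:
    """Problem counts only if ALL variants were answered correctly."""
    return sum(all(variants) for variants in results) / len(results)

# Fabricated per-problem correctness over three variants each:
runs = [
    [True, True, True],     # robust: correct on every rephrasing
    [True, False, True],    # brittle: fails a renamed variant
    [False, False, False],  # wrong everywhere
    [True, True, False],    # brittle
]
print(lenient_accuracy(runs))  # 0.75
print(strict_accuracy(runs))   # 0.25
```

A gap like this between the two scores is exactly the memorization-versus-understanding signature the benchmark is designed to expose.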
The ablation study reveals something important: when models are given proof sketches, their accuracy improves. They can follow a high-level strategy when handed one, but without that guidance their performance falters. The takeaway? Current models excel at matching memorized patterns, while genuine mathematical reasoning remains elusive.
Why This Matters
The implications are clear. For AI to genuinely assist in scientific workflows, it must do more than echo known patterns; real-world applications demand understanding and creativity. Can AI reach that level in mathematics? The jury is still out, but LiveMathematicianBench offers a path to finding out.
Code and data are available in the project's repository, inviting further exploration and improvement. For researchers and developers, the challenge is set: the key finding isn't just today's scores, but how far the field still has to travel toward genuinely intelligent AI.
So, why isn't AI good at math yet? Perhaps it's because math requires not only computation but intuition, a leap AI hasn't made. The stakes are high, and the field is ripe for innovation. Will your model be the one to conquer the LiveMathematicianBench challenge?