ComBench: Testing AI's Creative Math Limits

Combinatorial reasoning has long been a cornerstone of Olympiad-level math, demanding both creativity and deep structural insight. Yet even the most advanced AI models still struggle to match human ingenuity in this domain. ComBench is a new benchmark designed to test large language models on Olympiad-level combinatorics.

Why ComBench Matters

ComBench is a benchmark of 100 human-annotated problems, split into analysis-centric and construction-centric challenges. The distinction matters: analysis-centric problems demand rigorous proofs, while construction-centric ones require a tangible solution along with a justification of its correctness. That second requirement is where the real challenge lies.

The paper's key contribution is its unique evaluation protocol. By combining rubric-guided proof grading with deterministic construction verification, ComBench reveals where current AI models falter. It exposes the disconnect between proof quality and construction validity, a gap that plays out in the results of leading AI models.
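To make "deterministic construction verification" concrete, here is a minimal sketch of what such a checker could look like. The toy problem (a set of k integers in [1, n] with no three-term arithmetic progression) and the function name `verify_ap_free` are illustrative assumptions, not taken from ComBench itself.

```python
from itertools import combinations


def verify_ap_free(candidate, n, k):
    """Check a submitted construction for a hypothetical toy problem.

    The construction is accepted only if it consists of exactly k distinct
    integers in [1, n] and contains no three-term arithmetic progression.
    """
    values = sorted(set(candidate))
    # Reject duplicates or a wrong number of elements.
    if len(candidate) != k or len(values) != k:
        return False
    # Reject non-integers and out-of-range values.
    if any(not isinstance(v, int) or v < 1 or v > n for v in values):
        return False
    present = set(values)
    # For each pair a < b, the progression a, b, 2b - a must not be complete.
    for a, b in combinations(values, 2):
        if 2 * b - a in present:
            return False
    return True


print(verify_ap_free([1, 2, 4, 5], n=5, k=4))  # valid construction
print(verify_ap_free([1, 2, 3, 5], n=5, k=4))  # contains 1, 2, 3
```

A checker like this returns the same verdict on every run and does not depend on how convincing the accompanying proof reads, which is why a model can score well on proof grading while still submitting an invalid object.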

AI Models Under the Microscope

Experiments on both open- and closed-source models show substantial room for improvement. The strongest model so far achieves a 65.4% overall average and 75.3% on Best@4. Is that good enough? Hardly.
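The gap between the two headline numbers is easier to read with a small sketch of how such metrics are commonly computed. This assumes Best@4 means taking the best of four sampled attempts per problem and then averaging over problems; the article does not spell out the definition, and the scores below are made-up toy values, not ComBench results.

```python
def overall_average(scores):
    """Mean score over every attempt on every problem."""
    attempts = [s for problem in scores for s in problem]
    return sum(attempts) / len(attempts)


def best_at_k(scores, k=4):
    """Mean over problems of the best score among the first k attempts.

    Assumed definition of Best@k; the benchmark's own may differ.
    """
    return sum(max(problem[:k]) for problem in scores) / len(scores)


# Hypothetical per-problem scores in [0, 1], four attempts each.
toy_scores = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.5, 1.0],
]

print(f"overall average: {overall_average(toy_scores):.3f}")
print(f"best@4:          {best_at_k(toy_scores):.3f}")
```

Under this reading, Best@4 can never fall below the overall average, so the difference between the two figures reflects how inconsistent a model is across repeated attempts at the same problem.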

The study also shows that different models excel in distinct areas. Kimi-K2.6, for instance, lags behind GPT-5.5 in proof grading but outperforms it on construction-centric tasks. In line with prior benchmarking studies, this shows that rigorous proof reasoning and constructive realization aren't the same capability.

The Uncharted Frontier of AI

Existence and Construction problems remain the toughest category. They consistently stump even the best models, which points to a significant gap in AI's current capabilities, one that demands more than incremental improvements.

So, why should readers care? In an era where AI is rapidly evolving, understanding its limitations is just as important as celebrating its achievements. ComBench is a litmus test for AI's creative reasoning, highlighting where machines still lag behind human problem-solving prowess. It's a call to action for researchers to push the boundaries of what's possible, to explore new methodologies that could bring machines closer to human-level reasoning in combinatorics.
