The New Battleground: Large Language Models in Formal Proofs
Large Language Models are reshaping the field of formal mathematical proofs. Gemini 3.1 Pro and Claude Opus 4.7 lead in performance, setting a new bar for efficiency and accuracy.
In recent years, Large Language Models (LLMs) have transformed from novelties to powerful tools in fields like natural language processing and, increasingly, formal mathematical proofs. Their ability to generate proofs more accurately and efficiently is creating a shift in how mathematicians and researchers approach problem-solving.
LLMs and Formal Proofs: A New Era
The competitive landscape in formal proofs shifted this quarter, as LLMs like Gemini 3.1 Pro and Claude Opus 4.7 showcased their prowess. According to the data, Gemini 3.1 Pro scored an impressive 92% success rate on the miniF2F dataset using refine@32. Meanwhile, Claude Opus 4.7 wasn't far behind, achieving an 86% success rate on miniCTX.
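To make a figure like "92% with refine@32" concrete, here is a minimal Python sketch of how a success rate under an attempt budget can be computed. It assumes, and this is an assumption rather than something the reported data spells out, that refine@32 means a problem counts as solved if any of up to 32 attempts is accepted by the proof checker. The function name and the toy data are hypothetical.

```python
from typing import Sequence


def success_rate_at_budget(attempts: Sequence[Sequence[bool]], budget: int) -> float:
    """Fraction of problems with at least one accepted proof within `budget` attempts.

    Assumption: each inner sequence lists one problem's attempts in order,
    and True means the proof checker accepted that attempt.
    """
    if not attempts:
        raise ValueError("no problems supplied")
    if budget < 1:
        raise ValueError("budget must be at least 1")
    solved = sum(1 for outcomes in attempts if any(outcomes[:budget]))
    return solved / len(attempts)


# Hypothetical toy data, not benchmark results.
toy_results = [
    [False, True],          # accepted on attempt 2
    [False, False, False],  # never accepted
    [True],                 # accepted on attempt 1
    [False, False, True],   # accepted on attempt 3
]

rate_at_2 = success_rate_at_budget(toy_results, budget=2)
rate_at_32 = success_rate_at_budget(toy_results, budget=32)
print(f"budget=2: {rate_at_2:.0%}, budget=32: {rate_at_32:.0%}")
```

The same model can look very different at different budgets, which is why the attempt budget belongs next to any headline success rate.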
For those pursuing projects that involve complex mathematical proofs, these numbers aren't just impressive; they're transformative. With success rates this high, LLMs aren't just supporting human efforts. They're starting to lead them.
Balancing Cost and Efficiency
Here's how the numbers stack up on cost-efficiency. NVIDIA Nemotron 3 Super and GPT-OSS 120B emerged as the most cost-effective models, delivering solid accuracy at less than $0.01 per correct proof. When budget constraints are at play, cost per correct proof matters more than the headline accuracy number.
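Cost per correct proof is simple arithmetic: total spend divided by the number of proofs that passed verification. The Python sketch below illustrates it with made-up inputs; the function name, the dollar amount, and the proof count are hypothetical and are not taken from the reported results.

```python
def cost_per_correct_proof(total_cost_usd: float, correct_proofs: int) -> float:
    """Total spend divided by the number of proofs the checker accepted."""
    if correct_proofs <= 0:
        raise ValueError("need at least one correct proof to compute a per-proof cost")
    return total_cost_usd / correct_proofs


# Hypothetical run: $1.50 of inference spend yields 200 verified proofs.
example_cost = cost_per_correct_proof(total_cost_usd=1.50, correct_proofs=200)
under_one_cent = example_cost < 0.01
print(f"${example_cost:.4f} per correct proof, under one cent: {under_one_cent}")
```

Failed attempts still cost money under this measure, so a cheaper model with a lower success rate can come out ahead of or behind a pricier one depending on how many attempts each correct proof takes.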
So, what does this mean for the field of formal mathematics? It's a wake-up call for traditionalists to adapt or risk obsolescence. As LLMs become more sophisticated and cost-effective, the traditional methods of proof generation may soon be outdated.
The Future of Mathematical Proofs
However, beyond these numbers lies a more profound question: How will the role of human mathematicians evolve? While LLMs can churn out proofs with remarkable precision, the creative and intuitive aspects of mathematics, elements that machines still struggle to replicate, remain a human domain for now.
But as the technology progresses, it's not far-fetched to imagine a future where LLMs not only assist but possibly lead in mathematical discovery. The benchmark results tell the story, and it's clear that the integration of AI into formal proofs is more than just a trend. It's the future.
Key Terms Explained
Claude: Anthropic's family of AI assistants, including Claude Haiku, Sonnet, and Opus.
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.
GPT: Generative Pre-trained Transformer.
Natural Language Processing (NLP): The field of AI focused on enabling computers to understand, interpret, and generate human language.