LLM Proof Verifiers: Not Quite Frontier Yet, But Close
Frontier models are shining in math competitions, but smaller LLMs aren’t far behind. With a bit of tweaking, they can hold their own.
AI is proving its mettle in math competitions, solving problems that stump humans. Frontier reasoning models are leading the charge, but can smaller, open-source models catch up? They’re not there yet, but they’re close.
The Numbers Game
Evaluating large language models (LLMs) is more than a checkbox exercise. We're talking numbers. Smaller open-source models came within roughly 10% of frontier models in accuracy. That's impressive. But on consistency, they fall short by up to 25%. Consistency matters if you're banking on these models to verify proofs correctly every time.
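To make the accuracy-versus-consistency distinction concrete, here's a minimal sketch (not from the article's evaluation code) that scores a verifier sampled several times per proof: accuracy is how often the majority verdict matches the ground truth, and self-consistency is how strongly the samples agree with each other.

```python
from collections import Counter

def accuracy_and_consistency(runs, gold):
    """Score a proof verifier that was sampled several times per proof.

    runs: one list of sampled verdicts per proof, e.g. ["valid", "valid", "invalid"]
    gold: the ground-truth verdict for each proof
    Returns (accuracy of the majority verdict, mean self-consistency).
    """
    correct = 0
    agreement = 0.0
    for verdicts, truth in zip(runs, gold):
        majority, freq = Counter(verdicts).most_common(1)[0]
        correct += majority == truth
        agreement += freq / len(verdicts)  # fraction agreeing with the majority
    return correct / len(gold), agreement / len(gold)
```

A model can look fine on accuracy while flunking consistency: if it votes 2-1 on every proof, the majority verdict may still be right, but the per-proof agreement sits at 0.67, which is exactly the failure mode the article flags.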
Why the gap? It's not a lack of capability. These smaller models have the math chops. The issue? They struggle with generic prompts. It's like expecting a sprinter to run a marathon without specific training.
Prompting a Way Forward
The secret sauce? Targeted prompting. A systematic prompt search helped expose the hidden capabilities of these smaller models. By using an ensemble of specialized prompts, accuracy jumped by up to 9.1%, and self-consistency surged by 15.9%. That’s not just a bump. It’s a leap.
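The ensemble idea can be sketched in a few lines: ask the model the same question under several specialized prompts, then majority-vote the verdicts. The prompt texts and the `ask_model` callable below are illustrative stand-ins, not the study's actual prompts or API.

```python
from collections import Counter

# Hypothetical specialized prompt variants -- the study's real prompts
# aren't reproduced in the article.
PROMPTS = [
    "Check each step of this proof for logical validity. Verdict (valid/invalid):\n{proof}",
    "Act as a strict grader. Is this proof correct? Answer valid or invalid:\n{proof}",
    "List any unjustified steps, then give a final verdict (valid/invalid):\n{proof}",
]

def ensemble_verify(proof, ask_model):
    """Query the model once per specialized prompt and majority-vote.

    ask_model: callable(prompt_text) -> "valid" or "invalid"; stands in
    for whatever LLM API you use (an assumption of this sketch).
    """
    verdicts = [ask_model(p.format(proof=proof)) for p in PROMPTS]
    return Counter(verdicts).most_common(1)[0][0]
```

Because each prompt probes the proof from a different angle, a single prompt's blind spot gets outvoted, which is one plausible mechanism behind the reported jump in self-consistency.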
Now, models like Qwen3.5-35B can hold their heads high, performing on par with top-tier models such as Gemini 3.1 Pro. This shows that with the right guidance, smaller models can punch above their weight.
Why This Matters
Here’s the million-dollar question: should we care? Absolutely. Democratizing AI capabilities means more people can access powerful tools without shelling out for frontier models. It’s about making new technology available to all. Because why should only the big players have all the fun?
As AI continues to integrate into everyday applications, the line between frontier and smaller models is blurring. If smaller models can verify proofs with precision, what's stopping them from tackling other complex tasks?
Key Terms Explained
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.
Prompt: The text input you give to an AI model to direct its behavior.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.
Reasoning models: AI systems specifically designed to "think" through problems step-by-step before giving an answer.