LLM Proof Verifiers: Not Quite Frontier Yet, But Close
Frontier models are shining in math competitions, but smaller LLMs aren’t far behind. With a bit of tweaking, they can hold their own.
AI is proving its mettle in math competitions, solving problems that stump humans. Frontier reasoning models are leading the charge, but can smaller, open-source models catch up? They’re not there yet, but they’re close.
The Numbers Game
Evaluating large language models (LLMs) is more than a checkbox exercise. We're talking numbers. Smaller open-source models came within roughly 10% of frontier models in accuracy. That's impressive. But on consistency, they fall short by up to 25%. Consistency matters if you're banking on these models to verify proofs correctly every time.
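To make the accuracy-versus-consistency distinction concrete, here's a minimal sketch (not from the article's evaluation code) that scores a verifier sampled several times per proof: accuracy is how often the majority verdict matches the ground truth, and self-consistency is how strongly the samples agree with each other.

```python
from collections import Counter

def accuracy_and_consistency(runs, gold):
    """Score a proof verifier that was sampled several times per proof.

    runs: one list of sampled verdicts per proof, e.g. ["valid", "valid", "invalid"]
    gold: the ground-truth verdict for each proof
    Returns (accuracy of the majority verdict, mean self-consistency).
    """
    correct = 0
    agreement = 0.0
    for verdicts, truth in zip(runs, gold):
        majority, freq = Counter(verdicts).most_common(1)[0]
        correct += majority == truth
        agreement += freq / len(verdicts)  # fraction agreeing with the majority
    return correct / len(gold), agreement / len(gold)
```

A model can look fine on accuracy while flunking consistency: if it votes 2-1 on every proof, the majority verdict may still be right, but the per-proof agreement sits at 0.67, which is exactly the failure mode the article flags.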
Why the gap? It's not a lack of capability. These smaller models have the math chops. The issue? They struggle with generic prompts. It's like expecting a sprinter to run a marathon without specific training.
Prompting a Way Forward
The secret sauce? Targeted prompting. A systematic prompt search helped expose the hidden capabilities of these smaller models. By using an ensemble of specialized prompts, accuracy jumped by up to 9.1%, and self-consistency surged by 15.9%. That’s not just a bump. It’s a leap.
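The ensemble idea can be sketched in a few lines: ask the model the same question under several specialized prompts, then majority-vote the verdicts. The prompt texts and the `ask_model` callable below are illustrative stand-ins, not the study's actual prompts or API.

```python
from collections import Counter

# Hypothetical specialized prompt variants -- the study's real prompts
# aren't reproduced in the article.
PROMPTS = [
    "Check each step of this proof for logical validity. Verdict (valid/invalid):\n{proof}",
    "Act as a strict grader. Is this proof correct? Answer valid or invalid:\n{proof}",
    "List any unjustified steps, then give a final verdict (valid/invalid):\n{proof}",
]

def ensemble_verify(proof, ask_model):
    """Query the model once per specialized prompt and majority-vote.

    ask_model: callable(prompt_text) -> "valid" or "invalid"; stands in
    for whatever LLM API you use (an assumption of this sketch).
    """
    verdicts = [ask_model(p.format(proof=proof)) for p in PROMPTS]
    return Counter(verdicts).most_common(1)[0][0]
```

Because each prompt probes the proof from a different angle, a single prompt's blind spot gets outvoted, which is one plausible mechanism behind the reported jump in self-consistency.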
Now, models like Qwen3.5-35B can hold their heads high, performing on par with top-tier models such as Gemini 3.1 Pro. This shows that with the right guidance, smaller models can punch above their weight.
Why This Matters
Here’s the million-dollar question: should we care? Absolutely. Democratizing AI capabilities means more people can access powerful tools without shelling out for frontier models. It’s about making new technology available to all. Because why should only the big players have all the fun?
As AI continues to integrate into everyday applications, the line between frontier and smaller models is blurring. If smaller models can verify proofs with precision, what's stopping them from tackling other complex tasks?
Key Terms Explained
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.
Prompt: The text input you give to an AI model to direct its behavior.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.
Reasoning models: AI systems specifically designed to "think" through problems step-by-step before giving an answer.