AlgoVeri: Unmasking AI's Vericoding Challenges

Vericoding, the generation of formally verified code from precise specifications, promises more reliable software. Yet a unified evaluation for the task has been elusive. Enter AlgoVeri, a benchmark that aims to standardize how AI models are tested across verification languages such as Dafny, Verus, and Lean.

Revealing Performance Gaps

AlgoVeri evaluates vericoding on 77 classical algorithms, using identical functional contracts in every language. The results are telling. In Dafny, which allows high-level abstractions and leverages SMT automation, models like Gemini-3 Flash achieve a success rate of 40.3%. In Verus, which imposes more stringent memory constraints, performance falls to 24.7%. In Lean, which demands explicit proof construction, the rate drops further to 7.8%.
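To make the percentages concrete, here is a minimal Python sketch that converts each reported rate into the nearest whole number of solved problems. It assumes each rate is a fraction of the 77 problems for a single model and a single run, which the figures are consistent with but which is not stated here; the function name `implied_solved` is invented for illustration.

```python
TOTAL_PROBLEMS = 77  # benchmark size stated in the article

# Success rates (percent) as reported in the article.
reported_rates = {"Dafny": 40.3, "Verus": 24.7, "Lean": 7.8}


def implied_solved(rate_percent: float, total: int = TOTAL_PROBLEMS) -> int:
    """Nearest whole number of problems consistent with a percentage.

    Assumption: the rate is solved / total for one model on one run.
    """
    return round(rate_percent / 100 * total)


implied = {lang: implied_solved(rate) for lang, rate in reported_rates.items()}

for lang, solved in implied.items():
    # Recompute the percentage from the whole-number count as a sanity check.
    recomputed = round(100 * solved / TOTAL_PROBLEMS, 1)
    print(f"{lang}: about {solved}/{TOTAL_PROBLEMS} problems ({recomputed}%)")
```

Under that assumption, the gap between Dafny and Lean is a gap of a few dozen problems, so a handful of additional verified solutions would move any of these percentages noticeably.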

So, what does this disparity mean? It highlights critical capability gaps in current models. It's one thing to succeed in a high-level language. It's quite another to meet the rigorous demands of systems-level coding or explicit proof construction. The architecture matters more than the parameter count here.

Dissecting Compute Dynamics

Beyond raw numbers, AlgoVeri shows a stark difference in how models use test-time compute. Gemini-3 excels through iterative repair, effectively tripling its pass rate in Dafny. In contrast, GPT-OSS saturates early and fails to improve past the initial rounds. This suggests some models possess greater inherent flexibility, a trait invaluable for tackling complex verification tasks.
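For readers unfamiliar with iterative repair, the general idea can be sketched as a generate, verify, and retry loop. The Python below is an illustration only, not AlgoVeri's harness: the `generate` and `verify` interfaces, the `repair_loop` function, and the toy stand-ins are all hypothetical, and a real setup would call a model and a verifier such as Dafny, Verus, or Lean in their place.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces, not taken from AlgoVeri:
#   generate(spec, feedback) -> candidate program text
#   verify(candidate)        -> (ok, error_message)
Generate = Callable[[str, Optional[str]], str]
Verify = Callable[[str], Tuple[bool, str]]


def repair_loop(
    spec: str, generate: Generate, verify: Verify, max_rounds: int = 5
) -> Tuple[Optional[str], int]:
    """Return (verified candidate or None, number of rounds used)."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(spec, feedback)
        ok, error = verify(candidate)
        if ok:
            return candidate, round_no
        feedback = error  # the verifier's message drives the next attempt
    return None, max_rounds


# Toy stand-ins so the sketch runs without a model or a real verifier.
def toy_generate(spec: str, feedback: Optional[str]) -> str:
    # Adds an "invariant" line only after the verifier has complained.
    return spec + ("\ninvariant" if feedback else "")


def toy_verify(candidate: str) -> Tuple[bool, str]:
    if "invariant" in candidate:
        return True, ""
    return False, "loop invariant missing"


result, rounds = repair_loop("method Sum()", toy_generate, toy_verify)
print(rounds)
```

In this framing, a model that benefits from repair keeps turning verifier feedback into new passing candidates as rounds accumulate, while a model that saturates early stops improving however large `max_rounds` is.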

Are we overvaluing parameter counts over smarter architectural designs? These test-time dynamics suggest we may be: a model's ability to adapt and refine its output during inference appears essential.

Error Analysis: Language Design Matters

AlgoVeri's error analysis points to a further factor: language design significantly affects the refinement trajectory. Dafny lets models zero in on logical correctness without unnecessary detours. In contrast, Verus and Lean often trap models in loops of syntactic and semantic errors, hindering progress.

Here's what the benchmarks actually show: models must strike a balance between language-specific quirks and the broader goal of logical accuracy. It raises a pointed question: shouldn't we prioritize designing languages that enable rather than hinder AI-driven vericoding?

In sum, AlgoVeri doesn't just offer a new benchmark. It challenges AI developers to rethink priorities. Stripping away marketing hype reveals the need for smarter, more adaptable AI systems that can tackle diverse coding environments.

To explore the data and evaluation tools, visit the AlgoVeri repository on GitHub.