BenchEvolver: Transforming Language Model Evaluation

Large language models are advancing fast enough to saturate the benchmarks built to measure them. When models breeze through existing datasets, it becomes hard to assess their real capabilities. On LiveCodeBench, for instance, frontier models score over 99% Pass@1 on the easier problems and average above 90% across all difficulty levels. Harder datasets are clearly needed, but creating them typically demands significant human effort.

Introducing BenchEvolver

Enter BenchEvolver. This evolutionary framework takes a novel approach by evolving existing coding problems into trickier versions. Instead of generating problems from scratch, it transforms reference solutions with structured changes and crafts new problems and tests from these evolved solutions. This process, grounded in executable semantics, allows for scalable creation of diverse, high-quality tasks that maintain verifiable correctness.
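To make "grounded in executable semantics" concrete, here is a minimal toy sketch of the idea, not BenchEvolver's actual implementation. It assumes a seed task (maximum element of a list) whose reference solution has been evolved by hand into a harder one (maximum contiguous subarray sum). The functions `seed_solution`, `evolved_solution`, `derive_tests`, and `passes` are all hypothetical names chosen for illustration. The point is that expected outputs come from running the evolved reference solution, so the new tests are correct by construction.

```python
from typing import Callable, List, Tuple

Solution = Callable[[List[int]], int]
TestCase = Tuple[List[int], int]


def seed_solution(nums: List[int]) -> int:
    """Hypothetical seed task: return the largest element of a non-empty list."""
    return max(nums)


def evolved_solution(nums: List[int]) -> int:
    """Hypothetical evolved task: largest sum of any non-empty contiguous subarray."""
    best = current = nums[0]
    for x in nums[1:]:
        # Either extend the running subarray or start a new one at x.
        current = max(x, current + x)
        best = max(best, current)
    return best


def derive_tests(solution: Solution, inputs: List[List[int]]) -> List[TestCase]:
    """Build test cases by executing the reference solution on each input."""
    return [(list(inp), solution(list(inp))) for inp in inputs]


def passes(candidate: Solution, tests: List[TestCase]) -> bool:
    """Check a candidate solution against every derived test case."""
    return all(candidate(list(inp)) == expected for inp, expected in tests)


inputs = [[1, -2, 3, 4], [-5, -1, -3], [2, 2, -1, 5]]
tests = derive_tests(evolved_solution, inputs)

print(passes(evolved_solution, tests))  # the evolved reference passes its own tests
print(passes(seed_solution, tests))     # the seed solution no longer suffices
```

In a real pipeline the structured change and the new problem statement would presumably be produced by a model rather than written by hand; only the test-derivation step is shown faithfully here.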

Applying BenchEvolver to LiveCodeBench and SciCode produced tasks that are notably harder yet still valid and diverse. Importantly, even the models that help generate these tasks find them challenging, which suggests the framework could support continuous self-improvement in training.

Push Beyond the Limits

On LiveCodeBench-Plus, a benchmark of 91 evolved, difficult tasks, Pass@1 scores drop dramatically: from near-saturated levels, they now range between 27.5% and 62.6%. That spread restores much-needed discriminative power among strong coding models. It's a wake-up call for models that were cruising through assessments. How can we truly say a model is top-tier if it aces an outdated test?
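For readers unfamiliar with the metric, Pass@1 is the k = 1 case of pass@k: the probability that at least one of k sampled solutions passes all tests. The sketch below uses the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), where n samples are drawn per task and c of them are correct. The article does not say how many samples were drawn per task, so the per-task counts in the example are hypothetical.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, where each entry is (samples, correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-task results: (samples drawn, samples that passed all tests).
results = [(10, 3), (10, 0), (10, 10)]
print(f"Pass@1 = {mean_pass_at_k(results, k=1):.3f}")
```

For k = 1 the estimator reduces to c / n, so a benchmark-level Pass@1 is simply the mean fraction of correct samples per task.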

Training models like gpt-oss-20b on these evolved tasks yields impressive performance boosts. A combined seed and evolved training approach delivers +8.7 and +8.3 Pass@1 improvements on the hardest challenges, surpassing seed-only training gains by 70.7% and 34.8%, respectively.
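The two kinds of numbers here are easy to conflate: the +8.7 and +8.3 figures are absolute Pass@1 point gains, while 70.7% and 34.8% read most naturally as relative increases over the seed-only gain. The sketch below shows that arithmetic under exactly that assumption. The seed-only gains are not reported in this article, so the values it prints are back-of-envelope implications, not published results, and rounding in the source figures may shift them slightly.

```python
def relative_increase(combined_gain: float, seed_only_gain: float) -> float:
    """Relative increase of the combined gain over the seed-only gain, in percent."""
    return (combined_gain - seed_only_gain) / seed_only_gain * 100.0


def implied_seed_only_gain(combined_gain: float, relative_pct: float) -> float:
    """Invert relative_increase: the seed-only gain implied by a relative percentage."""
    return combined_gain / (1.0 + relative_pct / 100.0)


# Figures quoted in the article: (combined gain in Pass@1 points, relative increase in %).
reported = [(8.7, 70.7), (8.3, 34.8)]

for combined, pct in reported:
    seed_only = implied_seed_only_gain(combined, pct)
    # Implied values only (assumes the percentages are relative to seed-only gains).
    print(f"combined +{combined} -> implied seed-only gain of about +{seed_only:.1f}")
```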

Why It Matters

BenchEvolver isn't just about making things harder. It's about pushing the boundaries of what language models can achieve in a meaningful way. By converting saturated benchmarks into frontiers of evaluation, we're ensuring that models are tested against relevant, challenging tasks. Code and data are available for those who want to explore further.

So, is this the dawn of a new era where AI models are truly put to the test, or just another cycle of catch-up? Only the next wave of AI development will tell. But one thing is clear: BenchEvolver raises the bar for how we evaluate AI's real-world problem-solving capabilities.
