CoEval: Rethinking How We Evaluate Language Models

Evaluating language models without task-specific labeled data has always been a challenge. But the real kicker? Many public benchmarks have likely leaked into models during pretraining, so scores may reflect memorization rather than ability. Enter CoEval, a framework that aims to change how we choose our AI tools.

What's CoEval?

CoEval is an open-source framework designed to create custom benchmarks from scratch. It works from task descriptions and domain specifics, with no human labels. The beauty lies in its ability to generate new items on each run, so the test items cannot already be sitting in a model's pretraining data. No more relying on old benchmarks with their inherent biases.

And CoEval doesn't stop at benchmark generation. It employs a cross-family judge ensemble to rank language models without human raters. Validated against ground-truth data, CoEval reports a ranking accuracy of 86%. That's a serious claim, but who benefits?

The Power Shift

This isn't just about performance. It's a story about power. By making evaluation cheap and accessible, CoEval democratizes AI model ranking. Anyone can regenerate a custom leaderboard for their specific application, and the reported price tag is just $5.89 for 7,978 evaluations. It's practically peanuts.
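To put that price in perspective, here is a quick back-of-the-envelope calculation using only the two figures quoted above. The function name and the 1,000-evaluation projection are illustrative assumptions, and the projection simply assumes the same average rate holds for other runs.

```python
# Figures quoted in the article.
TOTAL_COST_USD = 5.89
NUM_EVALUATIONS = 7_978


def cost_per_evaluation(total_cost: float, num_evaluations: int) -> float:
    """Return the average cost of one evaluation in US dollars."""
    if num_evaluations <= 0:
        raise ValueError("num_evaluations must be positive")
    return total_cost / num_evaluations


unit_cost = cost_per_evaluation(TOTAL_COST_USD, NUM_EVALUATIONS)

# Hypothetical projection: 1,000 evaluations at the same average rate.
projected_1k = unit_cost * 1_000

print(f"per evaluation: ${unit_cost:.5f}")
print(f"per 1,000 evaluations: ${projected_1k:.2f}")
```

That works out to well under a tenth of a cent per evaluation.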

However, one has to ask: will democratizing benchmarks change the power dynamics in AI development? Or will the big players still dominate because they have the resources to build better models, regardless of how accessible evaluation becomes?

Breaking Down the Numbers

In a study covering four tasks, CoEval produced 7,978 evaluations for just under $6, which is less than a tenth of a cent per evaluation. And with zero verbatim 13-gram overlap with major benchmarks, the generated items show no sign of having been copied from existing test sets.
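For readers unfamiliar with the term, a 13-gram is a run of 13 consecutive tokens, and a verbatim overlap check asks whether any such run appears in both the generated items and an existing benchmark. The sketch below shows the idea. It assumes lowercased, whitespace-split words as tokens, which may not match how CoEval tokenizes, and the function names and sample strings are made up for illustration.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int = 13) -> Set[NGram]:
    """Return the set of n-grams over lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    # Texts shorter than n tokens produce an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(
    generated: Iterable[str], benchmark: Iterable[str], n: int = 13
) -> Set[NGram]:
    """Return every n-gram that appears in both collections of items."""
    benchmark_ngrams: Set[NGram] = set()
    for item in benchmark:
        benchmark_ngrams |= word_ngrams(item, n)

    shared: Set[NGram] = set()
    for item in generated:
        shared |= word_ngrams(item, n) & benchmark_ngrams
    return shared


# Made-up example data, not taken from any real benchmark.
benchmark_items = [
    "the quick brown fox jumps over the lazy dog while the cat sleeps soundly nearby"
]
copied_item = "Note: " + benchmark_items[0]
fresh_item = "a short unrelated question about tax law"

print(len(verbatim_overlap([copied_item], benchmark_items)))
print(len(verbatim_overlap([fresh_item], benchmark_items)))
```

A count of zero only rules out long verbatim copies. Paraphrased or lightly edited items would pass this check, so it is a floor on contamination testing, not a ceiling.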

But there's a catch. A single judge in the panel can be anti-correlated with ground truth, even though the ensemble as a whole never was in the study. This suggests that diversity in the judging panel, not its size, is key to reliable results.
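A toy example makes the ensemble effect concrete. The scores below are invented, and averaging the judges is an assumed aggregation rule, not necessarily how CoEval combines its panel. The point is only that one judge can disagree with ground truth while the panel average still tracks it.

```python
from math import sqrt
from statistics import mean
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Invented quality scores for five models (higher is better).
ground_truth = [1, 2, 3, 4, 5]

# Invented judge scores; judge_c tends to prefer the weaker models.
judges = {
    "judge_a": [1.2, 2.1, 2.9, 4.2, 4.8],
    "judge_b": [2, 1, 3, 5, 4],
    "judge_c": [4, 3, 3, 2, 3],
}

# Assumed aggregation: average the judges' scores for each model.
ensemble = [mean(scores) for scores in zip(*judges.values())]

for name, scores in judges.items():
    print(f"{name}: r = {pearson(scores, ground_truth):+.2f}")
print(f"ensemble: r = {pearson(ensemble, ground_truth):+.2f}")
```

Here judge_c correlates negatively with ground truth, yet the averaged panel remains positively correlated because the other two judges outweigh it. This is an illustration with hand-picked numbers, not evidence for the study's result.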

This approach could signal a shift in how we think about AI development. Instead of relying on potentially biased or outdated benchmarks, teams can now tailor evaluations specific to their needs.

The Road Ahead

As AI continues to evolve, frameworks like CoEval might just level the playing field. But it also raises questions about how we measure success in AI. Are we really aiming for models that understand our tasks, or just ones that score well on tests?

A benchmark score alone doesn't capture what matters most. It's time to look closer at what we prioritize in AI development. And always ask who funded the study.