AI Leaderboards: Trust Issues and The Real Decision Makers

Leaderboards have become a staple of the AI world and the go-to method for ranking large language models (LLMs). Despite that popularity, many researchers view these rankings with skepticism. It's a classic case of trust issues. So why do researchers still rely on them?

The Skeptic's Paradox

A recent analysis, drawing on insights from eight researchers across four computer science subfields, uncovers a paradox: researchers doubt leaderboard accuracy, yet they still lean on leaderboards for quick decisions. Even so, the real driver of model selection seems to be peer networks, not the leaderboards themselves. Human input, it appears, still trumps algorithmic rankings.

Interestingly, when asked about their preferences, researchers showed a clear inclination towards arena-based leaderboards, where human votes drive the results, over static benchmark leaderboards.

The Culture Divide

Not all disciplines feel the leaderboard's pull equally. For instance, Natural Language Processing (NLP) researchers face pressure to compare their work against state-of-the-art models. Meanwhile, researchers in Human-Computer Interaction (HCI) and Systems/Privacy seem to feel little such pressure. This highlights how disciplinary culture, rather than individual attitudes, shapes engagement with leaderboards.

Amid these differences, one unifying call emerged: cost transparency. Seven of the eight researchers asked for it. In a field where resources can dictate success, knowing the cost upfront isn't just helpful; it's essential.

Redesigning the Future

So, what's the fix? If leaderboards are going to stick around, they need a makeover to better serve researchers. Think task-specific score breakdowns, cost figures shown alongside scores, and disclosure of voter demographics. This isn't just about transparency; it's about aligning leaderboards with how researchers actually use them.

But here's the real question: should we just scrap leaderboards altogether if they're so distrusted? Maybe not. But if AI's evolution is going to be guided by more than flashy rankings, it's time to focus on utility over illusion. Researchers are still building, and they're asking for better tools.
