CricBench: The Unseen Challenges of Cricket Analytics for AI

Cricket, a sport beloved by billions, has embraced advanced analytics, yet it remains an enigma for AI models. Despite the strides large language models (LLMs) have made on Text-to-SQL tasks, the peculiar challenges of cricket analytics have flown under the radar. Enter CricBench, a benchmark suite that seeks to illuminate these blind spots.

The CricBench Challenge

CricBench evaluates the SQL generation prowess of LLMs on cricket data across the Test, ODI, T20I, and IPL formats. With a meticulously curated dataset of 2,654 instances spanning English, Hindi, Punjabi, and Telugu, it presents a rigorous challenge. The evaluation of models like GPT-5 Mini, Claude Sonnet 4, and Qwen 235B reveals a fragmented landscape in which no single model leads everywhere. GPT-5 Mini comes out ahead in Test cricket, scoring 12.4% on the benchmark's DMA metric, while Qwen 235B takes the lead in IPL and T20I with 28.7% and 17.5% respectively. Yet the models stumble when faced with the complexities of ODI queries, scoring 0%.

A Deep Divide

The findings of CricBench expose a stark disconnect between the syntactic validity of the models' outputs and their semantic accuracy. Execution accuracy exceeds 98%, meaning the generated SQL almost always runs, but semantic correctness languishes below 29%. Compared with existing benchmarks like BIRD, that amounts to a domain gap of 37 to 55 percentage points. The gap underscores how superficially these models understand cricket's nuanced, domain-specific requirements.
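To make that divide concrete, here is a minimal sketch in Python of how a query can execute cleanly and still give the wrong answer. Everything in it is an assumption made for illustration: the `innings` table, the player name, the two queries, and the idea of checking correctness by comparing result rows. It is not CricBench's schema or scoring code. The error shown is one plausible kind of domain mistake, not one reported by the benchmark: a batting average is runs divided by dismissals, not runs divided by innings played.

```python
import sqlite3

# Hypothetical toy schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE innings (batter TEXT, format TEXT, runs INTEGER, dismissed INTEGER);
    INSERT INTO innings VALUES
        ('A Kumar', 'ODI', 50, 1),
        ('A Kumar', 'ODI', 30, 0),
        ('A Kumar', 'T20I', 40, 1);
""")

def run(sql):
    """Return (executed_ok, sorted rows); rows is None when the SQL fails."""
    try:
        return True, sorted(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return False, None

# Reference query: batting average is runs divided by dismissals.
gold = """SELECT SUM(runs) * 1.0 / SUM(dismissed) FROM innings
          WHERE batter = 'A Kumar' AND format = 'ODI'"""
# A plausible model output: valid SQL, but it averages over innings played.
predicted = """SELECT AVG(runs) FROM innings
               WHERE batter = 'A Kumar' AND format = 'ODI'"""

gold_ok, gold_rows = run(gold)
pred_ok, pred_rows = run(predicted)

executes = pred_ok                            # counts toward execution accuracy
matches = pred_ok and pred_rows == gold_rows  # counts toward semantic correctness
print(executes, matches)  # prints: True False
```

The predicted query would count as a success on an execution-style metric and as a failure on a result-matching one, which is the pattern the numbers above describe at scale.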

Why This Matters

So why should anyone outside the cricket fandom care? Because this isn't just about cricket. It's about how AI performs in niche domains more broadly. If AI models struggle with the intricacies of cricket, what does that say about their ability to handle other specialized fields? This isn't just a failing of the models but a sign that we need better methodologies and a richer understanding of domain-specific challenges.

Color me skeptical, but the industry's tendency to tout exaggerated capabilities of AI without acknowledging its limitations can no longer be ignored. As we push AI to new frontiers, we must confront its shortcomings head-on. Otherwise, we risk overfitting our expectations to the hype rather than the reality.

In the end, CricBench is more than a benchmark; it's a wake-up call. For AI to truly excel, it must do more than produce queries that run. It must grasp the essence of the domain it seeks to navigate. Until then, the promise of AI in specialized fields will remain just that: a promise unfulfilled.