UrduMMLU: A Milestone for Multilingual AI Evaluation

In a vast, multilingual world, AI models often stumble over languages that aren't as widely supported, like Urdu. Enter UrduMMLU, a benchmark built specifically for evaluating models in Urdu. With over 230 million Urdu speakers, this initiative offers a much-needed resource for testing AI capabilities in a native context, rather than relying on translations.

Why UrduMMLU Matters

Here's the gist: UrduMMLU isn't just a run-of-the-mill benchmark. It's built from 26,431 multiple-choice questions across 26 subjects. These aren't just any questions: they come from native Urdu MCQ banks and public examination PDFs. This distinction is key because previous benchmarks often miss the nuance and depth of region-specific content.

Unlike its predecessors, which often relied on translations, UrduMMLU covers everything from standard academic subjects to uniquely Urdu-centric topics. This means AI models are truly tested on their understanding of the language's rich context.

Performance Highlights

So, how do the AI models stack up? Among 30 language models evaluated, Gemini-3.5-Flash emerged as the frontrunner, with accuracy of over 90% under both English and Urdu prompts. That's nothing to sneeze at. The strongest open-source model, however, trailed it by roughly 8 percentage points.

Here's a curious finding: when these models were tested on Urdu-centered Humanities subjects, many saw their scores plummet by 25 to 40 points compared to STEM subjects. It poses a tough question: are our models biased towards technically structured content?
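To make that kind of gap concrete, here is a minimal sketch of how a per-category accuracy gap could be computed. The function name, the record layout, and the toy records are all assumptions made for illustration; they are not UrduMMLU data and not the benchmark's actual evaluation code.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy in percent} for (group, predicted, gold) records.

    Hypothetical helper for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, gold in records:
        total[group] += 1
        if predicted == gold:
            correct[group] += 1
    return {group: 100.0 * correct[group] / total[group] for group in total}


# Toy records invented for this sketch (NOT real benchmark results).
# Each record is (subject group, model's answer, correct answer).
toy_records = [
    ("STEM", "A", "A"),
    ("STEM", "B", "B"),
    ("STEM", "C", "C"),
    ("STEM", "D", "A"),
    ("Humanities", "A", "A"),
    ("Humanities", "B", "C"),
    ("Humanities", "C", "D"),
    ("Humanities", "D", "D"),
]

scores = accuracy_by_group(toy_records)
# The gap is a difference of two percentages, so it is in percentage points.
gap_points = scores["STEM"] - scores["Humanities"]
print(scores, gap_points)
```

A gap reported in "points" this way is a simple difference between two accuracy percentages, not a relative change.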

The Road Ahead

The results from few-shot prompts, where models are given a few examples before being tested, showed only modest improvements. It's a clear indication that while AI models have made strides, there's a significant journey ahead in truly mastering multilingual capabilities.
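For readers unfamiliar with the setup, a few-shot prompt for a multiple-choice question might be assembled roughly like this. It's only a sketch: the prompt template, the function names, and the placeholder questions are assumptions, not the format the UrduMMLU authors used.

```python
def format_question(question, options, answer=None):
    """Render one multiple-choice item; leave the answer blank if not given."""
    letters = "ABCD"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {option}" for i, option in enumerate(options)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(examples, target):
    """Join solved examples, then append the unsolved target question.

    examples: list of (question, options, answer_letter) tuples
    target:   (question, options) tuple the model should answer
    """
    blocks = [format_question(q, opts, ans) for q, opts, ans in examples]
    blocks.append(format_question(target[0], target[1]))
    return "\n\n".join(blocks)


# Placeholder items invented for illustration (not benchmark questions).
examples = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which shape has three sides?", ["Square", "Circle", "Triangle", "Hexagon"], "C"),
]
target = ("What is 3 + 3?", ["5", "6", "7", "8"])

prompt = build_few_shot_prompt(examples, target)
print(prompt)
```

The model sees the solved examples first and is expected to continue the pattern by filling in the final blank answer. A zero-shot prompt is the same thing with an empty `examples` list.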

In plain English, UrduMMLU is a wake-up call. It tells us that while AI models are advancing, they still have a blind spot for regional and culturally specific knowledge. This isn't just about languages. It's about ensuring AI technologies serve a global audience effectively, respecting and understanding the diverse fabric of our world.

Bottom line: if AI is to be truly global, it must go beyond the dominant languages and delve into the intricacies of others like Urdu. It's not just an academic exercise. It's a necessity for equitable technological advancement.