LLMs Face Off in the Islamic Knowledge Arena
IslamicMMLU sets a new benchmark challenging language models in Islamic disciplines. The results? A wild ride, with Quran-track accuracy running from 32.4% to 99.3%.
JUST IN: Large language models (LLMs) are getting a litmus test in Islamic knowledge. Introducing IslamicMMLU, a fresh benchmark packing a punch with 10,013 multiple-choice questions. These aren't just any questions. They're split into three heavyweight tracks: Quran, Hadith, and Fiqh, each demanding its own set of skills.
The Contenders
We're talking 26 LLMs under the microscope here. Their performance? It's all over the map. Accuracy swings from a lowly 39.8% to a jaw-dropping 93.8%, with Gemini 3 Flash leading the pack. And just like that, the leaderboard shifts.
But let's zoom in. The Quran track shows the most dramatic spread, with accuracy ranging from 32.4% to 99.3%. It's like watching a sports league where some teams are crushing it while others can't find their footing. Meanwhile, the Fiqh track isn't just about scores. It's also testing for biases across different Islamic schools of thought, a novel twist that reveals which models might be leaning a little too hard on certain interpretations.
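For a feel of what those numbers mean, here's a minimal sketch of per-track accuracy on multiple-choice answers, plus the best-to-worst gap. The records are made-up toy rows, not benchmark data, and the function names `accuracy_by_track` and `spread` are hypothetical; this is not the benchmark's own scoring code.

```python
from collections import defaultdict

# Hypothetical toy rows, NOT benchmark data: (track, gold answer, model answer).
records = [
    ("quran", "A", "A"),
    ("quran", "C", "C"),
    ("hadith", "B", "D"),
    ("hadith", "A", "A"),
    ("fiqh", "D", "D"),
    ("fiqh", "B", "A"),
]


def accuracy_by_track(rows):
    """Return {track: share of rows where the model answer matches the gold answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for track, gold, predicted in rows:
        total[track] += 1
        correct[track] += int(gold == predicted)
    return {track: correct[track] / total[track] for track in total}


def spread(scores):
    """Gap between the best and the worst score."""
    values = list(scores)
    return max(values) - min(values)


per_track = accuracy_by_track(records)
print(per_track)

# Gaps computed from the percentages reported in this article.
print(round(spread([99.3, 32.4]), 1))  # Quran track, best vs. worst model
print(round(spread([93.8, 39.8]), 1))  # headline accuracy, best vs. worst model
```

Run on the article's figures, the Quran track gap comes out at 66.9 points against 54.0 for the headline range, which is why it counts as the most dramatic spread.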
Arabic Models: Mixed Bag
What about the Arabic-specific models? Here's the kicker: they aren't living up to the hype. Despite being tailored for the language, they're lagging behind the frontier models. It's a stark reminder that sometimes specialization doesn't guarantee supremacy.
So, why does this matter? Because in a world increasingly relying on AI for religious and scholarly advice, understanding where these models stand is essential. Are they really ready to be trusted with such delicate subjects?
Final Thoughts
The real question is, how will the labs respond? With the public leaderboard now available, the pressure's on to refine these models. Because right now, the inconsistencies are too big to ignore. How the labs close those gaps will shape AI's role in handling complex cultural and religious information.