New Greek Benchmark Challenges Language Models
GreekMMLU introduces a comprehensive benchmark for Greek language models, highlighting significant gaps in current LLM performance. The dataset aims to enhance evaluation and model adaptation.
Large Language Models (LLMs) have long been trained on multilingual datasets, but Greek often gets shortchanged. The introduction of GreekMMLU, a benchmark dedicated to Greek, aims to change that narrative. This benchmark consists of 21,805 multiple-choice questions across 45 subjects. Importantly, these questions are native-sourced, capturing the authentic linguistic and cultural nuances of Greek, unlike the typical machine-translated datasets.
The Need for Authentic Evaluation
Why does this matter? Existing Greek datasets often miss the mark, as they’re usually translated from English, failing to capture the intricacies of Greek language and culture. GreekMMLU fills this gap by sourcing questions from academic, professional, and governmental exams entirely in Greek. This approach ensures a solid, contamination-resistant evaluation platform, essential for assessing the true capabilities of LLMs in understanding Greek.
The benchmark is divided into 45 subject areas, with questions annotated by educational difficulty level, ranging from primary school to professional exams. Of the 21,805 questions, 16,857 are publicly available, while 4,948 are reserved for a private leaderboard, a split intended to support both open research and contamination-controlled evaluation.
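To make the evaluation setup concrete, here is a minimal sketch of how a multiple-choice benchmark like this one can be scored. The record layout, field names, and Greek option letters are illustrative assumptions, not the dataset's actual schema.

```python
def format_prompt(item):
    """Render one multiple-choice item as a zero-shot prompt (illustrative layout)."""
    letters = "ΑΒΓΔ"  # Greek option letters, as a native-sourced set might use
    lines = [item["question"]]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(item["choices"])]
    lines.append("Απάντηση:")  # "Answer:"
    return "\n".join(lines)

def accuracy(items, predict):
    """Fraction of items where predict(prompt) matches the gold answer letter."""
    correct = sum(predict(format_prompt(it)) == it["answer"] for it in items)
    return correct / len(items)

# Toy example with a stub "model" that always answers Α:
items = [
    {"question": "Ποια είναι η πρωτεύουσα της Ελλάδας;",
     "choices": ["Αθήνα", "Θεσσαλονίκη", "Πάτρα", "Ηράκλειο"],
     "answer": "Α"},
    {"question": "2 + 2 = ;",
     "choices": ["3", "4", "5", "6"],
     "answer": "Β"},
]
print(accuracy(items, lambda prompt: "Α"))  # one of two correct → 0.5
```

In a real harness, `predict` would query an LLM and parse the chosen letter from its completion; the scoring logic stays the same.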
Performance Gaps and Insights
Evaluations of over 80 LLMs reveal significant performance discrepancies, with stark gaps between state-of-the-art frontier models and open-weight models. More tellingly, Greek-adapted models perform distinctly better than their general multilingual counterparts. What does this indicate? Language-specific adaptation plays a critical role in model performance.
The paper's key contribution is a systematic analysis of the factors that influence model performance, including model scale, language adaptation, and prompting strategy. Such detailed insights are essential for developing models that can truly comprehend Greek text.
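Prompting strategy typically means varying how many solved examples precede the target question. The sketch below assembles a k-shot prompt under the same assumed record layout as above; the field names and option letters are illustrative, not the benchmark's actual format.

```python
def mcq_block(item, with_answer):
    """Render one item, optionally revealing its gold answer (for in-context examples)."""
    letters = "ΑΒΓΔ"
    lines = [item["question"]]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(item["choices"])]
    lines.append(f"Απάντηση: {item['answer']}" if with_answer else "Απάντηση:")
    return "\n".join(lines)

def k_shot_prompt(shots, target, k=2):
    """Prepend k solved examples before the unanswered target question."""
    parts = [mcq_block(s, with_answer=True) for s in shots[:k]]
    parts.append(mcq_block(target, with_answer=False))
    return "\n\n".join(parts)

shots = [
    {"question": "2 + 2 = ;", "choices": ["3", "4", "5", "6"], "answer": "Β"},
    {"question": "3 × 3 = ;", "choices": ["6", "8", "9", "12"], "answer": "Γ"},
]
target = {"question": "5 − 1 = ;", "choices": ["3", "4", "5", "6"],
          "answer": "Β"}
prompt = k_shot_prompt(shots, target, k=2)
```

Comparing accuracy at k = 0, 1, 5, and so on is one standard way to isolate the effect of prompting from model scale and adaptation.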
Looking Ahead
So, where do we go from here? GreekMMLU sets a precedent for creating language-specific benchmarks essential for fostering LLM development in underrepresented languages. It raises a pertinent question: Shouldn’t more languages receive similar dedicated resources? As LLMs continue to evolve, it’s imperative to ensure they’re trained and evaluated on datasets that truly reflect linguistic diversity.
Code and data are available at the project’s repository, encouraging further exploration and adaptation. For those invested in the future of language models, GreekMMLU is a call to action to prioritize authentic, native-sourced datasets.