Language Matters: How AI Evaluations Shift with Linguistic Changes
New research shows that AI performance evaluations vary widely across languages: no single model excels universally, challenging the reliance on fixed English-language benchmarks.
Language's role in AI evaluation is often overlooked. A recent study challenges this norm, showing that performance rankings of AI models can flip when the evaluation language changes. The paper's key contribution: demonstrating that AI performance isn't universally consistent across languages.
Varied Performance Across Languages
Researchers evaluated AI models in five diverse languages: English, Arabic, Turkish, Chinese, and Hindi, across 55 development tasks. They tested three AI frameworks and six judge backbones, yielding 4,950 judge runs in total (5 languages × 55 tasks × 3 frameworks × 6 backbones). The finding was clear: language significantly affects AI evaluation outcomes.
For English, GPT-4o emerged as the top performer with a satisfaction rating of 44.72%. However, this superiority didn't hold in other languages. Gemini outperformed GPT-4o in Arabic with 51.72% satisfaction and in Hindi with 53.22%. These results suggest that no single model is dominant across all languages.
The Role of Localization
More intriguingly, the study highlights the critical role of localized instructions. When instructions were only partially localized, satisfaction in Hindi dropped sharply from 42.8% to 23.2%. This echoes prior work in the localization field, underscoring the importance of fully adapting models and prompts to each language context.
Why should we care? The ablation study reveals the need for flexible benchmarks that account for linguistic diversity. Relying solely on English benchmarks is no longer viable in a global AI landscape. But are developers ready to embrace this complexity?
No Universal Backbone
The study also found only modest agreement among the different judge backbones on requirement judgments: Fleiss' kappa never exceeded 0.231, indicating substantial variance in how the models evaluate the same outputs. This challenges the idea of a one-size-fits-all judge backbone for AI development.
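For readers unfamiliar with the statistic: Fleiss' kappa measures chance-corrected agreement among a fixed number of raters over a set of items, where 1.0 means perfect agreement and values near 0 mean agreement no better than chance. A minimal sketch of the standard formula, using hypothetical pass/fail counts rather than the paper's data:

```python
from typing import List

def fleiss_kappa(ratings: List[List[int]]) -> float:
    """Fleiss' kappa for a subjects x categories count matrix.

    ratings[i][j] = number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters n.
    """
    N = len(ratings)        # number of subjects (items judged)
    n = sum(ratings[0])     # raters per subject
    k = len(ratings[0])     # number of categories

    # Per-subject observed agreement P_i
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N

    # Expected chance agreement P_e from marginal category proportions
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 4 requirements judged pass/fail by 6 judge backbones.
counts = [[5, 1], [3, 3], [4, 2], [2, 4]]
print(round(fleiss_kappa(counts), 3))
```

On this made-up data the kappa is close to zero, which is the kind of result the study's 0.231 ceiling points at: the judges overlap far less than their raw scores might suggest.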
Code and data are available in the project's repository, supporting transparency and reproducibility. The key finding is simple yet profound: language is an essential variable in AI evaluation, and ignoring it can skew perceptions of AI capabilities.