Are Machine Translations Good Enough for Multilingual AI?
New research shows LLMs excel in multilingual reasoning using both human and machine translations, but language choice affects cost and accuracy.
The introduction of mmPISA-bench could be a big deal in multilingual AI benchmarking. This high-quality test is based on the OECD's Programme for International Student Assessment (PISA) and packs a punch with 25 multiple-choice questions. These aren't just any questions. They demand reasoning to answer accurately.
Multilingual Tests: Human vs. Machine
The benchmark is impressive in its scope, offering questions in 43 languages with both human and machine translations. That's 2,150 data points in total. Why does this matter? Because accurate machine translations can open doors for broader AI testing without costly human translation efforts.
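The 2,150 figure follows directly from the numbers quoted above. A minimal sketch of the arithmetic, assuming every question appears once per language in each translation variant (the variable and function names are illustrative, not taken from the benchmark):

```python
# Counts quoted in the article: 25 questions, 43 languages,
# and two translation sources per language.
NUM_QUESTIONS = 25
NUM_LANGUAGES = 43
TRANSLATION_SOURCES = ("human", "machine")


def total_items(questions: int, languages: int, sources: tuple) -> int:
    """Number of question instances if every combination is present."""
    return questions * languages * len(sources)


total = total_items(NUM_QUESTIONS, NUM_LANGUAGES, TRANSLATION_SOURCES)
print(total)  # 2150
```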
Two leading LLMs were put to the test across these languages, and the results were compelling: the models demonstrated reasoning skills that matched those of human test-takers. But here's the catch: performance varied across languages.
Machine Translations Hold Their Ground
Do machine translations undermine accuracy? Surprisingly, no. The research showed that machine-translated questions didn't hurt accuracy compared to their human-translated counterparts. This suggests that high-quality machine translation can suffice for large-scale evaluations. That's big news for anyone concerned about the scalability of multilingual AI testing.
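One simple way to picture this comparison is to score the same questions in both translation variants and look at the difference in accuracy. The sketch below is an assumption about how such a check could be run, not the method used in the research, and the answer lists are made-up placeholders rather than results from the study:

```python
def accuracy(results: list) -> float:
    """Fraction of correct answers in a list of True/False outcomes."""
    return sum(results) / len(results)


def translation_gap(human_results: list, machine_results: list) -> float:
    """Machine-translated accuracy minus human-translated accuracy.

    A value close to zero suggests the translation source made little
    difference for that language.
    """
    return accuracy(machine_results) - accuracy(human_results)


# Hypothetical outcomes for one language on the 25 questions
# (illustrative only, NOT data from the study).
human = [True] * 22 + [False] * 3
machine = [True] * 21 + [False] * 4

gap = translation_gap(human, machine)
print(f"accuracy gap: {gap:+.2f}")
```

With only 25 questions per language, a gap of one or two questions is well within noise, so a comparison like this is more meaningful when pooled across many languages.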
The Cost of Language
Here's where it gets interesting. Language choice impacts not just accuracy but also cost. Some languages are more expensive in token usage and yet less accurate. This raises a key question: Is it worth investing in certain languages if the return on performance isn't there?
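To make the trade-off concrete, one option is to divide the tokens a model consumes in a language by the number of questions it answers correctly there. The following is a rough sketch under that assumption: `LanguageRun`, the language labels, and all the numbers are hypothetical and do not come from the research.

```python
from dataclasses import dataclass


@dataclass
class LanguageRun:
    """Hypothetical summary of one model run in one language."""
    language: str
    total_tokens: int   # tokens consumed across all questions
    correct: int        # questions answered correctly
    questions: int = 25

    @property
    def accuracy(self) -> float:
        return self.correct / self.questions

    @property
    def tokens_per_correct(self) -> float:
        # Lower is better; infinite if nothing was answered correctly.
        if self.correct == 0:
            return float("inf")
        return self.total_tokens / self.correct


# Made-up numbers for illustration only.
runs = [
    LanguageRun("lang_a", total_tokens=5000, correct=23),
    LanguageRun("lang_b", total_tokens=9000, correct=18),
]

# Rank languages from most to least cost-effective.
ranked = sorted(runs, key=lambda r: r.tokens_per_correct)
for run in ranked:
    print(f"{run.language}: accuracy={run.accuracy:.2f}, "
          f"tokens per correct answer={run.tokens_per_correct:.1f}")
```

A metric like this puts cost and accuracy on one scale, which is the comparison the question above is really asking for.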
Strip away the marketing and you get a clear picture. The architecture matters more than the parameter count. Yet, the cost-effectiveness of language choice can't be ignored. The reality is, as models continue to evolve, these factors will play a significant role in determining their utility and deployment.
This research challenges assumptions about the necessity of human translations in AI evaluations. While the findings highlight the robustness of LLMs in multilingual reasoning, they also underscore the need for strategic decisions around language choices. AI developers must weigh the benefits against the costs as they expand their models' reach globally.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Parameter: A value the model learns during training, specifically the weights and biases in neural network layers.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.
Token: The basic unit of text that language models work with.