Ancient Greek Texts: Machine Translations Falter on Technical Jargon
A study reveals large language models struggle with rare terms in Ancient Greek texts. Quality dips on pharmacological content.
Machine translation of Ancient Greek has caught the interest of AI researchers. Recent evaluations focus on how well large language models (LLMs) such as ChatGPT, Claude, and Gemini handle this task. Their performance on the technical prose of the 2nd-century Greek physician Galen presents a fascinating case study.
High Scores, Yet Room for Improvement
LLMs scored impressively on an expository text already available in English, achieving a mean Multidimensional Quality Metrics (MQM) score of 95.2 out of 100. This suggests these models are capable of handling texts with existing translations quite competently. However, the picture changes when these models encounter new material. Galen's pharmacological texts, never before translated, revealed a clear weakness. The mean score plummeted to 79.9.
Why should we care? The paper's key contribution shows that terminological density, especially of rare terms, predicts translation failure. LLMs struggle significantly with texts rich in uncommon terms. This isn't just an academic issue; it's a challenge for any attempt to automate translation in specialized fields.
The Problem with Rare Terms
Automated metrics fell short at judging passages whose translation quality varied widely. The ablation study reveals that LLMs fumbled catastrophically on two passages with extreme terminological density, while their translations of the other passages scored close to expository-text levels. The correlation between terminology rarity and failure was strikingly high, with a coefficient of -0.97. What does this mean for the future of AI translation?
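To make the -0.97 figure concrete: a correlation coefficient that strongly negative means that as the share of rare terms in a passage rises, its translation score falls almost in lockstep. The sketch below computes a Pearson correlation the same way; the rarity and score numbers are invented for illustration and are not the study's data.

```python
# Minimal sketch: Pearson correlation between per-passage terminology
# rarity and translation quality score. All data below is hypothetical,
# chosen only to show how a strong negative correlation arises.
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rarity = [0.05, 0.10, 0.15, 0.40, 0.55]  # share of rare terms per passage (invented)
scores = [94.0, 91.5, 89.0, 72.0, 63.5]  # quality scores per passage (invented)

print(pearson(rarity, scores))  # strongly negative: scores drop as rarity rises
```

A coefficient near -1 on even a handful of passages is what makes terminology rarity such a clean predictor of failure in the study's framing.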
It's clear that LLMs, while impressive, aren't infallible. Their reliability suffers when they are confronted with unfamiliar terminology. For academia and industries where precision is critical, relying solely on current LLM capabilities is risky.
A First in Ancient Language Translation
This study marks a pioneering effort in expert human evaluation of LLM translations without existing references. The fact that no automated metric could adequately judge high-quality translations should prompt further research. Is it time to rethink how we evaluate machine translations?
In the end, while LLMs have made great strides, the gap in translating specialized texts remains significant. The future may hold more refined models, but for now, specialized human expertise is still indispensable.