Lost in Translation: The Struggle of Vision-Language Models with Indian Languages
Vision-language models stumble when tackling Indian languages, exposing a gap in multilingual capabilities. Languages written in Dravidian scripts take the hardest hit.
Vision-language models have been making strides in complex reasoning tasks, but an essential gap has emerged. These models, often celebrated for their prowess in mathematical and scientific reasoning, falter when tasked with Indian languages. While English-centric evaluations show promising results, the real test lies in cross-lingual capability.
Testing Multilingual Limits
The first extensive audit of vision-language models in Indian languages reveals some stark findings. By translating 980 questions from MathVista, ScienceQA, and MMMU into six Indian languages using IndicTrans2, researchers unearthed a significant drop in accuracy. Despite impressive inter-translator agreement scores ranging from 0.79 to 0.84, the evaluation of eight vision-language models across these languages showed alarming results. The drop in accuracy varied from 9.8 to 25 percentage points compared to English, with Dravidian languages like Tamil and Kannada facing the harshest decline.
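The headline metric here is the drop in percentage points relative to the English baseline. A minimal sketch of that calculation, using illustrative accuracy figures rather than the paper's actual per-model numbers:

```python
def accuracy_drop_pp(english_acc: float, target_acc: float) -> float:
    """Drop in percentage points relative to the English baseline."""
    return round((english_acc - target_acc) * 100, 1)

# Hypothetical per-language accuracies for one model on the 980-item set.
scores = {
    "English": 0.62,
    "Hindi": 0.52,
    "Bengali": 0.48,
    "Tamil": 0.40,    # Dravidian languages show the steepest declines
    "Kannada": 0.38,
}

baseline = scores["English"]
drops = {lang: accuracy_drop_pp(baseline, acc)
         for lang, acc in scores.items() if lang != "English"}
print(drops)
```

With these made-up numbers, Tamil loses 22.0 and Kannada 24.0 percentage points, the same shape of result the audit reports across its 9.8-to-25-point range.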
This is a test of multilingual promises, and the models haven't lived up to them. Aya-Vision-8B, designed for 23 languages, still stumbled, losing 28.5 percentage points when tackling Dravidian scripts. Clearly, multilingual pretraining alone doesn't equate to effective visual reasoning across diverse linguistic landscapes.
Chain-of-Thought Limitations
Chain-of-thought prompting, a method meant to enhance reasoning, backfired for Bengali and Kannada, causing performance drops of 14.4 and 11.4 percentage points, respectively. This points to a deeper issue: the models' reasoning chains remain English-centric and fail to adapt to linguistic diversity, so asking a model to "think step by step" in a language it reasons about poorly can make things worse, not better.
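To make the comparison concrete, here is a sketch of the two prompting conditions being contrasted: direct answering versus chain-of-thought. The template wording and the sample Bengali question are illustrative, not taken from the benchmark:

```python
def build_prompt(question: str, cot: bool) -> str:
    """Assemble the text portion of a VLM prompt (the image is passed separately)."""
    if cot:
        # Chain-of-thought: ask for step-by-step reasoning before the answer.
        # The audit found this condition *hurt* Bengali and Kannada.
        return f"{question}\nLet's think step by step, then give the final answer."
    # Direct condition: request only the final answer.
    return f"{question}\nAnswer with the final option only."

# Bengali: "How many triangles are in the figure?" (illustrative question)
q = "চিত্রে কয়টি ত্রিভুজ আছে?"
print(build_prompt(q, cot=True))
```

The point of the finding is that the extra reasoning text elicited by the first template tends to come out in English-shaped chains, which do not transfer cleanly to the target language.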
Vision-language models are undoubtedly powerful, but their reliance on English-centric training data and reasoning approaches is a glaring limitation. We're building systems meant to serve users everywhere, and if those systems can't reason in diverse languages, we're constructing them on shaky foundations.
Beyond Translation
Releasing the translated benchmark and model outputs is a step forward. Yet, the question remains: how do we build models that truly understand and perform across global languages? The answer requires more than just translation. It demands a fundamental shift in how these models are trained and evaluated. It's about creating models with genuine multilingual reasoning capabilities, not just a facade of diversity.
The implications are clear. As AI continues its pervasive influence, the ability of vision-language models to work effectively across languages isn't just a nice-to-have. It's essential. Without this capability, we're limiting the potential for AI's global impact. It's time for the industry to address this gap with urgency and innovation.