Why Vision-Language Models Struggle with Indian Languages
Vision-language models falter when navigating India's linguistic diversity, showing significant accuracy declines. Is English-centric AI holding us back?
Vision-language models (VLMs) are great at tackling complex reasoning tasks in English. But throw in the linguistic richness of India, and these models start to stumble. Here's what happened when researchers put them to the test across Indian languages.
The Experiment
A total of 980 questions from benchmarks like MathVista and ScienceQA were translated into six Indian languages, including Hindi and Tamil. This wasn't just about running basic translations. The translations were produced with IndicTrans2, then cross-verified with Gemini 2.0 Flash on 50 samples per language. The result? A whopping 68,600 inference records.
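Those numbers can be sanity-checked with some quick arithmetic. A minimal sketch, using the figures from the article; the breakdown of runs per prompt (e.g., how many models and prompting modes) is an assumption, since the article doesn't spell it out:

```python
# Figures stated in the article.
QUESTIONS = 980
LANGUAGES = 7  # English plus the six Indian languages

prompts = QUESTIONS * LANGUAGES   # 6,860 unique question-language prompts
records = 68_600                  # total inference records reported

# Implies 10 inference runs per prompt -- plausibly several models
# crossed with prompting modes, though that split is an assumption.
runs_per_prompt = records // prompts
print(prompts, runs_per_prompt)   # 6860 10
```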
Think of it this way: each linguistic switch became a hurdle. Accuracy fell by 9.8 to 25 percentage points when models shifted from English to an Indian language. Dravidian languages like Tamil and Kannada seemed to suffer even more, with an additional drop of up to 13.2 points compared to Indo-Aryan languages.
Chain-of-Thought: A Double-Edged Sword
Chain-of-thought prompting is often celebrated as a method to enhance reasoning. Not this time. In Bengali and Kannada, it actually degraded performance by 14.4 and 11.4 points, respectively. It seems these reasoning chains have their roots firmly planted in English, leaving other languages out in the cold.
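For readers unfamiliar with the setup, the comparison boils down to two prompt variants per question. A hedged sketch; the study's exact wording isn't given, so these templates (including the classic "Let's think step by step" trigger) are assumptions:

```python
def build_prompt(question: str, chain_of_thought: bool) -> str:
    """Build either a direct-answer or a chain-of-thought prompt.

    The phrasing below is illustrative, not the study's actual template.
    """
    if chain_of_thought:
        # Standard zero-shot CoT trigger phrase.
        return f"{question}\nLet's think step by step."
    return f"{question}\nAnswer with the final choice only."

# Hindi: "What is the maximum value in this graph?"
q = "इस ग्राफ़ में अधिकतम मान क्या है?"
direct = build_prompt(q, chain_of_thought=False)
cot = build_prompt(q, chain_of_thought=True)
```

The finding above suggests the second variant helps only when the model's step-by-step traces work in the prompt's language, which for Bengali and Kannada they apparently do not.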
Even Aya-Vision-8B, which boasts support for 23 languages, couldn't keep pace. It dropped a staggering 28.5 points on Dravidian scripts. So, what's going wrong? Is it that multilingual pretraining just can't cut it for visual reasoning?
Why This Matters
Here's why this matters for everyone, not just researchers. India's tech scene is booming, and AI models need to be as diverse as the user base they serve. If your model trips over language, can it truly claim global applicability? English-centric models are a bottleneck in a multilingual world.
Let me translate from ML-speak: without addressing these language gaps, we're limiting AI's potential reach. It's like having a sports car but only driving it around the block. The potential's there, but we're not tapping into it.
So, what's the solution? Investing in more nuanced multilingual training datasets and developing language-specific reasoning frameworks could be a start. As AI continues to evolve, embracing linguistic diversity isn't just an option, it's a necessity.
Key Terms Explained
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.
Inference: Running a trained model to make predictions on new data.
Prompt: The text input you give to an AI model to direct its behavior.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.