Multimodal Models: The Overlooked Imbalance in Visual Tasks
Multimodal language models are underperforming in visual tasks, with language representations overshadowing vision. A new approach highlights this imbalance.
Multimodal language models are often touted for their ability to process both text and visual inputs. However, recent findings reveal a glaring imbalance: these models significantly underperform on visual perception tasks. But why is this the case?
The Centroid Conundrum
The paper, published in Japanese, introduces an inventive probe for this question: centroid replacement. By collapsing each token representation to its nearest K-means centroid, the researchers uncovered a structural asymmetry across seven models from three architecture families. Notably, erasing text centroid structure caused roughly four times the accuracy drop of erasing visual centroid structure. This indicates that language representations dominate vision, even in tasks that require visual reasoning.
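The mechanics of the probe are simple to sketch. A minimal illustration of "collapsing each token to its nearest K-means centroid" might look like the following; the cluster count, the layer probed, and the use of scikit-learn are all assumptions for illustration, not details from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def collapse_to_centroids(hidden_states, n_clusters=8, seed=0):
    """Replace each token representation with its nearest K-means centroid.

    hidden_states: (num_tokens, hidden_dim) array of token activations
    (e.g. from one layer of a multimodal model).
    Returns an array of the same shape where every row is a centroid,
    erasing all within-cluster structure while keeping coarse structure.
    """
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    km.fit(hidden_states)
    return km.cluster_centers_[km.labels_]

# Toy demo: 200 fake "token" vectors in a 16-dim space.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(200, 16))
collapsed = collapse_to_centroids(tokens, n_clusters=8)

# Shape is preserved, but at most 8 distinct rows remain.
print(collapsed.shape, len(np.unique(collapsed, axis=0)))
```

Running this separately on the text-token rows and the image-token rows, then measuring the downstream accuracy drop, is the kind of comparison that would surface the asymmetry the paper reports.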
Decoding the Asymmetry
But what can be done about it? The researchers proposed a training-free fix: text centroid contrastive decoding. By decoding contrastively against a text-centroid-erased reference, accuracy on individual tasks improved by up to 16.9%. This isn't a trivial gain.
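The general shape of contrastive decoding is well established: amplify the gap between the full model's logits and a degraded reference run. A minimal sketch follows, using the common form (1 + α)·logits_full − α·logits_reference; the α value, the toy vocabulary, and this exact formula are assumptions, and the paper's precise scheme may differ.

```python
import numpy as np

def contrastive_logits(logits_full, logits_erased, alpha=0.5):
    """Contrast next-token logits against a text-centroid-erased reference.

    logits_full:   logits from the intact model.
    logits_erased: logits from the same model after erasing text
                   centroid structure (the degraded reference run).
    Tokens the degraded run still scores highly (pure language priors)
    get pushed down; tokens that relied on the erased structure rise.
    """
    return (1 + alpha) * logits_full - alpha * logits_erased

# Toy example over a 3-token vocabulary.
full = np.array([2.0, 1.9, 0.0])     # intact model narrowly prefers token 0
erased = np.array([2.5, 0.5, 0.0])   # reference run still loves token 0
adjusted = contrastive_logits(full, erased, alpha=0.5)

# The contrast flips the choice away from the language-prior token.
print(int(np.argmax(full)), int(np.argmax(adjusted)))
```

In words: token 0 is the one the text-prior-driven reference keeps predicting even without the erased structure, so the contrast penalizes it and the visually grounded alternative wins.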
Interestingly, the intervention's effectiveness varied with training approach. Standard fine-tuned models benefited more, averaging a 5.6% gain, versus just 1.5% for preference-optimized models. This suggests that how we train these models may itself skew them toward text over visuals.
The Bigger Picture
So, why should this matter to the AI community? It's simple. If we're designing models that are meant to understand and interpret the world as humans do, shouldn't they excel in both language and visual tasks? The benchmark results speak for themselves. The imbalance isn't just a minor hiccup. It's a clear signal that our multimodal training strategies might need an overhaul.
What the English-language press missed: this modal competition isn't just an abstract concept. It's localized, correctable, and doesn't require retraining. This makes it a diagnostic signal, one that could guide future improvements in multimodal training. If multimodal AI is to achieve its full potential, addressing this imbalance is key.
The question for researchers and developers now is: will they adjust their approach to training these models, or will they continue to let language overshadow vision? One thing’s for sure, the future of multimodal models hangs in the balance.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Multimodal model: An AI model that can understand and generate multiple types of data: text, images, audio, video.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.
Token: The basic unit of text that language models work with.