Unmasking the Hallucinations of Audio-Visual LLMs

Large language models (LLMs) have been making waves with their audio-visual capabilities. But there's a catch. These models, despite their apparent prowess, are prone to hallucinations: outputs that sound plausible but have no grounding in the input. In particular, hallucinations induced by speech haven't been thoroughly explored until now.

The Speech-Vision Alignment Problem

Current benchmarks tend to focus on environmental sounds, like a dog barking, as cues for events. But what about human speech? It carries rich semantics and intricate temporal structure, and this is where the models falter. The new benchmark, SVHalluc, steps in to evaluate these speech-vision hallucinations from both semantic and temporal angles. And guess what? The results aren't pretty for most models.

Why Should We Care?

Here's the kicker: state-of-the-art open-source audio-visual LLMs struggle to align speech content with the corresponding visual signals. On multiple tasks, their accuracy is close to what random guessing would produce. That's a big red flag for anyone relying on these models for accurate audio-visual comprehension. If they can't get basic alignment right, how can they be trusted with more complex tasks?
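
To make "near-random accuracy" concrete, here is a minimal sketch of how one might check whether a score is distinguishable from guessing. It assumes a multiple-choice task with equally likely options and uses a one-sided binomial test. The function names and the example numbers are hypothetical illustrations, not SVHalluc's actual protocol or results.

```python
from math import comb


def chance_level(num_options: int) -> float:
    """Expected accuracy of uniform random guessing (assumed task format)."""
    return 1.0 / num_options


def binomial_p_value(correct: int, total: int, p_chance: float) -> float:
    """One-sided P(X >= correct) for X ~ Binomial(total, p_chance)."""
    return sum(
        comb(total, k) * p_chance**k * (1 - p_chance) ** (total - k)
        for k in range(correct, total + 1)
    )


def is_near_chance(correct: int, total: int, num_options: int,
                   alpha: float = 0.05) -> bool:
    """True if the score is not significantly better than guessing."""
    p = binomial_p_value(correct, total, chance_level(num_options))
    return p >= alpha


# Hypothetical numbers for illustration only (not benchmark results):
# 52 of 100 correct on a yes/no task versus 90 of 100 correct.
print(is_near_chance(52, 100, num_options=2))
print(is_near_chance(90, 100, num_options=2))
```

A score only counts as evidence of real alignment if it clears the chance baseline by more than sampling noise would explain, which is why the raw accuracy alone is not enough.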

Gemini 2.5 Pro: The Lone Star

There's a silver lining, though. Gemini 2.5 Pro has emerged as a standout, significantly outperforming its open-source counterparts. It's a testament to what can be achieved when cross-modality understanding is prioritized. But isn't it a bit concerning that only a select few models can manage this? We need more than just isolated success stories.

A Call for Better Models

These findings reveal a fundamental limitation in current audio-visual LLMs. Their failures stem largely from a limited ability to reason across modalities, even though they perform well on single-modality perception. This gap needs addressing. If a model can't connect what it sees with what it hears, what's the point?

In a world increasingly relying on AI for nuanced tasks, the demand for speech-grounded video comprehension is urgent. The models shouldn't just grind through data but understand it in a meaningful way. The industry needs to step up its game.