XModBench: Unmasking the Flaws in Omni-Modal Models
XModBench exposes flaws in omni-modal models like Gemini 2.5 Pro, showing they struggle with spatial-temporal reasoning and cross-modal consistency.
Omni-modal large language models (OLLMs) are all the buzz in AI, promising to unify audio, vision, and text understanding under one roof. But is this unified dream a reality? Not quite yet. A new benchmark, XModBench, is here to show exactly where it falls short.
Breaking Down XModBench
XModBench is a beast of a benchmark: 60,828 questions spread across five task families. It systematically covers all six modality compositions in its question-answer pairs, with audio, vision, and text each taking a turn on either side (see the sketch below). Why's this significant? It cuts through the complexity to test whether these models reason just as well when the information arrives as sound or pixels instead of text.
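For the curious, here's a rough sketch of what "all six modality compositions" works out to. The names and layout are illustrative guesses, not XModBench's actual schema: with three modalities and distinct question/answer sides, you get 3 × 2 = 6 ordered pairings.

```python
from itertools import permutations

# Illustrative sketch (not XModBench's actual schema): three modalities,
# distinct question/answer sides, so 3 x 2 = 6 ordered compositions.
MODALITIES = ["text", "vision", "audio"]

compositions = list(permutations(MODALITIES, 2))
for q_mod, a_mod in compositions:
    print(f"question side: {q_mod:6s} -> answer side: {a_mod}")
# text -> vision, text -> audio, vision -> text,
# vision -> audio, audio -> text, audio -> vision
```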
And the findings are wild. Even the strongest contender, Gemini 2.5 Pro, hits a wall. Spatial and temporal reasoning? Less than 60% accuracy. That’s not just a small gap. It’s a chasm.
The Story of Modality Disparities
The benchmark's results confirm it: modality disparities are rampant. Gemini 2.5 Pro's performance drops substantially when the same semantic content is delivered as audio instead of text. That's a big red flag. What's the point of a unified model if it chokes on audio? Users need consistency, and right now, OLLMs aren't delivering on that promise.
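To make that concrete, here's a minimal sketch of one natural way to quantify such a disparity, assuming you have per-question results tagged by modality composition. The data and metric are purely illustrative, not the paper's official scoring code.

```python
from collections import defaultdict

# Toy per-question results: (question_modality, answer_modality, is_correct).
# Illustrative data only, not real XModBench outputs.
results = [
    ("text", "vision", True), ("text", "vision", True), ("text", "vision", False),
    ("audio", "vision", True), ("audio", "vision", False), ("audio", "vision", False),
]

tally = defaultdict(lambda: [0, 0])  # (q_mod, a_mod) -> [num_correct, num_total]
for q_mod, a_mod, correct in results:
    tally[(q_mod, a_mod)][0] += int(correct)
    tally[(q_mod, a_mod)][1] += 1

accuracy = {comp: correct / total for comp, (correct, total) in tally.items()}

# Disparity: how far accuracy falls when the same content arrives as audio
# instead of text, holding the answer modality fixed.
gap = accuracy[("text", "vision")] - accuracy[("audio", "vision")]
print(f"text-vs-audio accuracy gap (vision answers): {gap:+.2f}")
```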
And just like that, the leaderboard shifts depending on how the question is posed. There's also a directional imbalance: when vision serves as the context modality, consistency drops compared to when text does. It's like watching a world-class sprinter stumble over every hurdle when running on turf instead of a track.
Why This Matters
This changes the landscape. Are OLLMs just a fancy label for half-baked solutions? If they can’t handle cross-modal tasks smoothly, are they really worth the hype? XModBench sets the stage for these models to prove their worth or be left behind.
The labs are scrambling to patch these gaps. But will they succeed? Can they create a truly modality-invariant model? The race is on, and the pressure is mounting.
In the end, XModBench isn’t just a diagnostic tool. It’s a wake-up call. For AI to truly be omni-modal, it’s got to step up its game. Until then, the industry must face the music.