Why Omni LLMs Still Struggle with Multimodal Safety

By Tanya Kimura, June 5, 2026

MCBench sheds light on the limitations of Omni Large Language Models in multimodal safety. Despite their ability to process vision, audio, and text, these models often falter when tasked with nuanced safety judgments.

Artificial intelligence is breaking new ground, but it's not always smooth sailing. Enter MCBench, a benchmark that's shaking up our understanding of how Omni Large Language Models (LLMs) handle safety across modalities. Think of it as a stress test with 1,196 scenarios that challenge these models to integrate vision, audio, and text for safety assessments.

Safety Isn't So Simple

The MCBench creators set the stage with four safety categories and paired each unsafe scenario with a nearly identical safe one. The goal? To see if these models can spot the subtle differences. Spoiler alert: they're not acing the test. Even the top models are having a tough time with nuanced threats that don't jump out from a screen or speaker. On the flip side, they fare better when there's a glaring alarm bell like a loud noise or a flashing light.

That raises a question: are these Omni LLMs really as advanced as we think? Or are we expecting too much too soon? Current models seem to hit a wall when asked to perform cross-modal reasoning in settings where safety is critical.
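To see why pairing each unsafe scenario with a near-identical safe one makes the test harder, here is a minimal Python sketch. The article does not say how MCBench actually scores models, so the `PairResult` structure, the two scoring functions, and the category labels are all hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Model verdicts for one matched unsafe/safe pair (hypothetical structure)."""
    category: str         # placeholder label, not an actual MCBench category name
    unsafe_flagged: bool  # True if the model called the unsafe scenario unsafe
    safe_flagged: bool    # True if the model called the matched safe scenario unsafe


def per_scenario_accuracy(results):
    """Score each scenario on its own, ignoring the pairing."""
    if not results:
        return 0.0
    correct = sum(int(r.unsafe_flagged) + int(not r.safe_flagged) for r in results)
    return correct / (2 * len(results))


def pair_accuracy(results):
    """Count a pair as correct only if the model gets BOTH halves right."""
    if not results:
        return 0.0
    correct = sum(1 for r in results if r.unsafe_flagged and not r.safe_flagged)
    return correct / len(results)


# Illustrative verdicts, not real benchmark data.
results = [
    PairResult("category_a", unsafe_flagged=True, safe_flagged=False),   # both right
    PairResult("category_a", unsafe_flagged=True, safe_flagged=True),    # flags everything
    PairResult("category_b", unsafe_flagged=False, safe_flagged=False),  # misses the threat
    PairResult("category_b", unsafe_flagged=True, safe_flagged=False),   # both right
]

print(per_scenario_accuracy(results))
print(pair_accuracy(results))
```

Under this assumed scoring, a model that gives the same answer to both halves of a pair still earns half credit per scenario but nothing per pair, which is the kind of gap a matched design is meant to expose.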

What's Missing?

The big reveal from MCBench is that Omni LLMs are good at extracting information from individual modalities but trip up when asked to integrate these cues. It's like being a jack of all trades, master of none. Why should you care? Because as AI continues to infiltrate industries like healthcare and transportation, effective cross-modal reasoning is essential. A missed cue could mean the difference between a minor inconvenience and a major safety lapse.

For developers and researchers, this is a call to action. The architecture and training strategies for these models need a serious upgrade, and now's the time to double down on building more robust frameworks that can handle these complex tasks.

The Road Ahead

So where do we go from here? Improving these models isn't just a technical challenge; it's a necessity. As AI becomes more embedded in our daily lives, ensuring these systems can make accurate, nuanced decisions across multiple modalities could be the linchpin for broader adoption.

In the end, MCBench isn't just a test; it's a wake-up call. The bar has moved: keep up or risk being left in the dust. With the stakes this high, it's not just an academic exercise. It's a real-world imperative.
