Cracking the Code: How MLLMs Struggle with Video Narratives
Multimodal large language models (MLLMs) excel at vision-language tasks but falter with video narratives. NarrativeTrack reveals the gap in temporal reasoning.
Multimodal large language models (MLLMs) have certainly turned heads with their capabilities in vision-language reasoning. Yet when it comes to understanding how narratives unfold over time in videos, they seem to fall short. What’s the deal? The new NarrativeTrack benchmark aims to decode this issue, highlighting where these models shine and where they stumble.
NarrativeTrack: A New Benchmark
The researchers behind NarrativeTrack aren’t playing it safe. Instead of sticking with short video clips or broad scene-level semantics, they dig into the nitty-gritty. They break videos down into entities and their stories, using what they call a Compositional Reasoning Progression (CRP). This isn’t just about asking who did what, where, and when; it’s about seeing how these elements interact and evolve over time, across three challenging dimensions: entity existence, changes, and ambiguity.
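To make the idea concrete, here is a minimal sketch of what a CRP-style item might look like as a data structure. The class names, field names, and the example events are illustrative assumptions, not NarrativeTrack's actual schema; the point is simply that each item ties an entity's who/what/where/when events to one of the three dimensions, and that answering correctly requires ordering those events in time.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a NarrativeTrack-style benchmark item.
# Names and fields are assumptions for illustration only.

@dataclass
class EntityEvent:
    entity: str        # who / what
    action: str        # did what
    location: str      # where
    timestamp: float   # when (seconds into the video)

@dataclass
class NarrativeItem:
    video_id: str
    dimension: str                      # "existence" | "change" | "ambiguity"
    events: list = field(default_factory=list)

    def timeline(self):
        """Events in temporal order -- the ordering a model must track."""
        return sorted(self.events, key=lambda e: e.timestamp)

# Example: an entity leaves one scene and reappears in another.
item = NarrativeItem(video_id="demo", dimension="change")
item.events = [
    EntityEvent("person_A", "exits frame", "kitchen", 12.0),
    EntityEvent("person_A", "enters frame", "hallway", 4.5),
]
print([e.action for e in item.timeline()])
# → ['enters frame', 'exits frame']
```

Note how the events are stored out of order: a model that answers from a single frame gets the actions right but the narrative wrong, which is exactly the failure mode the benchmark probes.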
Where Current MLLMs Stumble
Here’s what the benchmarks actually show: current MLLMs have a tough time tracking entities through visual and temporal transitions. While general-purpose models might be great at initial perceptual grounding, they often lose the plot when tracking entities over time. Video-specific models fare somewhat better with temporal context but still hallucinate under pressure.
Frankly, the numbers point to a deeper issue: MLLMs are caught in a tug-of-war between perceptual grounding and temporal coherence. They can identify what’s happening in a static frame, but add the element of time and they struggle. It’s a fundamental trade-off that won’t be easy to resolve.
Why This Matters
So why should we care? The reality is, as video becomes a more prominent medium for communication, from social media to professional content, having systems that understand these narratives as humans do becomes increasingly important. Can we afford machines that hallucinate identities or lose track of continuity?
Strip away the marketing, and you get a field striving for genuine narrative comprehension. The architecture matters more than the parameter count here, and until MLLMs can effectively integrate perceptual grounding with temporal reasoning, true narrative understanding will remain elusive. The challenge is clear: integrate better, or risk becoming obsolete in an evolving video-centric world.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Grounding: Connecting an AI model's outputs to verified, factual information sources.
Multimodal models: AI models that can understand and generate multiple types of data — text, images, audio, video.
Parameter: A value the model learns during training — specifically, the weights and biases in neural network layers.