Why Video AI Struggles with Long-Form Content

Video AI is hitting a wall with long-form content. The challenge isn't recognizing events in a video; it's finding them in the first place. When researchers put current Video-Language Models (Video-LLMs) to the test using a new hour-scale benchmark called ExtremeWhenBench, the results were surprising and a bit disheartening.

The Grounding Problem

ExtremeWhenBench includes 2,273 queries over 194 videos. The videos average 75.7 minutes, and some stretch up to nine hours. The task is simple to state: for each natural-language query, locate the time interval in the video where the described event happens. Simple as that sounds, Video-LLMs are falling short.

The crux of the issue is search, not recognition. In shorter videos, models can localize events easily. In longer videos, finding the relevant section is like searching for a needle in a haystack, and current models aren't up to it. A frame-level retrieval baseline, by contrast, outperformed the Video-LLMs, which shows how much work is still needed on search in long-form video.
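To make the idea of a frame-level retrieval baseline concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the benchmark's actual baseline: it assumes each sampled frame and the text query have already been embedded into the same vector space by some joint text-image encoder (not specified here), and the function name retrieve_interval and the keep_ratio parameter are hypothetical. It scores every frame against the query, finds the best-matching frame, and grows an interval around it while neighboring frames stay close to the peak score.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_interval(frame_embs, query_emb, fps=1.0, keep_ratio=0.8):
    """Return a (start_sec, end_sec) guess for where the query occurs.

    frame_embs: one embedding per sampled frame (assumed precomputed).
    query_emb:  embedding of the text query in the same space (assumed).
    fps:        frames sampled per second of video.
    keep_ratio: hypothetical knob; neighbors scoring at least this
                fraction of the peak score are merged into the interval.
    """
    if not frame_embs:
        raise ValueError("no frames to search")

    scores = [cosine(f, query_emb) for f in frame_embs]
    peak = max(range(len(scores)), key=scores.__getitem__)
    threshold = scores[peak] * keep_ratio

    # Grow the interval left and right from the best-matching frame.
    start = peak
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    end = peak
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1

    # Convert frame indices to seconds; the end is exclusive.
    return start / fps, (end + 1) / fps
```

The point of the sketch is that nothing in it reasons about the video as a whole. It only compares the query with one frame at a time, which is exactly the kind of cheap search the article argues long videos need.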

Why It Matters

This isn't just a tech problem. It's a warning sign for anyone relying on AI for media analysis, content moderation, or even surveillance. If today's AI can't efficiently sift through hours of content, how can we trust it to handle the massive influx of video data we generate every day?

We need to ask ourselves: Are we focusing too much on recognition when the real bottleneck is search? The retrieve-then-ground approach, which mirrors the retrieve-then-read method in open-domain question answering, seems like a promising path forward. It improved performance 6.7-fold over the monolithic Video-LLM approach.
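A rough sketch of what a retrieve-then-ground pipeline could look like, in Python. Everything here is an assumption made for illustration: the window and stride lengths, the top_k cutoff, and the two callables score_window (a cheap retriever) and ground_in_window (a stand-in for an expensive Video-LLM that localizes an event inside a short clip) are hypothetical and are not taken from the benchmark's authors.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def split_windows(duration: float, window: float, stride: float) -> List[Interval]:
    """Cut the timeline [0, duration) into overlapping fixed-length windows."""
    windows = []
    start = 0.0
    while start < duration:
        windows.append((start, min(start + window, duration)))
        start += stride
    return windows


def retrieve_then_ground(
    duration: float,
    query: str,
    score_window: Callable[[str, Interval], float],
    ground_in_window: Callable[[str, Interval], Tuple[Interval, float]],
    window: float = 120.0,
    stride: float = 60.0,
    top_k: int = 3,
) -> Interval:
    """Two-stage localization: cheap retrieval first, costly grounding second.

    score_window:     hypothetical cheap retriever; higher means more relevant.
    ground_in_window: hypothetical grounder (for example a Video-LLM run on a
                      short clip); returns (interval, confidence).
    """
    # Stage 1 (retrieve): rank every window with the cheap scorer.
    windows = split_windows(duration, window, stride)
    ranked = sorted(windows, key=lambda w: score_window(query, w), reverse=True)

    # Stage 2 (ground): run the expensive model only on the top candidates.
    best_interval, best_conf = None, float("-inf")
    for candidate in ranked[:top_k]:
        interval, conf = ground_in_window(query, candidate)
        if conf > best_conf:
            best_interval, best_conf = interval, conf
    return best_interval
```

The design choice mirrors retrieve-then-read: the expensive model never sees the full multi-hour video, only a handful of short candidate clips that the retriever has already narrowed down.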

The Road Ahead

So where do we go from here? It's clear that AI's potential to revolutionize video analysis is still largely untapped, but not for lack of trying. Search capability needs to improve, and perhaps it's time to shift our priorities. Will companies heed this wake-up call? Or will they continue to invest in tools that leave much to be desired, especially for long-form content?

I talked to the people who actually use these tools. Their frustrations are palpable. Management bought the licenses. Nobody told the team how to make the best of them. Until AI can accurately and efficiently sift through our video content, its promise will remain just that: a promise.
