Revolution in Reasoning: Why 'Thinking with Video' Outpaces Text and Images
Sora-2's video-centric approach reshapes multimodal reasoning, rivaling and in some cases outperforming state-of-the-art models in both vision and text tasks.
In AI, the paradigms of 'Thinking with Text' and 'Thinking with Images' have long dominated, pushing large language models (LLMs) and vision-language models (VLMs) to new heights. Yet these methods face undeniable limitations. Images, bound to single moments, fail to capture dynamic processes. Text and vision, when treated as separate modalities, hinder a truly unified understanding. Enter 'Thinking with Video,' a game-changing approach that might just redefine multimodal reasoning.
Unpacking 'Thinking with Video'
The proposal is bold: use video generation models, such as Sora-2, as a unified medium for reasoning. Because video unfolds as continuous frames, the approach aims to blend the strengths of visual and textual reasoning in a single medium. The Video Thinking Benchmark (VideoThinkBench) was developed to test this new paradigm, covering tasks from visual puzzles to complex mathematical questions.
Results from VideoThinkBench are telling. Sora-2 matches the performance of top-tier VLMs on vision-centric tasks and even outpaces GPT-5 by 10% on eyeballing puzzles. On text-centric tasks, its 92% accuracy on MATH challenges and 69.2% on MMMU questions demonstrate a significant leap forward. This isn't just incremental progress; it's a potential shift in how we approach AI reasoning.
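To make the idea concrete, here is a minimal sketch of what a 'thinking with video' evaluation loop might look like. The `generate_video` and `read_answer_from_frames` functions are hypothetical placeholders, since the article does not describe Sora-2's actual interface or VideoThinkBench's scoring code; the structure simply illustrates prompting a video model with a task and reading the answer off the generated frames.

```python
# Hypothetical sketch of a "thinking with video" evaluation loop.
# generate_video() and read_answer_from_frames() are placeholder stubs;
# the real Sora-2 API and VideoThinkBench harness are not described here.
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str  # task statement, e.g. a visual puzzle or math question
    answer: str  # ground-truth answer used for scoring


def generate_video(prompt: str) -> list[str]:
    """Placeholder stub: a real implementation would call a video
    generation model. Here we hard-code frames for the demo task."""
    return ["intermediate 'working' frames ...", "4"]


def read_answer_from_frames(frames: list[str]) -> str:
    """Placeholder stub: a real pipeline might use OCR or a VLM judge
    to extract the answer shown in the final generated frame."""
    return frames[-1]


def evaluate(tasks: list[Task]) -> float:
    """Score one point per task whose extracted answer matches ground truth."""
    correct = 0
    for task in tasks:
        frames = generate_video(task.prompt)
        if task.answer in read_answer_from_frames(frames):
            correct += 1
    return correct / len(tasks)


if __name__ == "__main__":
    demo = [Task(prompt="2 + 2 = ?", answer="4")]
    print(f"accuracy: {evaluate(demo):.2%}")
```

The key design point the sketch captures is that the video itself is the reasoning trace: intermediate frames play the role that chain-of-thought text plays for an LLM, and only the final frame is scored.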
Why Video?
Why does 'Thinking with Video' matter? Quite simply, it's about capturing the full spectrum of human experience. Dynamic processes, continuous changes, and nuanced interactions are best captured through video. If AI can process such comprehensive data, its reasoning capabilities could become far more sophisticated.
But there's more. The promise of video as a medium for unified reasoning isn't just theoretical; it's practical. Sora-2's performance hints at a future where AI doesn't just see or read, but understands in a way that's closer to human cognition.
The Road Ahead
There's no denying that 'Thinking with Video' presents challenges. The computational demands of processing video data are substantial, and not every application will benefit equally. However, the potential rewards are significant. As this approach develops, it could redefine the benchmarks for AI understanding altogether.
So, is 'Thinking with Video' the future? It's too early to declare it a universal solution, but it's undeniably a step forward. The promise is real, even if most projects chasing it won't pan out. Those that do could transform AI reasoning.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
GPT: Generative Pre-trained Transformer.
Multimodal: AI models that can understand and generate multiple types of data, including text, images, audio, and video.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.