AdaCodec: Revolutionizing Video Compression for AI
AdaCodec challenges the status quo in video processing by prioritizing meaningful frame changes over redundant data. This method slashes token usage while enhancing performance.
Video content often suffers from redundancy, with adjacent frames showing similar objects and backgrounds. Traditional video multimodal large language models (MLLMs) have been inefficiently processing these frames as independent RGB images, leading to repeated content across frames. That's where AdaCodec comes in, offering a smarter way to handle video data.
A New Approach to Video Encoding
AdaCodec introduces the concept of a predictive visual code. Instead of treating each frame as a standalone entity, AdaCodec selectively sends full reference frames only when the scene is unpredictable from the existing context. Otherwise, it transmits a compact summary of changes with P-tokens, focusing on motion and prediction residuals.
This approach marks a significant departure from the Qwen3-VL-8B per-frame RGB baseline. AdaCodec reduces the visual token budget dramatically while outperforming existing benchmarks. Even more impressive, it achieves this at just 1/7th of the token budget, demonstrating that less can indeed be more.
Benchmark Busting Performance
Across eleven benchmarks, AdaCodec doesn't just keep pace. it raises the bar. It surpasses the 224k baseline on all long-video benchmarks using only 32k tokens. This is a massive leap in efficiency, particularly on the five general-video benchmarks where the average score improves and the time-to-first-token drops significantly from 9.26 seconds to 1.62 seconds.
The implications for AI processing and storage are profound. What good is slapping a model on a GPU rental if it can't efficiently handle input data? AdaCodec shows us that smart encoding isn't just about reducing data, but about prioritizing meaningful information.
Real-world Impact
Why does this matter? As video content continues to dominate digital landscapes, the demand for efficient, fast processing is only increasing. AdaCodec isn't just a technical achievement. it's a wake-up call for the industry. If video models can learn to focus on what's truly important, we save time, energy, and resources.
Will other models take note and adapt? They'd better, if they want to stay relevant and effective. Show me the inference costs, and then we'll see who's really leading the charge in video AI.
Get AI news in your inbox
Daily digest of what matters in AI.