Streamlining Vision-Language Models: The Case for Token Pruning
Spatio-Temporal Token Scoring (STTS) dramatically boosts efficiency in video-based vision-language models, pruning 50% of tokens with minimal performance loss.
Vision-language models (VLMs) are the workhorses behind many of today's AI-driven video applications. But they're often bogged down by inefficiencies: processing every vision token for every sampled frame is expensive. Enter Spatio-Temporal Token Scoring (STTS), a new approach that promises to change the game by trimming the fat.
The Token Pruning Problem
Token pruning is important for optimizing VLMs, especially when dealing with video data, because videos are notorious for temporal redundancy: consecutive frames often carry nearly identical information. Previously, token pruning either targeted only the vision transformer (ViT) or focused solely on the language model (LLM). Both had their limitations. Let me break this down.
Pruning within the ViT typically worked for specific tasks like action recognition, but not for broader vision-language tasks. On the flip side, pruning within the LLM left the ViT outputs untouched, which forced complex mechanisms for selecting tokens downstream. STTS tells a different story.
STTS: A Breakthrough in Efficiency
STTS stands out by pruning vision tokens across both ViT and LLM without the need for complex text-conditioning or merging. It achieves this with a lightweight module that scores tokens temporally through an auxiliary loss and spatially via LLM gradients. It sounds technical, but the impact is clear.
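To make the idea concrete, here is a minimal sketch of the pruning step. In STTS the temporal and spatial scores are learned (via an auxiliary loss and LLM gradients, respectively); the sketch assumes those scores are already computed and shows only how a combined score selects which tokens survive. The function name, the additive score fusion, and the `keep_ratio` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def prune_tokens(tokens, temporal_scores, spatial_scores, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of vision tokens by combined score.

    tokens: (N, D) array of vision-token embeddings.
    temporal_scores, spatial_scores: (N,) importance scores, assumed to be
    precomputed (in STTS they come from an auxiliary loss and LLM gradients).
    The additive fusion below is an illustrative choice.
    """
    scores = temporal_scores + spatial_scores
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    # Take the top-n_keep indices, then restore original token order.
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep_idx], keep_idx

# Example: 8 tokens of dimension 4, pruned to the top 50%.
tokens = np.arange(32, dtype=float).reshape(8, 4)
temporal = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])
spatial = np.zeros(8)
kept, idx = prune_tokens(tokens, temporal, spatial, keep_ratio=0.5)
```

Because the surviving tokens are re-sorted into their original positions, downstream attention still sees them in frame order, only with fewer entries.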
By pruning 50% of vision tokens, STTS boosts efficiency by an impressive 62% during both training and inference. And it does this with a mere 0.7% drop in performance across 13 video QA tasks. Frankly, that's a trade-off most would jump at.
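As rough intuition for why dropping half the tokens can save more than half the compute: self-attention cost grows quadratically with the number of tokens, while per-token MLP and projection work grows linearly. The toy cost model below is a back-of-the-envelope sketch, and the `quad_share` split is an assumed value chosen to illustrate the mechanism, not a measured breakdown from the STTS paper.

```python
def relative_cost(token_fraction, quad_share=0.48):
    """Toy cost model: `quad_share` of baseline compute scales with n^2
    (attention); the rest scales with n (per-token MLP work).
    quad_share=0.48 is an assumed split for illustration only."""
    lin_share = 1.0 - quad_share
    return lin_share * token_fraction + quad_share * token_fraction ** 2

# Keeping 50% of tokens under this split saves roughly 62% of compute.
saving = 1.0 - relative_cost(0.5)
```

The point is not the exact numbers but the shape of the curve: the quadratic attention term makes savings grow faster than the fraction of tokens removed.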
Why This Matters
Efficiency gains from STTS grow with more sampled frames per video. This means that as video content scales, STTS becomes even more valuable. Test-time scaling for long-video QA further ups the ante, adding another 0.5-1% in performance. Strip away the marketing and you see a compelling case for unified, architecture-wide token pruning.
So, why should users care? The reality is that with video data volumes exploding, every efficiency gain matters. STTS offers a way to handle more data without significant drops in performance. Isn't it time more models adopted this approach?