Rethinking Vision-Language Models: A Better Path Forward

Vision-language models (VLMs) are expanding the frontiers of AI, but their decoder inference is expensive. The cause is the number of visual tokens: these models project each image into hundreds to thousands of tokens, and every one of them adds to both attention computation and memory usage.
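To make the cost concrete, here is a rough back-of-the-envelope sketch. It assumes standard multi-head attention without grouped-query sharing, counts a multiply-add as two FLOPs, and ignores the projection and MLP layers. The function names and every number in it (64 text tokens, 576 visual tokens, hidden size 4096, 32 layers, 2 bytes per value) are illustrative assumptions, not figures from the paper.

```python
def attention_flops(n_tokens, d_model):
    """Approximate FLOPs for one layer's attention core during prefill.

    Computing Q @ K^T and then weights @ V each costs about
    2 * n^2 * d FLOPs, so the total grows quadratically in n.
    """
    return 4 * n_tokens ** 2 * d_model


def kv_cache_bytes(n_tokens, d_model, n_layers, bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector
    per token per layer (assumes no grouped-query sharing)."""
    return 2 * n_layers * n_tokens * d_model * bytes_per_value


# Illustrative (assumed) sizes: a short prompt with and without an image.
text_tokens = 64
visual_tokens = 576
d_model = 4096
n_layers = 32

text_only = attention_flops(text_tokens, d_model)
with_image = attention_flops(text_tokens + visual_tokens, d_model)
flops_ratio = with_image // text_only

cache_mib = kv_cache_bytes(text_tokens + visual_tokens, d_model, n_layers) / 2**20

print(f"attention cost grows {flops_ratio}x once visual tokens are added")
print(f"KV cache for the full sequence: {cache_mib:.0f} MiB")
```

Under these assumptions the sequence is ten times longer with the image, so the quadratic attention term is a hundred times larger, while the KV cache grows linearly with the token count. That gap is why reducing visual tokens pays off.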

The Flaws in Current Token Reduction

Most current methods follow a rank-and-remove approach: they score visual tokens, select a compact subset, and permanently discard the rest. The paper's central argument is that this irreversible step is fragile, because the importance of visual tokens shifts across the depth of the decoder. Tokens deemed irrelevant at one layer may become important in later layers, especially for grounding-sensitive tasks, yet once removed they cannot be recovered. So why treat them as expendable?
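A toy sketch of why rank-and-remove can be fragile. The helper `rank_and_remove` and both score lists are hypothetical, made up for illustration; real methods derive their scores from attention inside the decoder, and this is not the code of any specific pruning method.

```python
def rank_and_remove(scores, keep):
    """Return the indices of the `keep` highest-scoring tokens,
    in their original order. Everything else is dropped for good."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical importance scores for six visual tokens,
# measured at an early layer and at a later layer.
scores_early = [0.9, 0.1, 0.7, 0.2, 0.8, 0.3]
scores_late = [0.2, 0.9, 0.6, 0.1, 0.3, 0.8]

# Prune once at the early layer, keeping three tokens.
kept = rank_and_remove(scores_early, keep=3)

# What the later layer would have chosen if all tokens were still available.
wanted_late = rank_and_remove(scores_late, keep=3)

# Tokens the later layer ranks highly but that are already gone.
lost = [i for i in wanted_late if i not in kept]

print("kept after early pruning:", kept)
print("preferred at later layer:", wanted_late)
print("no longer recoverable:   ", lost)
```

In this made-up case, two of the three tokens the later layer ranks highest were discarded by the early decision, which is the failure mode the paper attributes to irreversible pruning.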

Enter Reroute: A Smarter Approach

This is where Reroute comes in. It is a training-free plug-in that replaces removal with recoverable routing. At each stage, the selected visual tokens pass through the decoder blocks while the rest are temporarily bypassed; the bypassed tokens re-enter the candidate pool at the next routing decision. Reroute reuses the attention-score ranking rules and stage-wise schedules of the pruning method it augments, and it stays within the same theoretical TFLOPs and KV-cache budget class as that method.
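The routing idea described above can be sketched in a few lines. This is a minimal toy under stated assumptions, not the Reroute implementation: `route_stages`, the scalar "hidden states", the `block` function, and the per-stage scores are all hypothetical stand-ins. In the actual method the scores come from the underlying pruning method's attention-based ranking rule, and how bypassed tokens interact with the KV cache is not modelled here.

```python
def route_stages(hidden, stage_scores, budgets, block):
    """Toy recoverable routing.

    hidden       : list of per-token states (plain floats in this sketch)
    stage_scores : one list of importance scores per stage
    budgets      : how many tokens to process at each stage
    block        : stand-in for the decoder blocks of one stage

    At every stage ALL tokens are candidates. The selected ones are
    processed by `block`; the rest are bypassed unchanged and can be
    selected again at a later stage.
    """
    hidden = list(hidden)  # do not mutate the caller's list
    history = []
    for scores, keep in zip(stage_scores, budgets):
        ranked = sorted(range(len(hidden)), key=lambda i: scores[i], reverse=True)
        selected = sorted(ranked[:keep])
        for i in selected:
            hidden[i] = block(hidden[i])
        history.append(selected)
    return hidden, history


# Hypothetical scores for six visual tokens at two routing stages.
stage_scores = [
    [0.9, 0.1, 0.7, 0.2, 0.8, 0.3],
    [0.2, 0.9, 0.6, 0.1, 0.3, 0.8],
]

# Each "state" counts how many stages processed that token.
final_hidden, history = route_stages(
    hidden=[0.0] * 6,
    stage_scores=stage_scores,
    budgets=[3, 3],
    block=lambda h: h + 1.0,
)

print("selected per stage:", history)
print("times processed:   ", final_hidden)
```

Each stage still processes only three tokens, so the per-stage budget matches what a prune-once schedule would use, but tokens 1 and 5, which were bypassed at the first stage, are picked up at the second instead of being lost.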

The paper reports that across multiple pruning methods, including FastV, PDrop, and Nüwa, on LLaVA-1.5 and Qwen backbones, Reroute not only sustains general VQA performance but also improves grounding under aggressive token reduction. The key finding: VLM token reduction should move from irreversible pruning to adaptable, recoverable routing.

Why This Matters

What does this mean for the future of vision-language models? With Reroute, token reduction stops being a one-way cut and becomes a dynamic, reversible process: a token bypassed at one stage can still be selected at a later one. The approach builds on prior work from the VLM community, since it reuses existing ranking rules and schedules, but it changes what happens to the tokens that are not selected. Can we afford to ignore such adaptability? For grounding-sensitive tasks, where a token that looks irrelevant early may matter later, we can't.

For those interested in diving deeper, the code and data are available at https://github.com/elmma/mllm-reroute/. It's a compelling case for a methodology that marries efficiency with flexibility. The question isn't whether we should adopt Reroute, but how soon.
