Breaking the Chains of Dense Attention in Video Diffusion

Dense self-attention has long been the chain around the neck of long-video diffusion inference. As sequence length grows, cost soars and quality stagnates, and generated videos tend to fall into repetitive loops. Now a new method is changing the game: Long Video Sparse Attention (LVSA).

The Problem with Dense Attention

Dense attention's primary flaw is that its compute grows quadratically with sequence length. Beyond the training horizon, models often freeze and generate static, repetitive content. Existing state-of-the-art methods try to tackle this, but they either require costly full-model retraining or fail to deliver on both performance and quality.
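To make the quadratic growth concrete, here is a minimal Python sketch that counts the query-key pairs dense self-attention has to score. The token count of 1,000 is an arbitrary illustrative figure, not a number from any of the models discussed here, and the function names are hypothetical.

```python
def dense_attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by dense self-attention."""
    return num_tokens * num_tokens


def cost_ratio(base_tokens: int, horizon_multiplier: int) -> float:
    """How much more pairwise work a longer sequence needs than the base one."""
    longer_tokens = base_tokens * horizon_multiplier
    return dense_attention_pairs(longer_tokens) / dense_attention_pairs(base_tokens)


# Illustrative only: a sequence 6 times longer needs 36 times the pairwise work.
print(cost_ratio(1_000, 6))
```

Because the ratio depends only on the multiplier, stretching a video to six times its original token count multiplies the attention work by 36, whatever the starting length.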

Enter LVSA, a training-free, model-agnostic solution that slashes compute demands by using block-sparse attention. The method breaks away from the fixed-grid bias, which is notorious for causing long-range temporal hiccups in videos. The result? LVSA not only cuts compute by up to 3.17x on models like Wan 2.1 1.3B at a 6x horizon, but also reduces the memory burden, allowing extended generation horizons that were previously out of reach on a single GPU.
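The article does not spell out which blocks LVSA keeps, so the sketch below is not LVSA's pattern. It shows only the general idea of block-sparse attention: tokens are grouped into blocks, and each query block scores a subset of key blocks instead of all of them. The rule used here (a local window plus the first block) is an assumption chosen for simplicity, and it is exactly the sort of fixed pattern LVSA is described as moving away from. All names are hypothetical.

```python
def block_sparse_mask(num_blocks: int, window: int) -> list:
    """Return mask[q][k], True when query block q attends to key block k.

    Assumed pattern (illustrative, not LVSA's): keep key blocks within
    `window` blocks of the query block, plus block 0 as a global anchor.
    """
    return [
        [abs(q - k) <= window or k == 0 for k in range(num_blocks)]
        for q in range(num_blocks)
    ]


def kept_fraction(mask: list) -> float:
    """Fraction of block pairs that still need to be computed."""
    total = len(mask) * len(mask[0])
    kept = sum(sum(row) for row in mask)
    return kept / total


mask = block_sparse_mask(num_blocks=8, window=1)
print(kept_fraction(mask))  # 28 of 64 block pairs are kept
```

With 8 blocks and a window of 1, only 28 of the 64 block pairs are computed. Skipped blocks cost neither compute nor attention-score memory, which is where block-sparse methods in general get their savings.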

Why LVSA Matters

If you're asking, "Why should I care?" consider the sheer efficiency gains. LVSA delivers up to 2.41x speedups over rival approaches like RIFLEx and an impressive 3.27x over UltraViCo on Wan 2.1 1.3B. That's not just a tweak. It's a revolution in how we approach video diffusion.

LVSA's prowess isn't limited to traditional GPUs. The results extend across platforms, including NPUs, showing up to 2.71x speedups on Wan 2.2 A14B. This means more flexibility and broader applicability for developers and researchers everywhere.

A Fresh Take on Quality Evaluation

LVSA tackles not only efficiency but also quality. It comes with VQeval, a new tool developed to properly score video failures and give a fairer evaluation than the previously used VBench-Long, which often rewarded looping videos. LVSA maintains quality at the training horizon and even improves it at extended lengths.

But who benefits from this innovation? The real winners are those looking to push beyond the limits of current video generation without sacrificing quality or skyrocketing costs. LVSA proves that you don't need to break the bank to break through performance barriers.

In AI, where breakthroughs are often hailed without scrutiny, it's vital to ask, "Whose data? Whose labor? Whose benefit?" LVSA offers a glimpse into a future where efficiency and accessibility aren't mutually exclusive. Yet, as with any innovation, the benchmark doesn't capture what matters most. It's about power, not just performance.
