PeRQ: A New Era for Quantization with Block Rotations
PeRQ transforms post-training quantization by redistributing activation mass before rotation, achieving significant accuracy gains even with smaller block sizes.
Quantization makes neural networks cheaper to run by lowering numerical precision, ideally without sacrificing much accuracy. Recent post-training quantization (PTQ) methods have adopted block rotations, which rotate activations in fixed-size blocks rather than as one full vector, to spread out outliers before the values are rounded.
Cracking the Block Code
The central question: how effective are block rotations at suppressing outliers? Recent research presents the first comprehensive analysis, revealing that the geometry of the input vector fundamentally limits outlier suppression. In the deterministic worst case, post-rotation outliers are minimized when the pre-rotation mass is evenly distributed across the blocks.
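To make the geometry point concrete, here is a minimal sketch. It assumes a normalized Hadamard rotation applied per block and measures "mass" by absolute value; the article names neither, and the function names are made up for illustration. Two vectors carry the same total mass, but one packs it into a single block.

```python
import math

def hadamard_rotate(block):
    """Orthonormal fast Walsh-Hadamard transform (block length must be a power of two)."""
    out = list(block)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def block_rotate(x, block_size):
    """Rotate each contiguous block of x independently (assumed block-diagonal rotation)."""
    out = []
    for start in range(0, len(x), block_size):
        out.extend(hadamard_rotate(x[start:start + block_size]))
    return out

def max_abs(x):
    return max(abs(v) for v in x)

# Same total mass (8.0) in both vectors; block size 4 gives two blocks.
concentrated = [4.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # all mass in block 0
balanced = [4.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0]      # mass split evenly

peak_concentrated = max_abs(block_rotate(concentrated, 4))  # 4.0
peak_balanced = max_abs(block_rotate(balanced, 4))          # 2.0
peak_full = max_abs(block_rotate(concentrated, 8))          # full-vector rotation, about 2.83

print(peak_concentrated, peak_balanced, peak_full)
```

In this toy case the block rotation leaves a peak of 4.0 when the mass sits in one block, while the balanced layout drops it to 2.0, below even the full-vector rotation of the concentrated input.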
The paper's key contribution is PeRQ (Permute, Rotate, then Quantize), a PTQ framework designed to tackle this issue head-on. PeRQ redistributes activation mass using permutations before rotation, guided by insights from the new analysis. Simply put, it repositions the activation 'weight' to ensure the mass is more evenly spread out.
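The permute, rotate, quantize order can be sketched end to end. This is an illustration under stated assumptions, not the authors' implementation: the rotation is a normalized Hadamard transform per block, the quantizer is a simple symmetric INT4 scheme with one scale per vector, and the permutation is hand-picked. All names are hypothetical.

```python
import math

def hadamard_rotate(block):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(block)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]

def block_rotate(x, block_size):
    out = []
    for start in range(0, len(x), block_size):
        out.extend(hadamard_rotate(x[start:start + block_size]))
    return out

def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest with a single scale (an assumed, simple quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def permute_rotate_quantize(x, perm, block_size, bits=4):
    """Permute, block-rotate, quantize, then undo both steps to measure the error."""
    permuted = [x[p] for p in perm]                # position i takes channel perm[i]
    rotated = block_rotate(permuted, block_size)
    dequantized = quantize_dequantize(rotated, bits)
    unrotated = block_rotate(dequantized, block_size)  # normalized Hadamard is its own inverse
    restored = [0.0] * len(x)
    for i, p in enumerate(perm):
        restored[p] = unrotated[i]
    return restored

def l2_error(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

x = [4.0, 4.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
identity = list(range(8))
swap_1_and_4 = [0, 4, 2, 3, 1, 5, 6, 7]  # hand-picked: moves one large channel to block 1

err_identity = l2_error(x, permute_rotate_quantize(x, identity, 4))
err_permuted = l2_error(x, permute_rotate_quantize(x, swap_1_and_4, 4))
print(err_identity, err_permuted)
```

With the swap, the largest rotated value falls from 5.0 to 3.0, so the INT4 step size shrinks and the reconstruction error drops with it.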
Introducing Greedy Mass Diffusion
PeRQ doesn't stop at simple permutation. It incorporates a greedy mass diffusion algorithm that calibrates these permutations by equalizing expected blockwise norms. The result? Improved quantization accuracy, without adding inference overhead.
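The article does not spell out the algorithm, so the following is only one plausible greedy heuristic for equalizing blockwise mass, not the paper's method. It assumes mass is the mean absolute activation per channel over a calibration set, and it places channels, heaviest first, into whichever block is currently lightest and still has room. Function names and data are invented for illustration.

```python
def expected_channel_mass(calibration_rows):
    """Mean absolute activation per channel over calibration samples (assumed mass measure)."""
    num_rows = len(calibration_rows)
    num_channels = len(calibration_rows[0])
    return [
        sum(abs(row[c]) for row in calibration_rows) / num_rows
        for c in range(num_channels)
    ]

def greedy_balance_permutation(channel_mass, block_size):
    """Assign channels, heaviest first, to the lightest block that still has capacity."""
    n = len(channel_mass)
    if n % block_size != 0:
        raise ValueError("number of channels must be a multiple of block_size")
    num_blocks = n // block_size
    blocks = [[] for _ in range(num_blocks)]
    totals = [0.0] * num_blocks
    heaviest_first = sorted(range(n), key=lambda c: channel_mass[c], reverse=True)
    for channel in heaviest_first:
        open_blocks = [b for b in range(num_blocks) if len(blocks[b]) < block_size]
        target = min(open_blocks, key=lambda b: totals[b])
        blocks[target].append(channel)
        totals[target] += channel_mass[channel]
    # Flatten so that position i of the permuted vector holds channel perm[i].
    perm = [channel for block in blocks for channel in block]
    return perm, totals

# Toy calibration set: two samples, eight channels, two heavy channels side by side.
calibration = [
    [4.0, 4.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    [-4.0, 4.0, -1.0, 1.0, 0.0, 0.0, 1.0, -1.0],
]
mass = expected_channel_mass(calibration)
perm, block_totals = greedy_balance_permutation(mass, block_size=4)
print(perm, block_totals)
```

Before permutation the two blocks carry masses of 10 and 2; after this greedy pass they carry 6 each, which is the kind of balance the analysis favors.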
Here's what's impressive: PeRQ manages to merge these permutations into model weights prior to deployment. How? By identifying permutation-equivariant regions in transformer architectures, where reordering channels and reordering the adjacent weights to match leaves the output unchanged. The ablation study bears this out, showing consistent accuracy improvements across all block sizes.
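Why merging costs nothing at inference time is easiest to see on a toy example. The sketch below uses a two-layer MLP with an elementwise ReLU, a stand-in chosen for illustration; the article does not say which transformer regions the paper uses, and the helper names are hypothetical. Permuting the rows of the first weight matrix and the columns of the second reorders the hidden channels without changing the output.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(values):
    return [max(0.0, v) for v in values]

def fold_permutation(w_in, w_out, perm):
    """Bake a hidden-channel permutation into the two adjacent weight matrices."""
    w_in_folded = [w_in[p] for p in perm]                    # reorder rows (hidden outputs)
    w_out_folded = [[row[p] for p in perm] for row in w_out]  # reorder columns to match
    return w_in_folded, w_out_folded

w_in = [
    [1.0, -2.0, 0.0],
    [0.0, 1.0, 1.0],
    [-1.0, 0.0, 2.0],
    [2.0, 1.0, -1.0],
]
w_out = [
    [1.0, 0.0, -1.0, 2.0],
    [0.0, 1.0, 1.0, -1.0],
]
x = [1.0, 2.0, 3.0]
perm = [2, 0, 3, 1]  # arbitrary example permutation of the four hidden channels

hidden = relu(matvec(w_in, x))
y = matvec(w_out, hidden)

w_in_folded, w_out_folded = fold_permutation(w_in, w_out, perm)
hidden_folded = relu(matvec(w_in_folded, x))
y_folded = matvec(w_out_folded, hidden_folded)

print(hidden, hidden_folded)  # same values, reordered by perm
print(y, y_folded)            # identical outputs
```

Because the first layer now emits its activations already in permuted order, no gather or shuffle op is needed at runtime; the rotation and quantizer simply see the reordered channels.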
PeRQ's Transformative Impact
Why does this matter? When quantizing Llama3 1B to INT4, PeRQ recovers up to 90% of full-vector rotation perplexity with a block size of 16. That's a stark contrast to the 46% recovery without permutations. The impact here could reshape how we think about PTQ frameworks.
Are traditional quantization methods now obsolete? Not entirely. But PeRQ signals a significant evolution. By cleverly redistributing activation mass, PeRQ challenges the notion that large post-rotation outliers are an unavoidable consequence of using block rotations.
Code and data are available at the researchers' repository, inviting the community to dive deeper into these findings and build upon this promising work. As quantization strategies continue to evolve, PeRQ's insights may well become a cornerstone of future innovations.
Key Terms Explained
Inference: Running a trained model to make predictions on new data.
Perplexity: A measurement of how well a language model predicts text.
Quantization: Reducing the precision of a model's numerical values, for example from 32-bit to 4-bit numbers.
Training: The process of teaching an AI model by exposing it to data and adjusting its parameters to minimize errors.