Revolutionizing Quantization: QAM-W Challenges the Status Quo
QAM-W, a new quantization method, redefines efficiency. It maintains accuracy while using fewer bits, challenging existing models like SmoothQuant.
race to optimize model efficiency, a new player has emerged: QAM-W, or Quadrature Amplitude Modulation for Weights. It presents a novel approach to post-training quantization, notably preserving the intrinsic structure of weight rows, a feat that many scalar quantizers struggle to achieve.
Breaking Down QAM-W
QAM-W stands out by L2-normalizing each row, applying block-Hadamard rotations, and then pairing weights into 2D coordinates. These are then quantized using a Lloyd-Max codebook, tuned to the unit circular Gaussian. This isn't just technical jargon for the sake of it. The reality is, this method allows for activation-aware per-channel scaling, a major shift in maintaining model accuracy.
Here's what the benchmarks actually show: In trials across five large language models, ranging from 1.1 billion to 13 billion parameters, the activation-aware variant of QAM-W managed to maintain perplexity rates within ±0.4% of BF16 WikiText-2 benchmarks. And it did this using 32% fewer weight bits compared to the SmoothQuant W8A8 model. That's a significant reduction without sacrificing performance.
Why It Matters
Strip away the marketing, and you get a quantization method that genuinely innovates. The joint 2D coding approach outperforms traditional polar coding, offering an improvement of 2 to 15 percentage points in perplexity at the same bitrate. Moreover, the correlation between the Kullback-Leibler divergence and perplexity change remains nearly perfect, with a Spearman correlation of 0.99 across 37 method-model pairs.
So, why should this matter to you? At a more aggressive 3.5 bits per weight (bpw) setting, QAM-W proves competitive on architectures that tolerate quantization. Even under strict constraints of 4 bpw, it's clear there's competition, as the rotated-codebook approach, QTIP, outperforms QAM-W. Yet, QAM-W shines in the 5-6 bpw range, maintaining quality while conserving bandwidth.
The Road Ahead
Frankly, the architecture matters more than the parameter count. As models grow, the demand for efficiency grows with them. QAM-W presents a compelling case for adopting smarter, more efficient quantization methods that don't compromise on accuracy. The question is no longer whether we can reduce model sizes, but how intelligently we can do it without losing performance. QAM-W might just be the key to unlocking that balance.
Get AI news in your inbox
Daily digest of what matters in AI.
Key Terms Explained
A value the model learns during training — specifically, the weights and biases in neural network layers.
A measurement of how well a language model predicts text.
Reducing the precision of a model's numerical values — for example, from 32-bit to 4-bit numbers.
The process of teaching an AI model by exposing it to data and adjusting its parameters to minimize errors.