Quantized Language Models: What's Really Surviving the Cut?

When deploying large language models, quantization is the go-to move. But just because a model's performance metrics, like perplexity, remain intact doesn't mean everything is hunky-dory under the hood. The real question: are the models' brains, their interpretable features, still functioning after the quantization chop?

Model Compression: A Double-Edged Sword

Sparse autoencoder (SAE) features are what we're talking about. They're like the neural pathways that keep the model smart. But shrink the model down, and you're playing with fire. The study checked out Pythia-70M and Gemma-2-2B, dialing precision down from INT8 to INT4. It found that features don't just vanish, they degrade. At INT6, Pythia-70M kept 62.4% of its active features, while Gemma-2-2B held onto 51.3%. Not awful, but are we okay with those odds?
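To make "kept X% of its active features" concrete, here's a toy sketch of how such a survival rate could be measured. Everything in it is an assumption for illustration: the one-layer "model", the random SAE encoder, the per-row symmetric quantizer, and the rule that a feature "survives" if it stays above a threshold after quantization. The study's actual models, SAEs, and survival criterion may differ.

```python
import random

def fake_quantize(row, bits):
    """Symmetric uniform quantization of one weight row to a signed
    `bits`-bit grid, then back to floats (assumed scheme, per-row scale)."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def sae_features(hidden, encoder, bias):
    """ReLU encoder of a sparse autoencoder: f = relu(E @ h + b)."""
    return [max(0.0, a + b) for a, b in zip(matvec(encoder, hidden), bias)]

def survival_rate(fp_feats, q_feats, threshold=0.0):
    """Fraction of features active at full precision that are still active
    after quantization (hypothetical definition of 'survival')."""
    active = [i for i, f in enumerate(fp_feats) if f > threshold]
    if not active:
        return 1.0
    kept = sum(1 for i in active if q_feats[i] > threshold)
    return kept / len(active)

# Toy setup: random weights stand in for a real model and a trained SAE.
rng = random.Random(0)
d_in, d_hidden, n_feats = 16, 16, 64
weights = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
encoder = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(n_feats)]
bias = [-4.0] * n_feats                     # negative bias keeps features sparse
x = [rng.gauss(0, 1) for _ in range(d_in)]

fp_feats = sae_features(matvec(weights, x), encoder, bias)
rates = {}
for bits in (8, 6, 4):
    q_weights = [fake_quantize(row, bits) for row in weights]
    q_feats = sae_features(matvec(q_weights, x), encoder, bias)
    rates[bits] = survival_rate(fp_feats, q_feats)
    print(f"INT{bits}: {rates[bits]:.1%} of active features survive")
```

A real audit would average over many inputs and use the trained SAE for the layer in question. The point here is only the shape of the measurement: same SAE, same inputs, two versions of the model.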

Let's get real: task metrics like perplexity might not wave the red flag when features melt away. On Gemma-2-2B, INT7 actually improves perplexity while trashing 18.7% of features. That's like putting a fresh coat of paint on a crumbling wall. It looks good, but it doesn't fix the structural issues.

Predictable Vulnerabilities

Here's where it gets spicy. A feature's full-precision stats alone can predict whether it survives quantization, with AUCs from 0.92 to 0.97. Peak activation lit up as the strongest predictor. So, why aren't we using this intel before the quantization axe drops? If you're ignoring these predictions, you're not just playing roulette with model integrity, you're throwing cash down the drain.
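For anyone rusty on AUC: it's the probability that a randomly chosen survivor gets a higher score than a randomly chosen casualty. Here's a minimal sketch that scores features by a single full-precision statistic, peak activation. The numbers are invented for illustration, and the study's predictor presumably combines several statistics rather than just this one.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (survivor, non-survivor) pairs where the survivor scores higher.
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one survivor and one non-survivor")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical data: one entry per SAE feature.
peak_activation = [0.2, 0.5, 1.1, 2.3, 3.0, 4.2]   # measured at full precision
survived = [0, 0, 1, 0, 1, 1]                      # 1 = still active after quantization

print(f"AUC of peak activation as a survival predictor: "
      f"{auc(peak_activation, survived):.3f}")
```

An AUC of 0.5 is a coin flip and 1.0 is a perfect ranking, so 0.92 to 0.97 means the at-risk features are largely identifiable before any quantization happens. The pairwise loop is quadratic, which is fine for a sketch but not for millions of features.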

Quantization and magnitude pruning share a nasty secret. Both damage overlapping feature sets. We're talking about a Jaccard overlap of 0.79 to 0.86 and a Spearman correlation of 0.98 for damage scores. Translation: both methods compress models in ways that leave them vulnerable in the same spots.
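Those two numbers measure different things. Jaccard asks how much the sets of damaged features overlap, while Spearman asks whether the two methods rank features by damage in the same order. A rough sketch of both follows, with made-up damage scores and a hypothetical 0.5 cutoff for calling a feature "damaged". How the study defines damage and picks its sets isn't stated here.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-feature damage scores under each compression method.
quant_damage = [0.9, 0.1, 0.7, 0.3, 0.8]
prune_damage = [0.8, 0.6, 0.9, 0.1, 0.4]
cutoff = 0.5  # assumed threshold for counting a feature as damaged

quant_hit = {i for i, d in enumerate(quant_damage) if d > cutoff}
prune_hit = {i for i, d in enumerate(prune_damage) if d > cutoff}

print(f"Jaccard overlap of damaged sets: {jaccard(quant_hit, prune_hit):.2f}")
print(f"Spearman correlation of damage:  {spearman(quant_damage, prune_damage):.2f}")
```

The toy numbers land far below the reported 0.79 to 0.86 and 0.98, and that contrast is the takeaway: values as high as the study's mean the two methods are hitting nearly the same features in nearly the same order.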

The Bottom Line

If you're content with just behavioral parity, you're missing the bigger picture. This isn't just a tech issue, it's a call for comprehensive feature-level audits. As models get faster and leaner, we need to stop assuming they stay smart. Quantization's speed boost isn't worth the price if it's built on crumbling foundations. Solana doesn't wait for permission, and neither should we when it comes to holding our models accountable.