Rethinking Stability in Sparse Autoencoders: Misleading Metrics and Real Lessons

Sparse autoencoders (SAEs) have become a staple in producing interpretable features from neural network activations. If you've ever trained a model, you know the frustration when your results vary wildly due to random initialization. While many claim SAEs can solve this instability, the latest findings suggest things aren't that simple.

The Promise and Reality of Archetypal SAEs

In 2025, Fel and colleagues introduced archetypal SAEs, promising more stable dictionary learning. The idea was that these models could extract reliable concepts, reducing the dreaded polysemanticity that often plagues feature interpretation. But here's the thing: stability in these models was largely a product of identical initializations across runs. That's like claiming you've solved a puzzle without mixing up the pieces first.

Think of it this way: if everyone starts from the same point in a race, of course they'll finish more closely together. When the deterministic k-means decoder initialization was removed, these supposed stability gains vanished. So, can we really say these models are inherently more stable?
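To make the ablation concrete, here is a minimal sketch of how one might compare shared-initialization runs against independently initialized ones. Everything in it is an assumption for illustration: `train_toy_dictionary` is a hypothetical spherical k-means refinement standing in for real SAE training, `agreement` is one simple matching score, and the data is random noise. Because this toy training is deterministic, the shared-init runs agree trivially. That is the point: the agreement comes from the starting point, not from the method.

```python
import numpy as np

def train_toy_dictionary(data, init, steps=20):
    """Hypothetical stand-in for SAE training: spherical k-means refinement of `init`."""
    atoms = init / np.linalg.norm(init, axis=1, keepdims=True)
    for _ in range(steps):
        # Assign each sample to the atom with the highest cosine similarity.
        assign = (data @ atoms.T).argmax(axis=1)
        for j in range(len(atoms)):
            members = data[assign == j]
            if len(members):  # keep the old atom if nothing was assigned to it
                mean = members.mean(axis=0)
                atoms[j] = mean / np.linalg.norm(mean)
    return atoms

def agreement(dict_a, dict_b):
    """Mean best-match cosine similarity; assumes unit-norm atoms."""
    return float((dict_a @ dict_b.T).max(axis=1).mean())

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 16))
data /= np.linalg.norm(data, axis=1, keepdims=True)

# Condition 1: both runs start from the same decoder initialization.
shared_init = rng.normal(size=(8, 16))
run_a = train_toy_dictionary(data, shared_init)
run_b = train_toy_dictionary(data, shared_init)

# Condition 2 (the ablation): the second run gets its own random initialization.
run_c = train_toy_dictionary(data, rng.normal(size=(8, 16)))

shared_score = agreement(run_a, run_b)
independent_score = agreement(run_a, run_c)
print(f"shared init: {shared_score:.3f}, independent init: {independent_score:.3f}")
```

In a real evaluation both conditions would use the actual training pipeline with different seeds for data order, and the gap between the two scores is what the ablation reports.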

Why Stability Claims Need a Rethink

There's an important distinction in mechanistic interpretability that gets muddled in discussions about SAEs. Stability is about agreement between two independently trained models. Stabilization, by contrast, refers to runs that start from different random points yet converge to the same solution. This isn't just semantics. In fields like NLP, where researchers rely on stable features to claim they're reusable analytical units, getting this wrong could lead to misguided conclusions.
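As a rough illustration of what "agreement between two independently trained models" can look like in code, here is a sketch of one common style of score: for each feature direction in one dictionary, take the best cosine match in the other, then average. The function name `mean_max_cosine` and the random dictionaries are assumptions for this example, not the metric used in the study.

```python
import numpy as np

def mean_max_cosine(dict_a: np.ndarray, dict_b: np.ndarray) -> float:
    """Average, over atoms in dict_a, of the best cosine match in dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    sims = a @ b.T  # pairwise cosine similarities, shape (n_a, n_b)
    return float(sims.max(axis=1).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(16, 32))          # hypothetical decoder: 16 atoms, 32 dims
permuted = base[rng.permutation(16)]      # same atoms, different order
unrelated = rng.normal(size=(16, 32))     # a dictionary with no shared structure

same = mean_max_cosine(base, permuted)    # order should not matter
diff = mean_max_cosine(base, unrelated)
print(f"permuted copy: {same:.3f}, unrelated dictionary: {diff:.3f}")
```

Note that this score only looks at the endpoints of training. It says nothing about whether the two runs started from the same place, which is exactly the gap the stability versus stabilization distinction is meant to expose.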

Our analysis also uncovered an unexpected hurdle: preprocessing-dependent cosine geometry can skew endpoint stability metrics. In plain terms, how you process your data might lead you to believe your model is more stable than it is. So, are we measuring real stability, or just the quirks of our preprocessing steps?
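One way this can happen, offered here as an assumption rather than the study's exact mechanism: if activations are not centered, learned directions may all share a large common component, and cosine similarity between otherwise unrelated directions gets pushed toward 1. The sketch below uses two random, unrelated dictionaries and a hypothetical shared `offset` to show the effect.

```python
import numpy as np

def mean_max_cosine(dict_a: np.ndarray, dict_b: np.ndarray) -> float:
    """Average best-match cosine similarity from dict_a into dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return float((a @ b.T).max(axis=1).mean())

rng = np.random.default_rng(0)
dict_a = rng.normal(size=(16, 32))   # two unrelated random dictionaries
dict_b = rng.normal(size=(16, 32))

# Hypothetical shared component, e.g. the mean of uncentered activations.
offset = 5.0 * np.ones(32)

raw = mean_max_cosine(dict_a + offset, dict_b + offset)
centered = mean_max_cosine(dict_a, dict_b)
print(f"with shared offset: {raw:.3f}, centered: {centered:.3f}")
```

The two dictionaries have nothing in common beyond the offset, yet the uncentered score looks like strong agreement. Reporting the metric under more than one preprocessing choice is a cheap check against this.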

The Bigger Picture

Here's why this matters for everyone, not just researchers. If we're going to rely on SAEs for critical applications, we need to understand the limitations of current stability metrics. Otherwise, we're building on shaky ground. And let's face it, nobody wants a foundation that's more sandcastle than skyscraper.

Ultimately, the study emphasizes the importance of trajectory diagnostics and initialization ablations in evaluating model stability. It's not enough to just look at the end results. We have to understand the journey, too. With AI playing an ever-growing role in technology and beyond, ensuring the robustness of our models isn't just an academic exercise. It's essential for trust and reliability in AI systems.
