PolySAE: Unveiling Hidden Interactions in Neural Networks

Sparse autoencoders (SAEs) have become a staple for interpreting what neural networks represent internally. However, their limitations are becoming increasingly apparent. Traditional SAEs assume that features combine additively, reconstructing an activation as a weighted sum of feature directions. This assumption is overly simplistic. Think of it this way: can a purely additive model truly capture how 'Starbucks' emerges from the interplay of 'star' and 'coffee' features? The answer is a resounding no.

Challenging the Status Quo

Enter PolySAE, a novel approach that extends traditional SAEs with higher-order terms. PolySAE recognizes that features don't merely add up; they often interact multiplicatively, in ways that polynomial terms can express. It achieves this through low-rank tensor factorization within a shared projection subspace. The result? A model that captures both pairwise and triple interactions with a modest 3% parameter overhead on GPT-2.
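
To make the idea concrete, here is a minimal NumPy sketch of how pairwise and triple terms could be added through a shared low-rank projection. It is an illustration under stated assumptions, not PolySAE's actual implementation: the function name poly_decode, the parameter names (W_dec, U, C2, C3, b), the toy sizes, and the choice to put the higher-order terms on the reconstruction side are all hypothetical.

```python
import numpy as np

def poly_decode(z, W_dec, U, C2, C3, b):
    """Reconstruct an activation from sparse codes z (hypothetical sketch).

    z:     (n_latents,)         sparse feature activations
    W_dec: (n_latents, d_model) ordinary additive dictionary
    U:     (n_latents, rank)    shared projection used by both higher-order terms
    C2:    (rank, d_model)      output map for the pairwise term
    C3:    (rank, d_model)      output map for the triple term
    """
    h = z @ U                      # project codes into the shared low-rank subspace
    linear = z @ W_dec             # what a traditional SAE would compute
    pairwise = (h ** 2) @ C2       # implicitly sums z_i * z_j over all pairs
    triple = (h ** 3) @ C3         # implicitly sums z_i * z_j * z_k over all triples
    return b + linear + pairwise + triple

rng = np.random.default_rng(0)
n_latents, d_model, rank = 6, 4, 2   # toy sizes, chosen only for the example
W_dec = rng.normal(size=(n_latents, d_model))
U = rng.normal(size=(n_latents, rank))
C2 = rng.normal(size=(rank, d_model))
C3 = rng.normal(size=(rank, d_model))
b = np.zeros(d_model)
params = (W_dec, U, C2, C3, b)

# Two hypothetical features, standing in for 'star' and 'coffee'.
z_star = np.zeros(n_latents)
z_star[1] = 1.0
z_coffee = np.zeros(n_latents)
z_coffee[4] = 1.0

both = poly_decode(z_star + z_coffee, *params)
separate = poly_decode(z_star, *params) + poly_decode(z_coffee, *params)

# With b = 0, a purely additive model would give exactly zero here.
interaction = both - separate
```

The nonzero interaction vector is the part of the reconstruction that exists only when both features fire together. Because the full pairwise tensor is never materialized, the extra parameters in this sketch are just U, C2, and C3, which is how a low-rank factorization can keep the overhead small.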

Why does this matter? Because PolySAE demonstrates an 8% improvement in probing F1 scores across four language models and three SAE variants. More impressively, it shows 2 to 10 times larger Wasserstein distances between class-conditional feature distributions. This indicates a sharper distinction between different classes, providing clearer insights into the model's decision-making process.
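
For readers unfamiliar with the metric, the sketch below computes an empirical one-dimensional Wasserstein distance between a single feature's activations on two classes. It assumes equal-sized samples and a per-feature comparison; the function name wasserstein_1d and the sample values are made up for illustration, and the exact evaluation protocol behind the reported numbers may differ.

```python
def wasserstein_1d(sample_a, sample_b):
    """Empirical 1-D Wasserstein-1 distance for two equal-sized samples.

    For equal-sized samples, the optimal transport plan matches the i-th
    smallest value of one sample to the i-th smallest value of the other.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal-sized samples")
    a, b = sorted(sample_a), sorted(sample_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical activations of one feature on examples from two classes.
class_0 = [0.0, 0.1, 0.2, 0.3]
class_1 = [1.0, 1.1, 1.2, 1.3]

well_separated = wasserstein_1d(class_0, class_1)
overlapping = wasserstein_1d(class_0, [0.05, 0.1, 0.25, 0.3])
```

A larger distance means more "work" is needed to turn one class's activation distribution into the other's, which is why larger values are read as sharper class separation.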

Decoding the Noise

One of PolySAE's most intriguing revelations is how it decouples feature interactions from mere surface statistics. In traditional SAEs, feature covariance correlates strongly with co-occurrence frequency (r = 0.82). PolySAE's interaction weights, by contrast, are largely independent of it (r = 0.06). This suggests that PolySAE captures the underlying compositional structure of language data rather than just surface-level patterns.
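
A comparison of this kind can be sketched as follows: measure how often each pair of features fires together, then correlate that frequency with a per-pair score. The function name cooccurrence_correlation, the synthetic activations, and the placeholder score matrix are assumptions for illustration; how the per-pair scores are actually extracted from a trained model is not specified here.

```python
import numpy as np

def cooccurrence_correlation(acts, pair_scores):
    """Pearson r between pairwise co-occurrence frequency and a per-pair score.

    acts:        (n_tokens, n_latents) feature activations, zero when inactive
    pair_scores: (n_latents, n_latents) symmetric score for each feature pair,
                 e.g. a covariance or an interaction weight (hypothetical input)
    """
    active = (acts > 0).astype(float)
    cooc = active.T @ active / acts.shape[0]    # fraction of tokens where both fire
    upper = np.triu_indices(acts.shape[1], k=1) # each unordered pair i < j once
    return np.corrcoef(cooc[upper], pair_scores[upper])[0, 1]

rng = np.random.default_rng(0)
n_tokens, n_latents = 200, 5
fire_prob = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # synthetic firing rates
mask = rng.random((n_tokens, n_latents)) < fire_prob
acts = rng.random((n_tokens, n_latents)) * mask

# Placeholder scores with no built-in relation to co-occurrence.
random_scores = rng.normal(size=(n_latents, n_latents))
random_scores = (random_scores + random_scores.T) / 2
r = cooccurrence_correlation(acts, random_scores)
```

An r near 1 would mean the scores mostly restate how often features appear together; an r near 0 would mean they carry information that co-occurrence counts alone do not.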

The implications are significant. By understanding these interactions, PolySAE offers a more faithful representation of how language models parse and generate meaning. This isn't just a technical upgrade. It's a leap toward models that can genuinely understand and manipulate language as humans do.

The Path Forward

Yet, one must ask: why stop here? If PolySAE can reveal such hidden structure within current models, what other insights might it uncover in even more advanced architectures? It pushes us to rethink not just the tools we use, but the very assumptions that underpin machine learning.

Ultimately, PolySAE is a reminder that progress comes from the collaborative spaces of researchers who question the status quo. Their work doesn't just improve models. It reshapes our understanding of what these models can achieve.
