Predicting Sparse Autoencoder Steering: A New Approach

Sparse autoencoder (SAE) features are widely used to steer language models. Steering is not always straightforward, though: the same intervention can behave inconsistently across contexts and can disturb unrelated features. A new pre-intervention screening framework aims to forecast these side effects before any steering is applied.

New Framework for Predicting Side Effects

The research introduces a system that predicts SAE steering side effects before any intervention is run. Using feature statistics, the framework forecasts outcomes along two axes of steering modularity: effect stability and collateral spread. It was tested on GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B, across several types of SAE dictionary, including ReLU, JumpReLU, and TopK.
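The article does not give formulas for the two axes, so the following is only a sketch of one plausible way to measure them after a steering run. The function names, the coefficient-of-variation form of stability, and the 0.1 threshold are all assumptions made for illustration, not the paper's definitions.

```python
from statistics import mean, pstdev

def effect_stability(target_effects):
    """Assumed metric: 1 / (1 + coefficient of variation) of the steering
    effect on the target, measured over several contexts. Equals 1.0 when
    the effect is identical in every context and falls toward 0 as it varies."""
    mu = mean(target_effects)
    if mu == 0:
        return 0.0
    cv = pstdev(target_effects) / abs(mu)
    return 1.0 / (1.0 + cv)

def collateral_spread(off_target_changes, threshold=0.1):
    """Assumed metric: fraction of non-target features whose activation
    shifts by more than `threshold` (a hypothetical cutoff) under steering."""
    if not off_target_changes:
        return 0.0
    moved = sum(1 for delta in off_target_changes if abs(delta) > threshold)
    return moved / len(off_target_changes)

# Toy values only: effect sizes per context, and per-feature activation shifts.
stability = effect_stability([0.9, 1.1, 1.0])
spread = collateral_spread([0.0, 0.2, -0.3, 0.05])
```

Under these assumed definitions, a "clean" feature would score high on stability and low on spread. The framework's goal is to predict such scores from feature statistics without running the intervention at all.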

What stands out is the predictive power of certain factors. Decoder geometry, activation statistics, co-activation structure, and the direct-logit footprint of a feature proved better predictors of steering modularity than simpler frequency-only and activation-magnitude baselines. The signal was strongest in GPT-2-small, Pythia-70M-deduped, and Llama-3.1-8B, where it remained robust to magnitude-related confounds. In Gemma-2-2B, by contrast, the predictive signal was weaker.
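To make "decoder geometry" concrete, here is a minimal sketch of one statistic that could fall under that heading: how closely a feature's decoder direction aligns with any other feature's direction. The article does not say which geometric quantities the paper uses, so the function `max_decoder_overlap` and the toy decoder matrix are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def max_decoder_overlap(decoder, idx):
    """Largest absolute cosine similarity between feature `idx` and any
    other row of `decoder` (one row per feature). Assumed intuition: a
    high value means the steering direction overlaps another feature."""
    return max(
        (abs(cosine(decoder[idx], row)) for j, row in enumerate(decoder) if j != idx),
        default=0.0,
    )

# Toy decoder: three features in a two-dimensional residual space.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
overlap = max_decoder_overlap(decoder, 0)
```

A frequency-only baseline would ignore this kind of information entirely, which is one way to read why geometric and co-activation statistics might carry extra signal.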

Model-Specific Outcomes

The research goes further with a held-out screening test: can the framework rank unseen features by their predicted cleanliness? The results are promising but vary by model. GPT-2 showed the cleanest improvements, Pythia improved mainly on stability, Llama on collateral spread, and Gemma saw only partial improvements. Why does Gemma lag behind? Is it a model limitation or something more structural?
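One way to picture a held-out screening test is sketched below: keep the features with the highest predicted cleanliness and check whether their measured side effects are lower than the average over all held-out features. The `screening_payoff` function, the `keep_fraction` parameter, and the toy numbers are assumptions for illustration; the article does not describe the paper's actual evaluation protocol.

```python
def screening_payoff(predicted_cleanliness, measured_side_effect, keep_fraction=0.25):
    """Mean measured side effect over all held-out features minus the mean
    over the top `keep_fraction` ranked by predicted cleanliness.
    Positive values mean the screen selected cleaner-than-average features."""
    n = len(predicted_cleanliness)
    if n == 0 or n != len(measured_side_effect):
        raise ValueError("inputs must be non-empty and of equal length")
    k = max(1, int(n * keep_fraction))
    # Rank feature indices from most to least predicted-clean.
    order = sorted(range(n), key=lambda i: predicted_cleanliness[i], reverse=True)
    kept = order[:k]
    baseline = sum(measured_side_effect) / n
    screened = sum(measured_side_effect[i] for i in kept) / k
    return baseline - screened

# Toy held-out set: predictions made before steering, side effects measured after.
predicted = [0.9, 0.1, 0.8, 0.2]
measured = [0.1, 0.9, 0.2, 0.8]
payoff = screening_payoff(predicted, measured, keep_fraction=0.5)
```

In this toy case the two kept features have a mean side effect of 0.15 against a baseline of 0.5, so the payoff is about 0.35. A payoff near zero would mean the ranking did no better than picking features at random.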

A controlled comparison using Llama Scope varied dictionary width from 32K to 128K. The predictive signal persisted, but the screening payoffs were less stable. This highlights an important aspect of the research: SAE steering side effects can be predicted, but the best predictors and their effectiveness depend heavily on the model and the dictionary.

Implications for Language Models

The paper, published in Japanese, marks a significant advance for people who work with language models. Knowing how to predict and manage the side effects of SAE interventions could make steering more efficient and reliable. For researchers and developers looking to steer these systems, the framework could prove very useful. But will broader adoption follow? That's less clear, as the approach requires careful model-specific adjustments.

Western coverage has largely overlooked this. Yet the benchmark results speak for themselves. Being able to predict unwanted side effects, and so screen them out before intervening, matters a great deal for steering language models.