Steering Language Models: Predictable Yet Tricky

Language models are powerful, but steering them with Sparse Autoencoder (SAE) features can be messy. The idea sounds simple: tweak a feature and watch the model shift its behavior. Yet the road to effective steering is far from smooth. Inconsistent features and unexpected side effects can throw a wrench in the works.
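
To make "tweak a feature" concrete, here is a minimal sketch of one common form of SAE steering: adding a scaled copy of a feature's decoder direction to a hidden-state vector. The article does not say which intervention was used, so the additive form, the function name `steer`, and the toy numbers are all assumptions for illustration.

```python
def steer(hidden, decoder_direction, strength):
    """Add a scaled SAE decoder direction to one hidden-state vector.

    Assumption: steering is additive (h' = h + strength * d). Other
    interventions, such as clamping a feature's activation, also exist.
    """
    return [h + strength * d for h, d in zip(hidden, decoder_direction)]


# Toy 3-dimensional hidden state and a feature that writes only to dim 0.
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, 0.0]

steered = steer(hidden, direction, strength=4.0)
print(steered)  # [4.5, -1.0, 2.0]
```

In a real model the direction is rarely this tidy. It overlaps with other features, and that overlap is one place side effects can come from.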

Predictability Over Complexity

The intrigue lies in the predictability of these side effects. Statistics gathered by screening features before any intervention can forecast the potential hiccups in steering. How well that works varies across GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B. The magic lies in decoder geometry, activation stats, co-activation structures, and direct-logit footprints. Translation? These signals tell us more than mere activation magnitudes or frequency counts.
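
As a rough illustration of what such pre-intervention screening might look like, the sketch below computes one toy statistic for each of the four signal families. Every function name, the specific statistics chosen (maximum neighbor cosine, firing rate, Jaccard co-activation, decoder-to-unembedding dot products), and the toy data are assumptions. The article does not spell out the exact definitions used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_neighbor_cosine(decoder, f):
    """Decoder geometry: how closely feature f aligns with its nearest neighbor."""
    return max(abs(cosine(decoder[f], d)) for i, d in enumerate(decoder) if i != f)


def firing_rate(acts, f):
    """Activation stats: fraction of tokens on which feature f is active."""
    return sum(1 for row in acts if row[f] > 0) / len(acts)


def coactivation(acts, f, g):
    """Co-activation structure: Jaccard overlap of the tokens where f and g fire."""
    both = sum(1 for row in acts if row[f] > 0 and row[g] > 0)
    either = sum(1 for row in acts if row[f] > 0 or row[g] > 0)
    return both / either if either else 0.0


def logit_footprint(direction, unembed):
    """Direct-logit footprint: per-token logit change from writing `direction`."""
    return [sum(a * b for a, b in zip(direction, w)) for w in unembed]


# Toy SAE with 3 features in a 2-dimensional model (hypothetical values).
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Feature activations on 4 tokens (rows = tokens, columns = features).
acts = [[1.0, 0.0, 0.5], [0.0, 2.0, 0.0], [3.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
# Unembedding rows for a 3-token vocabulary.
unembed = [[2.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]

geometry = max_neighbor_cosine(decoder, 0)
rate = firing_rate(acts, 0)
overlap = coactivation(acts, 0, 2)
footprint = logit_footprint(decoder[0], unembed)
print(round(geometry, 4), rate, overlap, footprint)
```

Here feature 0 has a close geometric neighbor and always fires together with feature 2, which is the kind of profile a screen might flag as likely to cause collateral changes when steered.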

Across different settings, GPT-2-small, Pythia-70M, and Llama-3.1-8B shine the brightest. Their predictive signal holds up even after adjusting for magnitude-related confounds. But Gemma-2-2B? Not so much. It's like predicting the weather. Some days it's sunny, others it's a surprise downpour.
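
One standard way to adjust for a confound is a partial correlation: regress the confound out of both the predictor and the outcome, then correlate what is left. The sketch below assumes that approach, which the article does not confirm. The variable names and toy numbers are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def residuals(y, x):
    """Residuals of y after a least-squares linear fit on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


def pearson(u, v):
    """Pearson correlation (0.0 if either input has zero variance)."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v) if var_u and var_v else 0.0


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    return pearson(residuals(x, z), residuals(y, z))


# Hypothetical per-feature values: z is activation magnitude (the confound),
# `predictor` is a screening statistic, the other two are side-effect scores.
magnitude = [1.0, 2.0, 3.0, 4.0]
predictor = [2.0, 1.0, 2.0, 5.0]
side_effect_confounded = [2.0, -1.0, 6.0, 3.0]  # linked to predictor only via magnitude
side_effect_genuine = [3.0, 3.0, 5.0, 9.0]      # linked beyond magnitude

raw = pearson(predictor, side_effect_confounded)
adjusted = partial_correlation(predictor, side_effect_confounded, magnitude)
survives = partial_correlation(predictor, side_effect_genuine, magnitude)
print(round(raw, 3), adjusted, survives)  # 0.333 0.0 1.0
```

The first pair looks correlated until magnitude is removed, at which point the link vanishes. The second keeps its signal, which is the pattern the article reports for the stronger models.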

Model-Dependent Success

There's a catch, of course. Not all features steer equally well across contexts. GPT-2-small leans towards clean steering, Pythia-70M favors stability, Llama-3.1-8B is all about minimizing collateral damage, and Gemma-2-2B, well, it only partially plays along. Think of it as choosing the right tool for the job. Each model and setting comes with its own rules.

A controlled test on Llama's dictionary width, jumping from 32K to 128K, shows that predictive signals hold. But the stability of these results? Less assured. It's a reminder: with language models, certainty is a luxury.

Why It Matters

So, why should we care about SAE steering side effects? Because it’s not just a theoretical exercise. It’s about making these language models more efficient and reliable. Imagine a future where feature steering is as predictable as clockwork. The question really is: how do we harness this predictability across diverse models and settings?

Don't overlook the quirks and unique settings each model brings. If you're dabbling with these models, knowing what affects predictability could save you a world of hassle. Just remember, the path to clean steering isn't one-size-fits-all.