Function Vectors: Steering AI Models Beyond Decodability
Function vectors can steer AI models even when traditional decoding methods fail, a finding that challenges previous assumptions about how these models represent tasks.
Function vectors (FVs) have emerged as a compelling tool in AI, promising to steer large language models even when traditional decoding methods, like the logit lens, fall short. Now the largest cross-template FV transfer study to date challenges previous assumptions, revealing that steerability and decodability are decoupled.
The Study
The research analyzed 4,032 cross-template transfer pairs across 12 tasks and six models from three families: Llama-3.1-8B, Gemma-2-9B, and Mistral-7B-v0.3, each in both base and instruction-tuned variants. With eight templates per task, the study offers a thorough examination of how FVs transfer between prompt formats.
Contrary to the initial hypothesis, FV steering succeeded even when the logit lens could not decode the correct answer at any layer. This pattern of steerability without decodability held across every task and model, sometimes with decodability trailing steerability by as much as 0.91. Only three of 72 task-model instances showed the reverse pattern, decodability without steerability, and all three involved Mistral.
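To make the two measurements concrete: the logit lens reads a candidate answer out of each layer's hidden state by pushing it through the model's final norm and unembedding matrix. Below is a minimal sketch using PyTorch and Hugging Face transformers; the attribute names (`model.model.norm`, `model.lm_head`) assume a Llama-style decoder, and the prompt is purely illustrative, not one of the study's templates.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B"  # one of the study's six models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "France -> Paris\nJapan ->"  # illustrative in-context task
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

# Decode the final token position at every layer by pushing the hidden
# state through the final norm and the unembedding matrix.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h[:, -1, :]))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax(-1))!r}")
```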
What This Means
This finding is a game changer. It suggests that FVs encode computational instructions rather than straightforward answer directions. Even more intriguing, FVs operate optimally at early layers (L2-L8), while the logit lens only identifies correct answers in later layers (L28-L32).
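Steering, by contrast, adds the FV into the residual stream during a forward pass. Here is a minimal sketch, reusing `model` and `tok` from above and assuming a precomputed FV tensor `fv` of shape `[hidden_size]`; the study's FV extraction procedure is not reproduced here, and the file name is hypothetical.

```python
import torch

LAYER = 5  # early-layer injection, in the L2-L8 range the study highlights

def make_hook(vec):
    def hook(module, inputs, output):
        # Llama-style decoder layers return a tuple; hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        # Add the FV at every position for simplicity; some setups add it
        # only at the final token.
        hidden = hidden + vec.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

fv = torch.load("fv_antonym.pt")  # hypothetical precomputed function vector
handle = model.model.layers[LAYER].register_forward_hook(make_hook(fv))
try:
    steered = model.generate(**tok("hot ->", return_tensors="pt"), max_new_tokens=3)
    print(tok.decode(steered[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later runs are unsteered
```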
The previously reported negative correlation between FV cosine similarity and transfer success dissolves at scale, with pooled correlations ranging from -0.199 to +0.126. Cosine similarity adds negligible predictive value beyond task identity, undercutting the assumption that it meaningfully predicts transfer.
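The pooled-correlation check is simple to sketch: pair up the FVs extracted from different templates of the same task, compute their cosine similarity, and correlate it with how well each FV steers the other template. The file names and arrays below are hypothetical placeholders for precomputed results.

```python
import torch
from scipy.stats import pearsonr

fvs = torch.load("fvs_per_template.pt")        # [n_templates, hidden_size]
transfer_acc = torch.load("transfer_acc.pt")   # [n_templates, n_templates]

cos, acc = [], []
n = fvs.shape[0]
for i in range(n):
    for j in range(n):
        if i == j:
            continue  # skip same-template (non-transfer) pairs
        c = torch.nn.functional.cosine_similarity(fvs[i], fvs[j], dim=0)
        cos.append(c.item())
        acc.append(transfer_acc[i, j].item())

r, p = pearsonr(cos, acc)
print(f"pooled Pearson r = {r:.3f} (p = {p:.3g})")
```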
Implications for AI Development
This study also uncovers a divergence between model families. Mistral FVs rewrite intermediate representations, while Llama and Gemma FVs leave them nearly unchanged despite steering successfully. This bifurcation raises a critical question: are we approaching the limits of a single interpretability story for all model families?
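One way to see this divergence, sketched under the same assumptions as the steering snippet above: run the same prompt with and without the FV injected and compare the hidden state at a later layer. The probe layer and relative-norm metric are illustrative choices, not necessarily the study's.

```python
import torch

PROBE_LAYER = 20  # illustrative later layer to inspect

def hidden_at(ids, steer):
    handle = None
    if steer:
        handle = model.model.layers[LAYER].register_forward_hook(make_hook(fv))
    try:
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
    finally:
        if handle is not None:
            handle.remove()
    return out.hidden_states[PROBE_LAYER][:, -1, :]

ids = tok("hot ->", return_tensors="pt")
base = hidden_at(ids, steer=False)
steered = hidden_at(ids, steer=True)
# A family that "rewrites" representations shows a large relative change;
# near-zero change alongside successful steering is the Llama/Gemma pattern.
print("relative change:", ((steered - base).norm() / base.norm()).item())
```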
The practical applications of these findings could redefine how we approach AI model training and deployment, particularly as we aim for more efficient and interpretable systems.
In sum, the dissociation between steerability and decodability in FVs offers a tantalizing glimpse into the future of AI control. As we unravel these complexities, the implications for AI's role in decision-making and problem-solving will be monumental.