Decoding the Hidden Language of Transformers: A Look into Polymorphism

A phenomenon known as polymorphism has been identified in transformers: independently trained models compute identical functions, but in residual-stream bases that differ by a uniformly random rotation. Despite this, a single matrix multiplication is enough to bring the models into alignment.

The Procrustes Solution

The alignment comes from an orthogonal Procrustes fit on a single batch of activations. The fitted rotation transfers sparse-autoencoder (SAE) feature dictionaries and steering vectors between independently trained models without any retraining. The finding also challenges the standard SAE universality metric, which is blind to the rotation.
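To make the idea concrete, here is a minimal sketch of an orthogonal Procrustes fit in NumPy. It is not the original experiment: the activations are synthetic Gaussian stand-ins, and names such as `procrustes_rotation`, `acts_a`, `acts_b`, and `steer_a` are hypothetical. It only shows that one SVD recovers the rotation between two bases and that a vector then transfers with a single multiplication.

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal R minimizing ||source @ R - target||_F (rows are activations)."""
    # The SVD of the cross-covariance gives the closed-form solution R = U @ Vt.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)
d_model, batch = 64, 2048

# Assumption: model B's residual stream is model A's stream seen through an
# unknown orthogonal change of basis, plus a little noise.
true_rotation, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
acts_a = rng.normal(size=(batch, d_model))
acts_b = acts_a @ true_rotation + 0.01 * rng.normal(size=(batch, d_model))

fitted = procrustes_rotation(acts_a, acts_b)

# A steering vector (or an SAE decoder direction) written in model A's basis
# moves into model B's basis with one matrix multiplication.
steer_a = rng.normal(size=d_model)
steer_b = steer_a @ fitted

rotation_error = np.linalg.norm(fitted - true_rotation) / np.linalg.norm(true_rotation)
print(f"relative error of fitted rotation: {rotation_error:.4f}")
```

The closed form `U @ Vt` is the standard solution to the orthogonal Procrustes problem, so no iterative optimization is needed for the fit.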

Decoder-column cosine similarity across seeds, the headline number for SAE universality, comes out at 98%. Yet an SAE trained on one seed reconstructs another seed's activations worse than predicting the constant mean, so a high similarity score does not by itself mean the dictionaries are interchangeable.
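"Worse than predicting the constant mean" has a precise meaning in terms of explained variance: the mean predictor scores exactly zero, so anything worse is negative. The sketch below is a toy illustration under that standard definition, not the SAE experiment itself. The data are synthetic, and the "misaligned" reconstruction is simply the correct data pushed through an unrelated random rotation.

```python
import numpy as np

def explained_variance(x, x_hat):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - residual / total

rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 32))

# Baseline: predict the per-dimension mean for every row. EV is zero.
ev_mean = explained_variance(x, np.broadcast_to(x.mean(axis=0), x.shape))

# Toy "wrong basis" reconstruction: the right content, expressed through
# an unrelated random orthogonal matrix (an assumption for illustration).
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
ev_misaligned = explained_variance(x, x @ rotation)

print(f"mean predictor EV: {ev_mean:.3f}, misaligned EV: {ev_misaligned:.3f}")
```

In this toy setting the misaligned reconstruction lands well below zero, because a perfect reconstruction in the wrong basis is nearly uncorrelated with the target.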

The Role of Rotation

The rotation matrix R is what repairs this: applying it restores reconstruction to within 0.025 explained variance (EV) of the within-seed ceiling. R is Haar-distributed, meaning it behaves like a uniformly random orthogonal matrix. At a model dimension of 512, its distance from the identity matrix agrees with the random-orthogonal prediction to within 0.1%.
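For a sense of what a random-orthogonal prediction looks like at dimension 512, the sketch below draws a Haar-distributed rotation and measures its Frobenius distance from the identity. The sampler `sample_haar_rotation` is a hypothetical helper, and the choice of Frobenius norm is an assumption; the source does not say which distance it uses.

```python
import numpy as np

def sample_haar_rotation(dim, rng):
    """Draw from the Haar measure on SO(dim) via QR with a sign fix."""
    q, r = np.linalg.qr(rng.normal(size=(dim, dim)))
    # Scale each column by the sign of r's diagonal; without this step
    # the QR output is not Haar-distributed.
    q = q * np.sign(np.diag(r))
    # Flip one column if needed so that the determinant is +1.
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0
    return q

rng = np.random.default_rng(0)
d_model = 512
rotation = sample_haar_rotation(d_model, rng)

# ||R - I||_F^2 = 2*d - 2*trace(R), and trace(R) has mean zero under Haar,
# so the predicted distance is about sqrt(2*d).
observed = np.linalg.norm(rotation - np.eye(d_model))
predicted = np.sqrt(2 * d_model)
relative_gap = abs(observed - predicted) / predicted
print(f"observed {observed:.3f}, predicted {predicted:.3f}")
```

Because the trace of a Haar matrix fluctuates only by order one, the relative gap shrinks as the dimension grows, which is why a tight match at 512 dimensions is plausible for a truly random rotation.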

The rotation matrix's eigenvalue spectrum is also consistent with Haar SO(d_model): a Kolmogorov-Smirnov test against that distribution gives a p-value near 1.000, so the test finds no evidence of a difference. Steering vectors transfer in one of three regimes, determined by their alignment with R's invariant subspace.
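A spectrum check of this kind can be sketched by comparing the eigenvalue angles of a rotation with a reference distribution. The example below is an approximation rather than the source's procedure: it uses a freshly sampled Haar rotation in place of a fitted R, and it takes the uniform distribution on [-pi, pi] as the reference, which is only the large-dimension approximation to the Haar SO(d) eigenangle density. It computes the Kolmogorov-Smirnov statistic by hand and reports no p-value.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256

# Haar sample on O(dim): QR of a Gaussian matrix with the standard sign fix,
# then flip one column if needed so the determinant is +1 (SO(dim)).
q, r = np.linalg.qr(rng.normal(size=(dim, dim)))
rotation = q * np.sign(np.diag(r))
if np.linalg.det(rotation) < 0:
    rotation[:, 0] *= -1.0

# Eigenvalues of a rotation lie on the unit circle; keep only their angles.
angles = np.sort(np.angle(np.linalg.eigvals(rotation)))

# One-sample KS statistic against the uniform distribution on [-pi, pi]:
# the largest gap between the empirical CDF and the reference CDF.
reference_cdf = (angles + np.pi) / (2.0 * np.pi)
ranks = np.arange(1, dim + 1)
ks_statistic = max(
    np.max(ranks / dim - reference_cdf),
    np.max(reference_cdf - (ranks - 1) / dim),
)
print(f"KS statistic vs uniform angles: {ks_statistic:.4f}")
```

A small statistic here means the eigenangles are spread evenly around the circle, which is the behaviour expected of a random rotation.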

A Universal Language?

What does this mean for machine learning? The results suggest that the rotation account also applies across training checkpoints within a single run, hinting at a deeper, shared language among transformers. However, without shared input/output (as in the Pythia models), all three transfer regimes collapse into a universally inverted state.

The result was validated on a 104k-parameter Dyck-3 transformer and nine independently trained Pythia-70m seeds, and it opens the door to further exploration at frontier scales of 10 billion parameters and beyond. It challenges us to reconsider how we understand model training and the alignment landscape of artificial intelligence.

The concept of polymorphism in transformers offers a fresh perspective on model interoperability. While the immediate practical applications might be limited, the theoretical implications are vast. It prompts us to ask: are we on the verge of uncovering a universal mathematical language that transcends model-specific architectures? The journey to answer this question is only beginning.
