Rethinking Transformers: The Orthogonality Dilemma
Transformers' model dimensions influence feature representation more than expected. A new framework challenges assumptions about their capacity.
Transformers have long dominated language models. Yet, few have considered how their model dimension truly impacts feature representation. This oversight might soon change.
The Orthogonality Insight
Model dimension, often denoted d_model, is a critical hyperparameter in these models. Yet how it sets geometric limits on feature representation is surprisingly underexplored. The premise here is that this architectural choice, not just the parameter count, bounds what a model can represent. The new study digs into the Linear Representation and Superposition Hypotheses. Together, these ideas suggest that models encode features as directions in a latent space, and that they fit more features than they have dimensions by keeping those directions only nearly orthogonal.
Why should this matter? Because understanding these orthogonal directions can fundamentally alter how we design and evaluate models. Strip away the marketing, and you get a clearer view of how many directions a model can truly support.
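To make the superposition idea concrete, here is a minimal Python sketch. It is not taken from the study: it assumes random unit vectors can stand in for feature directions, and the names d_model, n_features and active are illustrative choices. It packs twice as many directions as dimensions, adds up a few active ones, and reads every feature back with a dot product.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and scale it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
d_model = 256       # hypothetical latent dimension
n_features = 512    # more features than dimensions

# Assumption: random directions as a stand-in for learned feature directions.
directions = [random_unit_vector(d_model, rng) for _ in range(n_features)]

# Superpose a sparse set of active features into one hidden vector.
active = {3, 57, 200}
hidden = [sum(directions[i][k] for i in active) for k in range(d_model)]

# Read each feature back by projecting onto its direction.
readout = [dot(hidden, directions[i]) for i in range(n_features)]
min_active = min(readout[i] for i in active)
max_inactive = max(abs(readout[i]) for i in range(n_features) if i not in active)

print(f"weakest active readout:      {min_active:.3f}")
print(f"strongest inactive readout:  {max_inactive:.3f}")
```

Active features read out near 1, while inactive ones pick up only small interference. That interference is the price of imperfect orthogonality, and it grows as more directions are packed into the same space.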
Breaking Down the Metrics
The researchers developed a framework to estimate these directions. They used the embedding matrix as a proxy for the near-orthogonality constraints that hold across latent spaces. Essentially, they looked for the boundary between meaningful token relationships and incidental similarity. This yields a concrete estimate of a model's accepted deviation from perfect orthogonality, denoted ε.
Here's what the benchmarks actually show: applying this metric across various open-source models reveals two classes. Models with high ε lacked near-orthogonal structure. Models with low ε maintained it. This distinction is essential to understanding how models handle data and features.
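The article does not spell out how the study computes ε, so the following Python sketch is only one plausible stand-in. It assumes ε can be approximated by a high quantile of the absolute pairwise cosine similarity between embedding rows. The function name estimate_epsilon and the quantile parameter are hypothetical.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def estimate_epsilon(embeddings, quantile=0.99):
    """Return a high quantile of |cosine similarity| over all row pairs.

    Assumption: this quantile is a proxy for the accepted deviation from
    orthogonality. It is not the study's actual procedure.
    """
    unit = [normalize(row) for row in embeddings]
    sims = sorted(
        abs(sum(a * b for a, b in zip(u, v)))
        for u, v in combinations(unit, 2)
    )
    index = min(len(sims) - 1, int(quantile * len(sims)))
    return sims[index]


# Toy matrices: one perfectly orthogonal, one with two nearly parallel rows.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
crowded = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]

print(estimate_epsilon(orthogonal))
print(estimate_epsilon(crowded))
```

The comparison is quadratic in the number of rows, so a real vocabulary would need sampling or a vectorized implementation.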
Model Capacity Revisited
The study challenges conventional wisdom by showing that the standard Johnson-Lindenstrauss lemma underestimates the packing efficiency of trained representations. In its place, the authors derive an adjusted capacity formula, in which capacity depends on the ratio of vectors to dimensions rather than on the raw count. This tweak drastically reduces prediction error with no extra parameters.
Representational capacity, then, becomes an upper bound on the number of distinguishable directions in a model's latent space. It's sensitive to ε, and larger models surprisingly favor tighter orthogonality constraints over maximizing raw capacity. Could it be that bigger isn't always better? Perhaps these models prioritize stability over sheer scale.
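To see how sharply capacity reacts to ε, here is a rough Python sketch of the standard Johnson-Lindenstrauss baseline only. The study's adjusted formula is not given in this article and is not reproduced here. The sketch inverts one common statement of the lemma, d ≥ 4 ln(N) / (ε²/2 − ε³/3), to get a count N for a given dimension. Constants differ between statements of the lemma, and the function name jl_log_capacity is hypothetical.

```python
import math


def jl_log_capacity(d_model, eps):
    """Natural log of the vector count N allowed by a JL-style bound.

    Inverts d >= 4 * ln(N) / (eps**2 / 2 - eps**3 / 3), one common form
    of the Johnson-Lindenstrauss lemma. The log is returned so that large
    dimensions do not overflow a float.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return d_model * (eps ** 2 / 2 - eps ** 3 / 3) / 4


# Illustrative values only; these are not measurements from the study.
for eps in (0.05, 0.1, 0.2):
    log_n = jl_log_capacity(4096, eps)
    print(f"eps={eps}: ln(N) = {log_n:.2f}, N is about {math.exp(log_n):.3g}")
```

Because ε enters roughly quadratically inside an exponent, tightening it shrinks the bound by orders of magnitude. This is why a preference for small ε reads as giving up raw capacity.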
This orthogonality dilemma invites deeper questions. Why do larger models prefer these constraints? Is it about stability, or are there limits to the number of usable concepts? One thing's for sure: rethinking how we measure model capacity could reshape the entire field.
Key Terms Explained
Embedding: A dense numerical representation of data (words, images, etc.).
Hyperparameter: A setting you choose before training begins, as opposed to parameters the model learns during training.
Latent space: The compressed, internal representation space where a model encodes data.
Parameter: A value the model learns during training, specifically the weights and biases in neural network layers.