Rethinking Transformers: The Orthogonality Dilemma

Transformers have long dominated language modeling. Yet few have considered how their model dimension limits feature representation. That oversight might soon change.

The Orthogonality Insight

Model dimension, often denoted as d_model, is a critical hyperparameter in these models. But how it sets geometric limits on feature representation is surprisingly underexplored. The reality is that the architecture matters more than the parameter count. This new study digs into the Linear Representation and Superposition Hypotheses. These ideas suggest that models encode features as near-orthogonal directions in a latent space, which lets them hold more features than they have dimensions.
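To make "near-orthogonal directions" concrete, here is a minimal sketch that is not taken from the study. It draws random unit vectors and reports the largest pairwise overlap at a few dimensions. The function names, the vector count, and the seed are all illustrative assumptions.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(vectors):
    """Largest |cosine| over all distinct pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


def worst_overlap(dim, count=40, seed=0):
    """Worst pairwise overlap among `count` random directions in `dim` dimensions.

    `count` and `seed` are arbitrary illustrative choices, not values from the study.
    """
    rng = random.Random(seed)
    return max_abs_cosine([random_unit_vector(dim, rng) for _ in range(count)])


if __name__ == "__main__":
    for dim in (8, 64, 512):
        print(dim, round(worst_overlap(dim), 3))
```

The worst overlap should shrink as the dimension grows. That is the geometric room the Superposition Hypothesis relies on: a high-dimensional space can hold many directions that are almost, but not exactly, orthogonal.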

Why should this matter? Because understanding these orthogonal directions can fundamentally alter how we design and evaluate models. Strip away the marketing, and you get a clearer view of how many directions a model can truly support.

Breaking Down the Metrics

The researchers developed a framework to estimate these directions. They used the embedding matrix as a proxy for near-orthogonality constraints across latent spaces. Essentially, they looked at the boundary between meaningful token relationships and incidental similarity. This gives a concrete estimate of a model's accepted deviation from perfect orthogonality, denoted ε.
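The article does not spell out how the study locates that boundary, so the following is only a sketch under a loudly stated assumption: it treats ε as a high quantile of the absolute pairwise cosine similarities between embedding rows. The estimator name, the 0.95 quantile, and the toy matrix are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors (zero vectors would divide by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def estimate_epsilon(embeddings, quantile=0.95):
    """Hypothetical estimator: the given quantile of |cosine| over all row pairs.

    This is an assumed stand-in for the study's boundary between meaningful
    and incidental similarity, not the authors' actual procedure.
    """
    sims = sorted(
        abs(cosine(embeddings[i], embeddings[j]))
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
    )
    if not sims:
        raise ValueError("need at least two embedding rows")
    index = min(len(sims) - 1, int(quantile * len(sims)))
    return sims[index]


if __name__ == "__main__":
    # Toy 4 x 3 "embedding matrix": three orthogonal rows plus one overlapping row.
    toy = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
    print(estimate_epsilon(toy))        # high quantile picks up the overlapping row
    print(estimate_epsilon(toy, 0.5))   # median pair is exactly orthogonal
```

On a real vocabulary the all-pairs loop is quadratic in the number of tokens, so a practical version would sample pairs rather than enumerate them.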

Here's what the benchmarks actually show: applying this metric across various open-source models revealed two classes. Models with high ε lacked near-orthogonal structure. Models with low ε maintained it. This distinction is essential to understanding how models represent features.

Model Capacity Revisited

The study challenges the conventional wisdom by showing that the standard Johnson-Lindenstrauss lemma underestimates the packing efficiency of trained representations. Instead, they derived an adjusted capacity formula. This formula suggests the capacity depends on the ratio of vectors to dimensions rather than raw count. This tweak drastically reduces prediction error with no extra parameters.
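For reference, a textbook baseline in the spirit of Johnson-Lindenstrauss can be sketched as follows. This is not the study's adjusted formula, which the article does not state. The constant 1/4 comes from one simple random-construction argument, and other statements of the lemma use different constants. The function name is an assumption.

```python
import math


def log_capacity_baseline(d_model, epsilon):
    """Natural log of a simple lower bound on the number of unit vectors with
    pairwise |cosine| <= epsilon that fit in d_model dimensions.

    Argument (random sign vectors, Hoeffding's inequality, union bound):
      P(|<u, v>| > eps) <= 2 * exp(-d * eps**2 / 2) for one pair, so
      N = exp(d * eps**2 / 4) vectors satisfy every pairwise constraint
      with positive probability.
    """
    if d_model <= 0 or not 0.0 < epsilon < 1.0:
        raise ValueError("need d_model > 0 and 0 < epsilon < 1")
    return d_model * epsilon ** 2 / 4.0


if __name__ == "__main__":
    # Example dimensions and tolerances are arbitrary illustrative choices.
    for d_model in (768, 4096):
        for epsilon in (0.05, 0.1, 0.2):
            log10_n = log_capacity_baseline(d_model, epsilon) / math.log(10)
            print(f"d_model={d_model} eps={epsilon} log10(N)>={log10_n:.2f}")
```

Working in log space avoids overflow for large dimensions. At small ε this bound is very conservative, which is consistent with the study's claim that the standard lemma underestimates what trained representations achieve.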

Representational capacity, then, becomes an upper bound on the number of distinguishable directions in a model's latent space. It's sensitive to ε, and larger models surprisingly favor tighter orthogonality constraints over maximizing raw capacity. Could it be that bigger isn't always better? Perhaps these models prioritize stability over sheer scale.

This orthogonality dilemma invites deeper questions. Why do larger models prefer these constraints? Is it about stability, or are there limits to the number of usable concepts? One thing's for sure: rethinking how we measure model capacity could reshape the entire field.
