Do Transformers Really Need Positional Encoding?
Recent research challenges the long-held belief that positional encoding is vital for transformers. Is the sliding window mechanism enough to break permutation symmetry and enable universal computation?
For the longest time, the AI community has held on to the belief that positional encoding (PE) is a must-have for transformers to make sense of ordered sequences. But what if that foundational assumption is being upended? The latest research suggests just that, arguing that a sliding-window transformer can remain computationally universal without any positional encoding.
The Assumed Necessity of Positional Encoding
Traditionally, PE has been seen as the linchpin that allows transformers to handle sequence data without losing track of order. Without it, the thinking goes, a transformer can’t tell one arrangement of its context tokens from another, which would make it inadequate for tasks that require sequence understanding.
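To make that intuition concrete, here is a minimal sketch of a single attention readout with no positional encoding. It is not taken from the study: the embedding table `EMBED`, the helper names, and the choice to reuse each vector as both key and value are assumptions made purely for illustration. The point is only that shuffling the context leaves the output unchanged.

```python
import math

def attend(query, context):
    """Single-head softmax attention with no positional encoding.

    Each context vector acts as both key and value (a simplification
    assumed for this sketch).
    """
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in context]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(query)
    return [
        sum(w * vec[d] for w, vec in zip(weights, context)) / total
        for d in range(dim)
    ]

# Hypothetical toy embeddings; any fixed token-to-vector table would do.
EMBED = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}

def summarize(tokens, query_token="a"):
    """Attend from one query token over a list of context tokens."""
    return attend(EMBED[query_token], [EMBED[t] for t in tokens])

out_1 = summarize(["a", "b", "c", "b"])
out_2 = summarize(["b", "c", "b", "a"])  # same tokens, different order

# The weighted sum does not care about the order of its terms.
same = all(math.isclose(x, y) for x, y in zip(out_1, out_2))
print(same)
```

Because the output is a weighted sum over the context, any reordering of the same tokens gives the same result up to floating-point rounding. That is the permutation symmetry PE is usually brought in to break.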
But hold your horses. This new study makes a compelling case that the sliding window mechanism, which already exists in many models, may be enough on its own to break permutation symmetry to the degree these computations require.
Meet the HIST Model
Enter the HIST model, an abstract autoregressive model that takes advantage of a sliding context window. The researchers prove it's Turing complete, meaning it can, in principle, carry out any computation a general-purpose computer can. The kicker? It doesn’t need positional encoding to achieve this feat.
The HIST model works from just two things: a constant-size internal state and the token-count histogram of the current window. The setup also reveals the token that has just exited the window, and that is enough information to simulate Post machines, which are themselves Turing complete. That’s significant because it shows the sliding mechanism is more than a memory limit. It breaks symmetry and adds expressiveness.
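A small sketch can show why a sliding window leaks order information even when only token counts are visible. This is an illustration under stated assumptions, not the study's formal construction: `SlidingHistogram`, `evicted_from_histograms`, and `demo` are hypothetical names, and recovering the exited token by comparing consecutive histograms is just one plausible way to model "revealing" it.

```python
from collections import Counter, deque

class SlidingHistogram:
    """Fixed-size window that exposes only per-token counts."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def push(self, token):
        """Append a token; return the token that left the window, or None."""
        self.window.append(token)
        self.counts[token] += 1
        evicted = None
        if len(self.window) > self.window_size:
            evicted = self.window.popleft()
            self.counts[evicted] -= 1
            if self.counts[evicted] == 0:
                del self.counts[evicted]
        return evicted

def evicted_from_histograms(before, appended, after):
    """Recover the exited token from two consecutive histograms."""
    expected = Counter(before)
    expected[appended] += 1
    # Counter subtraction keeps only positive counts: whatever is left
    # over is the token that dropped out of the window.
    leftover = expected - Counter(after)
    return next(iter(leftover), None)

def demo(prefix, appended, window_size=3):
    tracker = SlidingHistogram(window_size)
    for token in prefix:
        tracker.push(token)
    before = Counter(tracker.counts)
    actual = tracker.push(appended)
    inferred = evicted_from_histograms(before, appended, tracker.counts)
    return before, actual, inferred

# "aab" and "baa" have identical histograms but different orderings.
before_1, actual_1, inferred_1 = demo("aab", "c")
before_2, actual_2, inferred_2 = demo("baa", "c")
print(before_1 == before_2, inferred_1, inferred_2)
```

The two windows look identical as histograms, yet the token that falls out when the window advances differs between them. Reading from the front while writing to the back is queue-like behavior, which fits the article's point that this is enough to simulate a Post machine.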
In Practice: A New Kind of Transformer
The researchers didn’t just stop at the abstract model. They went ahead and built a sliding-window transformer over a constant-size token alphabet, skipping the PE altogether. And guess what? It successfully simulated the HIST model. That gives real backing to a different way of thinking about transformer design.
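For readers unfamiliar with the term, a sliding-window attention pattern can be sketched as a simple mask. This is the generic notion only; the article does not describe the exact architecture used in the study, and `sliding_window_mask` and `render` are hypothetical helper names.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j.

    Each position sees itself and at most `window - 1` earlier positions.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]

def render(mask):
    """Draw the mask as text: 'x' for visible, '.' for hidden."""
    return ["".join("x" if ok else "." for ok in row) for row in mask]

rows = render(sliding_window_mask(seq_len=5, window=3))
for row in rows:
    print(row)
```

No position signal is added to the token embeddings here. The only ordering effect comes from which tokens fall inside the window at each step, and that is the effect the study builds on.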
Why should this matter to you? Because it’s a sharp reminder that the AI models we often take as gospel aren’t set in stone. They’re evolving, and sometimes they don’t need all the bells and whistles we’ve been told are indispensable.
Why This Matters
So, what should we take away from this? First, let’s remember to question what’s considered a given. If positional encoding isn’t the necessity it was cracked up to be, what other AI staples might be ripe for re-evaluation? And if the sliding window can indeed break permutation symmetry, why aren't more companies at least asking whether they can ditch PE? The result is about what such models can compute in principle, not a measured gain in performance or efficiency, but it makes the question worth asking.
This research challenges us to reassess our technological assumptions. It’s time to align our AI tools with what actually works on the ground.
Key Terms Explained
Autoregressive model: A model that generates output one piece at a time, with each new piece depending on all the previous ones.
Context window: The maximum amount of text a language model can process at once, measured in tokens.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Positional encoding: Information added to token embeddings to tell a transformer the order of elements in a sequence.