Why Transformers Can't Stop Eyeing the First Word
Transformers like GPT-2 have a fixation on the first token. For all their architectural complexity, this 'attention sink' is a common flaw.
Transformers, those mighty engines behind AI models like GPT-2, have a curious quirk: an 'attention sink' that pulls a disproportionate share of attention onto the first position in a sequence. It's like a bettor who always backs the same horse, no matter the odds. But why does this happen?
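You can see the sink for yourself in a few lines of Python. Here's a minimal sketch, assuming the Hugging Face transformers library and the standard "gpt2" checkpoint; the prompt is arbitrary:

```python
# Measure how much attention each layer puts on the first token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer.
# Skip query position 0, which attends to itself by construction.
for layer_idx, attn in enumerate(out.attentions):
    sink = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer_idx:2d}: mean attention on first token = {sink:.3f}")
```

In GPT-2-style models, run this and you'll typically see the first token soaking up far more than its fair share of attention in most layers.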
Decoding the Obsession
To get to the bottom of this, researchers took a deep dive into GPT-2-style models. They examined learned query biases, absolute positional embeddings, and the first-layer transformation of the positional encoding, and they ran causal interventions for good measure. The result? The attention sink emerges from how these components interact. But here's the kicker: each piece is individually dispensable. Remove any one component and the sink still shows up. It's like a hydra: cut off one head, and another takes its place.
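The paper's interventions are more careful than this, but a rough version is easy to try at home. Below is a hedged sketch of one such ablation, assuming Hugging Face's GPT-2 layout, where c_attn packs the query, key, and value projections into one matrix: zero every layer's learned query bias, then re-measure the sink.

```python
# One causal-intervention sketch: zero the learned query biases.
# In HF GPT-2, c_attn.bias packs [query | key | value] biases.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")

n_embd = model.config.n_embd  # 768 for gpt2
with torch.no_grad():
    for block in model.transformer.h:
        # First n_embd entries of the packed bias are the query bias.
        block.attn.c_attn.bias[:n_embd].zero_()

    out = model(**inputs, output_attentions=True)

# If the redundancy finding holds, substantial attention mass should
# still land on the first token even with the query bias gone.
for layer_idx, attn in enumerate(out.attentions):
    sink = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer_idx:2d}: sink after ablation = {sink:.3f}")
```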
The Anatomy of a Sink
So what's really driving this? The researchers pinpointed three key culprits: the learned query bias, the first-layer MLP transformation, and structure in the key projection. Yet the fact that each element is individually unnecessary suggests multiple pathways for the sink to emerge. Think of it as a traffic jam caused not by a single broken-down car but by a complex interplay of factors. No single-component benchmark captures what matters most here: the redundancy itself, and its real-world impact on how these architectures behave.
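To make one of those pathways concrete, here's a back-of-the-envelope probe of how the learned query bias interacts with the key projection in layer 0. This decomposition of the attention logit into a bias-driven slice is an illustration under the Hugging Face GPT-2 layout, not the paper's exact analysis:

```python
# Score each key position against the per-head query bias of layer 0.
# A spike at position 0 would point to the query-bias/key-structure pathway.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")

cfg = model.config
attn0 = model.transformer.h[0].attn
W, b = attn0.c_attn.weight, attn0.c_attn.bias  # packed [q | k | v]
n, h = cfg.n_embd, cfg.n_head
head_dim = n // h
q_bias = b[:n].detach().view(h, head_dim)  # per-head learned query bias

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    x0 = out.hidden_states[0]               # token + positional embeddings
    ln_x = model.transformer.h[0].ln_1(x0)  # GPT-2 is pre-LayerNorm
    qkv = ln_x @ W + b
    keys = qkv[..., n:2 * n].reshape(1, -1, h, head_dim)

# Bias-driven slice of the attention logit, per head and key position.
scores = torch.einsum("hd,bthd->bht", q_bias, keys) / head_dim ** 0.5
print(scores[0, :, :5])  # each head's first five key positions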
Why Should You Care?
Now, why should you care about a bunch of attention sinks in AI models? Because they're not just quirks: a sink skews how a model distributes attention across everything else in a sequence, which can bias what it treats as important in the data it processes. Whose data? Whose labor? Whose benefit? These aren't just academic questions; they're about real impacts on real people. Imagine AI editors that keep misplacing your article's lead because they can't stop looking at the wrong line.
It's time to ask ourselves some hard questions. Why do we keep building systems with known flaws? And what does that say about oversight in AI development? The paper buries its most important finding in the appendix, but that finding is shouting at us: there's something deeply systemic here that needs fixing.