Depth-Attention: A Smarter Way to Boost Transformer Models

By Nadia Okoro, June 4, 2026

Depth-Attention enhances Transformer performance by improving how layers interact. It achieves lower perplexity and higher accuracy without extra parameters.

Depth-Attention is a new approach that changes how the layers of a Transformer interact. In the standard architecture, each layer's output is simply added to the residual stream, so later layers cannot selectively reuse representations from earlier layers. Depth-Attention addresses this by building that selection into the attention module itself.

Why Depth-Attention Matters

Self-attention selects information dynamically across the sequence, but the same flexibility isn't available across the depth of the model. That's where Depth-Attention steps in: each layer attends over earlier layers' keys at the same token position. In effect, earlier layers' values are mixed into the value that self-attention reads, replacing a fixed additive combination with a content-dependent selection.
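To make the idea concrete, here is a minimal NumPy sketch of one way such depth-wise mixing could work. It is not the authors' implementation. The function names, the single-head layout, the use of the current layer's query to score earlier layers' keys, and the softmax over depth are all assumptions made for illustration; the article only states that layers attend over earlier layers' keys at the same token position and that values are mixed into the value self-attention reads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depth_mix_values(q, keys, values):
    """Mix values across depth at each token position (hypothetical helper).

    q:      (T, d)    queries of the current layer
    keys:   (L, T, d) keys of layers 1..L (the last one is the current layer)
    values: (L, T, d) values of layers 1..L
    Returns the mixed values (T, d) and the depth weights (T, L).
    """
    d = q.shape[-1]
    # scores[t, l] = q[t] . keys[l, t] / sqrt(d): same token, different layers.
    scores = np.einsum("td,ltd->tl", q, keys) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # mixed[t] = sum over layers l of weights[t, l] * values[l, t]
    mixed = np.einsum("tl,ltd->td", weights, values)
    return mixed, weights

def causal_self_attention(q, k, v):
    # Ordinary single-head causal attention across the sequence.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ v

# Toy usage with random tensors: 3 layers, 5 tokens, head dimension 8.
rng = np.random.default_rng(0)
num_layers, num_tokens, head_dim = 3, 5, 8
q = rng.standard_normal((num_tokens, head_dim))
keys = rng.standard_normal((num_layers, num_tokens, head_dim))
values = rng.standard_normal((num_layers, num_tokens, head_dim))

mixed_v, depth_weights = depth_mix_values(q, keys, values)
# Self-attention then reads the depth-mixed values instead of the raw ones.
output = causal_self_attention(q, keys[-1], mixed_v)
```

Note that this sketch reuses the queries, keys, and values the model already computes, which is consistent with the article's claim that no new parameters are introduced.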

Performance Metrics Speak Volumes

Here's what the benchmarks show: on Qwen3-style decoders with 1.5B and 3B parameters, Depth-Attention reduces perplexity while boosting average downstream accuracy, outperforming the vanilla Transformer by up to 2.3 accuracy points. Notably, it also surpasses strong cross-layer baselines in both perplexity and average accuracy. It does so while adding no more than 0.01% extra arithmetic FLOPs and no persistent inference state beyond the standard key-value cache.

These gains aren't limited to larger models. From 360M to 3B parameters, Depth-Attention consistently delivers improved results, which suggests that how layers are wired together can matter as much as how many parameters a model has.

A Leap Forward or Just a Tweak?

This innovation doesn't introduce new parameters or persistent inference states, keeping the cache size equivalent to a vanilla decoder. It challenges the notion that more parameters are always better. But does this mean Depth-Attention is the future of Transformer models? It's a strong contender.
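For a sense of what "cache size equivalent to a vanilla decoder" means in practice, the standard key-value cache footprint can be estimated with the short sketch below. The formula is the usual one for decoder-only models (keys plus values, per layer, per KV head, per token). The configuration values in the usage line are hypothetical and are not taken from the article or from any specific Qwen3 model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Approximate size of a standard key-value cache in bytes.

    The factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit storage (an assumption, not a given).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)

# Hypothetical configuration, chosen only for illustration.
size = kv_cache_bytes(num_layers=28, num_kv_heads=8, head_dim=128, seq_len=4096)
print(f"{size / 2**20:.1f} MiB")
```

According to the article, Depth-Attention keeps this footprint unchanged, since it adds no persistent inference state beyond the standard cache.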

Strip away the marketing and you get an elegant solution to a longstanding problem. The numbers suggest that efficiency and performance can coexist without bloating the model with extra parameters or inference state. Depth-Attention might just redefine how we approach improving model accuracy and efficiency.
