CART: Rethinking Transformer Efficiency with Context-Anchored Recurrence

The pursuit of more efficient transformers has taken a new turn with the introduction of CART, the Context-Anchored Recurrent Transformer. CART's defining choice is to reuse a single shared core block across its depth. This sets it apart from traditional looped transformers, which recompute key-value tensors at every loop step, a costly use of computational resources.

The Mechanics of CART

CART computes its key and value tensors just once, from a multi-layer prelude. The recurrent core then reads them through multi-head latent attention on every loop instead of recomputing them. A Linear Time-Invariant (LTI) gate keeps the recurrence stable by holding the spectral radius within a tight range (0.79 to 0.83), and this stability held across all 36 fully trained configurations.
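
The compute-once, reuse-many pattern described above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not CART's implementation: the function names (compute_kv, attend, run_core) are hypothetical, single-head softmax attention stands in for multi-head latent attention, and the gate is assumed to be a fixed per-channel decay, which is one simple way to get a linear time-invariant update whose spectral radius is just its largest decay value.

```python
import math

def compute_kv(context):
    """Hypothetical prelude stand-in: derive keys and values once from the context."""
    keys = [list(vec) for vec in context]
    values = [[0.5 * x for x in vec] for vec in context]
    return keys, values

def attend(query, keys, values):
    """Single-head softmax attention of one query vector over cached keys/values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def run_core(state, keys, values, gate, loops):
    """Apply the same shared core `loops` times, reusing the cached keys/values."""
    for _ in range(loops):
        update = attend(state, keys, values)
        # Assumed diagonal LTI gate: the same fixed decay per channel at every loop.
        state = [a * h + (1.0 - a) * u for a, h, u in zip(gate, state, update)]
    return state

def spectral_radius_diagonal(gate):
    """For a diagonal gate, the spectral radius is the largest absolute entry."""
    return max(abs(a) for a in gate)

context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys, values = compute_kv(context)   # computed once, before the loop
gate = [0.79, 0.83]                  # decay values chosen inside the reported range
final_state = run_core([0.2, -0.1], keys, values, gate, loops=6)
radius = spectral_radius_diagonal(gate)
```

Because every gate entry is below 1, each loop is a convex blend of the previous state and a bounded attention output, so the state cannot blow up no matter how many times the core is applied.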

Evaluations were conducted on single consumer GPUs, first with a 64-configuration screen at 3,000 steps, then with a more focused run of 36 configurations trained on approximately one billion tokens. The results show an intriguing trend: in the short screen, prelude depth (P) matters more than loop count (R), but this ordering reverses when training is extended. The reversal is most pronounced at the largest width, 1024, where R=6 comes out ahead.

The Limits of Efficiency

Despite its innovations, CART faces a significant hurdle at its largest configurations. Compared against a parameter-matched dense baseline, it falls short by 1-2% at stored-parameter parity and by a wider margin of ~10% at effective-parameter parity. This raises a question: has CART truly found the sweet spot in transformer efficiency, or is it compromising too much?
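
The two parity notions above are easy to conflate, so a small arithmetic sketch may help. It assumes that "stored" parameters count each unique weight once, while "effective" parameters count the shared core once per loop, as if the recurrence were unrolled; the article does not define the terms, so treat this reading, the function names, and the parameter counts (in millions) as hypothetical.

```python
def stored_params(prelude, core, coda):
    """Unique weights kept in memory: the shared core is counted once."""
    return prelude + core + coda

def effective_params(prelude, core, coda, loops):
    """Weights applied per forward pass: the shared core is counted once per loop."""
    return prelude + core * loops + coda

# Hypothetical component sizes in millions of parameters, with R = 6 loops.
stored = stored_params(prelude=4, core=2, coda=1)
effective = effective_params(prelude=4, core=2, coda=1, loops=6)
```

Under this reading, a dense baseline matched on stored parameters is much smaller than one matched on effective parameters, which would be consistent with the gap widening from 1-2% to ~10% between the two comparisons.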

Further diagnostics split the roughly 10% gap into two parts: about 5% is attributed to weight sharing, and another 5% to the heterogeneous structure of prelude, anchor, core, and coda. The components designed to enhance the recurrent core (hyper-connections, the LTI gate, and the loop-index embedding) appear less effective than anticipated.

Looking Forward

What does this mean for AI infrastructure? CART's approach could redefine efficiency if these challenges can be overcome. Yet, its current limitations at scale hint that the infrastructure, rather than the model, might be the real bottleneck. Can future iterations of CART or similar models break this barrier?

As the AI community continues to push towards more efficient models, it's clear that innovations like CART offer valuable insights. But the unit economics break down at scale, reminding us that the quest for efficiency often comes with trade-offs.
