CART: Rethinking Transformer Efficiency with a Recurrence Twist

The AI-AI Venn diagram is getting thicker with CART (Context-Anchored Recurrent Transformer), a fresh take on language models that aims to reshape how we think about inference and efficiency. But does it really measure up?

Redefining the Transformer

Traditional transformers are known for their large parameter counts, but CART flips the script. By reusing a shared core block multiple times across depth, it attempts to trim the fat without cutting corners. The key innovation is the shift from recomputing key-value tensors at every iteration to computing them once, then letting the recurrent core cross-attend to them through multi-head latent attention. It's a bold move, one that could redefine efficiency in language models.
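To make the compute-once idea concrete, here is a minimal sketch in plain Python. It is not CART's implementation: it swaps in single-head dot-product attention for multi-head latent attention, and every name in it (`run_shared_core`, `W_k`, `W_v`, `W_core`, the tanh update) is a hypothetical stand-in chosen for illustration. What it shows is the shape of the loop: keys and values are projected from the anchor once, and the same core weights are reused on every iteration.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(h, keys, values):
    """Single-head dot-product attention of one state vector over cached keys/values."""
    d = len(h)
    scores = [sum(a * b for a, b in zip(h, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

def run_shared_core(anchor, h0, W_k, W_v, W_core, n_loops):
    # Keys and values are projected from the anchor once, before the loop.
    keys = [matvec(W_k, a) for a in anchor]
    values = [matvec(W_v, a) for a in anchor]
    kv_passes = 1
    h = h0
    for _ in range(n_loops):
        ctx = cross_attend(h, keys, values)
        # The same W_core is applied on every iteration (weight sharing).
        mixed = [a + b for a, b in zip(h, ctx)]
        update = [math.tanh(u) for u in matvec(W_core, mixed)]
        h = [a + b for a, b in zip(h, update)]
    return h, kv_passes

def rand_matrix(d):
    return [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]

random.seed(0)
d = 4
anchor = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
h0 = [0.0] * d
h_final, kv_passes = run_shared_core(
    anchor, h0, rand_matrix(d), rand_matrix(d), rand_matrix(d), n_loops=6
)
```

In this toy setup the key-value projection runs once regardless of `n_loops`, whereas a design that recomputed it per iteration would pay that cost six times.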

However, this isn't just about reducing parameters. A Linear Time-Invariant (LTI) gate keeps the recurrence stable, with a spectral radius that falls between 0.79 and 0.83 across all 36 fully trained configurations. Because that value stays below 1, repeated passes through the gate shrink perturbations rather than amplify them. It's a neat trick to maintain stability, but the question remains: at what cost?
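Why a spectral radius below 1 implies stability can be seen in a toy linear recurrence. The sketch below assumes a diagonal gate, h_next = a * h + u applied elementwise, which is an illustrative simplification and not necessarily the form CART uses. The gate values are hypothetical, picked from the reported 0.79 to 0.83 range.

```python
def spectral_radius_diag(a):
    """For a diagonal matrix, the spectral radius is the largest |entry|."""
    return max(abs(x) for x in a)

def iterate_gate(a, u, steps):
    """Run h <- a * h + u elementwise, starting from zero."""
    h = [0.0] * len(a)
    for _ in range(steps):
        h = [ai * hi + ui for ai, hi, ui in zip(a, h, u)]
    return h

# Hypothetical gate entries inside the reported range, with a fixed input.
a = [0.79, 0.81, 0.83]
u = [1.0, -0.5, 2.0]

rho = spectral_radius_diag(a)
h_long = iterate_gate(a, u, steps=200)

# With |a_i| < 1 the recurrence converges to u_i / (1 - a_i).
fixed_point = [ui / (1.0 - ai) for ai, ui in zip(a, u)]
max_error = max(abs(x - y) for x, y in zip(h_long, fixed_point))
```

With every entry below 1, the state settles at a finite fixed point however many times the loop runs. An entry at or above 1 would let the state grow without bound for a nonzero input.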

Testing the Waters

CART's evaluation took place on single consumer GPUs in two stages. First, 64 configurations were screened for 3,000 steps each. Then 36 configurations (with varying parameters) were trained in full for 30,500 steps, roughly 1 billion tokens. This two-stage approach revealed intriguing patterns. For instance, prelude depth often mattered more than loop count, flipping assumptions about efficiency and performance at scale.
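The step and token figures can be sanity-checked with a little arithmetic. The sketch below uses only the numbers quoted above. The 32,768 tokens-per-step value is an assumption introduced here, chosen because it is the nearest power of two to the implied figure. The article does not state the actual batch size or sequence length.

```python
screen_configs, screen_steps = 64, 3_000
full_configs, full_steps = 36, 30_500
tokens_per_run = 1_000_000_000  # "roughly 1 billion tokens"

# Tokens per optimizer step implied by the quoted figures.
tokens_per_step = tokens_per_run / full_steps

# Total optimizer steps spent in each stage across all configurations.
total_screen_steps = screen_configs * screen_steps
total_full_steps = full_configs * full_steps

# Assumption: a power-of-two token budget per step (hypothetical, not from the source).
assumed_tokens_per_step = 32_768
implied_tokens = full_steps * assumed_tokens_per_step

print(round(tokens_per_step))   # about 32,787 tokens per step
print(total_screen_steps)       # 192,000 screening steps in total
print(total_full_steps)         # 1,098,000 full-training steps in total
print(implied_tokens)           # 999,424,000 tokens under the assumption
```

Under that assumption, the 30,500-step runs land within 0.1% of a billion tokens, which is consistent with the rounded figure in the text.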

Yet the real test came at the binding width of 1024, where CART faced off against a parameter-matched dense baseline. Here, CART stumbled, trailing by 1-2% at stored-parameter parity and by about 10% at effective-parameter parity. It suggests that while innovation is evident, the execution might need refinement.
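The two parity notions differ in how they count the shared core. The sketch below assumes one plausible reading, not confirmed by the source: stored parameters count each unique weight once, while effective parameters count the shared core once per loop iteration, as if it were unrolled. The block counts and per-block size are hypothetical.

```python
def parameter_counts(prelude_blocks, core_blocks, coda_blocks, loops, params_per_block):
    """Return (stored, effective) parameter counts under the assumed definitions."""
    # Stored: every unique block is counted once, however often it is reused.
    stored = (prelude_blocks + core_blocks + coda_blocks) * params_per_block
    # Effective: the shared core is counted once per loop iteration.
    effective = (prelude_blocks + core_blocks * loops + coda_blocks) * params_per_block
    return stored, effective

# Hypothetical configuration: 2 prelude blocks, a 4-block core looped 3 times, 2 coda blocks.
stored, effective = parameter_counts(
    prelude_blocks=2, core_blocks=4, coda_blocks=2, loops=3, params_per_block=1_000_000
)
print(stored)     # 8000000
print(effective)  # 16000000
```

Under this reading, a dense baseline matched on stored parameters is much smaller than one matched on effective parameters, which would explain why the gap widens from 1-2% to about 10% between the two comparisons.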

Gaps and Gains

Diagnostic ablations offered insights into where CART fell short. The effective-parameter gap of ~10% broke down into ~5% from weight sharing and another ~5% from the complex prelude/anchor/core/coda framework. Surprisingly, the recurrent-core machinery, including hyper-connections and the LTI gate, proved to be more ornamental than functional.

Perhaps the most telling finding was the performance degradation beyond the trained depth range, casting doubt on CART's flexibility for test-time depth scaling. If agents have wallets, who holds the keys to their optimization?

CART represents a promising convergence of ideas in AI models, but it's not without its challenges. The compute layer needs a payment rail, and while CART might not be it yet, it opens the door to new possibilities.
