MUD Optimizer: A Leap Forward in Transformer Training
MUD offers a new approach to momentum optimization in transformers, reportedly cutting wall-clock training time substantially. It could be a breakthrough for large-scale language models like GPT-2.
Optimizing the training of transformers, the powerhouses behind modern language models, remains a significant challenge. Muon, the orthogonalized-momentum optimizer touted for its innovative use of an approximate polar decomposition, now finds a formidable contender in MUD (MomentUm Decorrelation).
Introducing MUD
MUD emerges as a fresh approach to whitening momentum updates in transformer training. Rather than relying on the heavy machinery of Muon's polar decomposition, MUD employs a triangular, Cholesky-like whitening surrogate, drawing inspiration from the classical methods of Gram-Schmidt and Gauss-Seidel.
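To make the idea concrete, here is a minimal sketch of what a triangular, Cholesky-like whitening of a momentum matrix could look like. This is a hypothetical illustration, not the paper's actual algorithm: the function name `mud_whiten`, the Gram-matrix formulation, and the `eps` regularizer are all assumptions; MUD's exact recurrence (and its Gram-Schmidt/Gauss-Seidel-style variant) may differ.

```python
import numpy as np

def mud_whiten(momentum: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Whiten a 2-D momentum matrix M via a triangular surrogate.

    Factor the (regularized) row Gram matrix M M^T = L L^T with a
    Cholesky decomposition, then return W = L^{-1} M, whose rows are
    approximately orthonormal. Illustrative sketch only.
    """
    m = momentum.shape[0]
    # Row covariance of the update, regularized for numerical safety.
    gram = momentum @ momentum.T + eps * np.eye(m)
    L = np.linalg.cholesky(gram)  # lower-triangular factor
    # One factorization plus one triangular solve replaces the repeated
    # dense matrix multiplications of an iterative polar decomposition.
    # (np.linalg.solve is used for brevity; a dedicated triangular
    # solver would exploit the structure of L.)
    return np.linalg.solve(L, momentum)
```

Because `W W^T = L^{-1}(L L^T - eps*I)L^{-T} ≈ I`, the whitened update has (nearly) decorrelated rows, which is the same effect Muon achieves through orthogonalization, obtained here via a single triangular factorization.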
What's noteworthy here is the efficiency. MUD doesn't just tweak the existing methodology. It redefines it by sidestepping the multiple large matrix multiplications its predecessors typically require, minimizing overhead and reducing its dependence on raw matrix-multiply throughput.
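For contrast, the orthogonalization step that MUD sidesteps is typically an iterative polar-decomposition approximation. The sketch below follows the quintic Newton-Schulz iteration widely circulated in Muon implementations; the specific coefficients are an assumption drawn from that public code, and the step count is illustrative. Note the three large matrix multiplications per iteration, repeated for several iterations, versus MUD's single triangular factorization.

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximate the polar (orthogonal) factor of G.

    Quintic Newton-Schulz iteration of the kind used by Muon; the
    coefficients (a, b, c) are taken from commonly circulated Muon
    code and are an assumption here, not from the MUD write-up.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize so all singular values start in (0, 1].
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T          # large matmul 1
        B = b * A + c * (A @ A)  # large matmul 2
        X = a * X + B @ X    # large matmul 3
    return X
```

Each iteration costs three dense matmuls, so five iterations cost fifteen; this is the overhead MUD's triangular surrogate is designed to avoid.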
Performance Gains and Practical Implications
In practical terms, MUD consistently improves time-to-perplexity by 10-50% over both AdamW and Muon. That is no small feat. Imagine training a GPT-2-scale language model on an A100 GPU: MUD lifts peak tokens per second by 1.3 to 2.6 times, in some settings nearly tripling throughput.
Why should this matter to you? If you're developing or refining large-scale language models, the time you save can be reinvested in more iterations, leading to better-tuned models and faster deployment. For those in the trenches of AI development, every second counts.
A Strong Performance in Protein Language Models
MUD doesn't just shine in the field of generic language models. It's demonstrated its prowess in niche applications, such as training an ESM-2 150M protein language model. Here, MUD achieves Muon-level validation perplexity in significantly less wall-clock time. This opens doors for faster and more efficient biological model training, which can accelerate breakthroughs in protein understanding and drug discovery.
The Future of Transformer Training
Color me skeptical, but I've seen this pattern before. New optimizers often promise the moon but falter in practical applications. However, MUD's blend of classical methodologies with modern demands presents a compelling case for its adoption. The reduction in computational overhead means more researchers can access high-performance training, democratizing AI advancements.
So, the question is: Will MUD become the new standard in transformer training? Given its impressive performance metrics and reduced time demands, it certainly has the potential to reshape the landscape. Let's apply some rigor here and keep a close watch on its adoption and performance in production environments.