Muon Optimizer: The Hidden Power Laws of Large Language Models

Optimizers built on orthonormalized updates have taken the spotlight as a leading choice for training large language models. Among them, Muon stands out for its adoption by open-source, state-of-the-art models. But what's happening under the hood? Muon uses the Newton-Schulz (NS) iteration to orthonormalize its updates, and that iteration is only an approximation.
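For readers who want the mechanics, here is the NS iteration in the form commonly used for Muon. Treat it as a sketch rather than the paper's exact setup: the quintic form and the coefficients (a, b, c) come from the public Muon reference implementation, G stands for the momentum matrix, and the article does not say which coefficients the paper evaluates.

```latex
X_0 = \frac{G}{\lVert G \rVert_F}, \qquad
X_{k+1} = a\,X_k + b\,(X_k X_k^\top)\,X_k + c\,(X_k X_k^\top)^2 X_k,
\qquad (a, b, c) = (3.4445,\ -4.7750,\ 2.0315)
```

Writing X_k = U diag(σ) V^T, each step leaves U and V untouched and maps every singular value by σ → aσ + bσ^3 + cσ^5. After a fixed number of steps the singular values are pushed toward the neighborhood of 1 rather than exactly to 1, which is why the result is only approximately orthonormal.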

The Power Behind Muon

Muon applies NS iterations to the momentum matrix at every step. Yet how the singular value spectrum of these momentum matrices behaves during training has been poorly understood, especially as model sizes scale from 77 million to 2.8 billion parameters. The data shows that the quantiles of the singular value spectrum stabilize quickly after a short burn-in period. Crucially, the values they settle at depend on both layer type and model size, and they follow neat power laws in model size with layer-dependent exponents. That's the kind of pattern that can guide how much computation the optimizer actually needs.
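To make "follows a power law" concrete, here is a minimal sketch of how an exponent could be estimated from measurements at several model sizes, using a least-squares line in log-log space. Everything numeric in it is hypothetical: the intermediate model sizes, the prefactor 0.5, and the synthetic values are invented for illustration and are not the paper's data. Only the 77M and 2.8B endpoints and the -0.25 exponent echo figures quoted in this article.

```python
import math


def fit_power_law(sizes, values):
    """Fit values ~ c * sizes**alpha by least squares in log-log space.

    Returns (c, alpha). `fit_power_law` is a hypothetical helper, not an
    API from the paper or from any library.
    """
    xs = [math.log(m) for m in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                        # slope of the log-log line
    c = math.exp(mean_y - alpha * mean_x)    # intercept, mapped back from logs
    return c, alpha


# Synthetic, noise-free stand-in for "stabilized quantile vs. model size".
# The two middle sizes and the prefactor 0.5 are assumptions for illustration.
sizes = [77e6, 350e6, 1.3e9, 2.8e9]
values = [0.5 * m ** -0.25 for m in sizes]

c, alpha = fit_power_law(sizes, values)
print(f"fitted exponent: {alpha:.3f}, prefactor: {c:.3f}")
```

With real measurements the points would scatter around the line, and a separate fit per layer type would give the layer-dependent exponents the article describes.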

NS Iterations: A Double-Edged Sword

Mid-late layers scale gently with model size M, at around M^{-0.25}, which lets the standard five-step NS configuration remain effective even at larger scales. Late layers tell a different story. They scale aggressively, up to M^{-0.96}. Without adjustments, these layers risk falling into the NS failure regime at the frontier scale. Does this mean more NS iterations or better-tuned coefficients are the solution? Perhaps. But NS iterations are computationally costly at such scales. The power laws from this study offer practitioners a practical way to choose the minimum NS configuration needed, ensuring efficiency without compromising update quality.
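The trade-off above can be sketched numerically. Because an NS step acts on each singular value independently, a scalar recurrence is enough to see when five steps fall short. The sketch below rests on several assumptions: the coefficients are those of the public Muon reference implementation (the article does not say which ones the paper uses), singular values are taken as already normalized into [0, 1], the `floor` threshold for "adequately orthonormalized" is an arbitrary illustrative choice, and the reference value `sigma_ref` is made up. Only the -0.96 exponent and the 2.8B model size come from the article.

```python
# Quintic coefficients from the public Muon reference implementation
# (assumption: the paper's exact coefficients are not given in this article).
A, B, C = 3.4445, -4.7750, 2.0315


def ns_scalar(sigma, steps=5):
    """Apply `steps` NS iterations to a single normalized singular value."""
    for _ in range(steps):
        sigma = A * sigma + B * sigma**3 + C * sigma**5
    return sigma


def min_ns_steps(sigma, floor=0.5, max_steps=20):
    """Smallest number of NS steps that lifts `sigma` to at least `floor`.

    `floor` is a hypothetical quality threshold, not a value from the paper.
    Returns None if `max_steps` is not enough.
    """
    for k in range(max_steps + 1):
        if sigma >= floor:
            return k
        sigma = A * sigma + B * sigma**3 + C * sigma**5
    return None


def extrapolate(sigma_ref, size_ref, size_new, exponent):
    """Scale a reference quantile to a new model size along a power law."""
    return sigma_ref * (size_new / size_ref) ** exponent


# Hypothetical reference point: a late-layer quantile of 1e-2 at 2.8B params.
sigma_ref, size_ref = 1e-2, 2.8e9
sigma_big = extrapolate(sigma_ref, size_ref, 100 * size_ref, exponent=-0.96)

for label, s in [("reference", sigma_ref), ("100x larger", sigma_big)]:
    print(label, "sigma:", s, "after 5 steps:", ns_scalar(s),
          "min steps:", min_ns_steps(s))
```

Near zero each step multiplies a singular value by roughly 3.4445, so five steps amplify it by a factor of a few hundred at most. A quantile that shrinks like M^{-0.96} eventually drops below what that amplification can recover, which is one way to read the "NS failure regime" the article mentions.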

Why This Matters

The paper, published in Japanese, points to a path toward more efficient training as AI models grow ever larger. Western coverage has largely overlooked this aspect of AI training. The reported results offer a roadmap for avoiding unnecessary computation. Set the numbers for different layers and model sizes side by side, and it's clear: Muon's NS iterations, despite being expensive, can be tuned to orthonormalize the directions that matter most.

So what's the takeaway? As model sizes continue to balloon, AI engineers must carefully consider these power laws. The choice of optimizer configuration could very well define the scalability and efficiency of next-generation AI. Are we ready to embrace these new methodologies, or will we stick with outdated practices?
