Mousse: Rethinking Spectral Optimization for Deep Learning
Mousse challenges traditional spectral optimization methods, promising faster and more stable training for large-scale neural networks by addressing the limitations of isotropic constraints.
Spectral optimization has made significant strides, particularly with Muon, which leverages the Stiefel manifold to improve training speed and generalization. Yet Muon's assumption of an isotropic optimization landscape breaks down in deep neural networks, where the curvature spectrum is heavy-tailed and far from uniform. Enter Mousse, a fresh approach that melds spectral stability with the geometric adaptivity of second-order preconditioning.
The Problem with Isotropy
Why should we concern ourselves with isotropy in optimization? In the high-dimensional loss landscape of a neural network, not all directions are created equal. Muon's uniform update norm can inadvertently amplify instabilities in high-curvature directions, while progress in flatter directions stagnates. This isn't just a minor flaw; it's a fundamental inefficiency that Mousse seeks to rectify.
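To see the issue concretely, here is a minimal sketch of an isotropic spectral update. Muon itself orthogonalizes with a Newton-Schulz iteration; this sketch uses a plain SVD-based polar factor instead, and the function name and tiny diagonal example are illustrative, not Muon's actual implementation:

```python
import numpy as np

def isotropic_spectral_update(grad: np.ndarray) -> np.ndarray:
    """Replace the gradient by its polar (semi-orthogonal) factor, so every
    singular direction receives the same unit magnitude. Muon approximates
    this with a Newton-Schulz iteration; plain SVD is used here for clarity."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

# A gradient whose singular values span three orders of magnitude...
g = np.diag([10.0, 0.01])
update = isotropic_spectral_update(g)

# ...is flattened to uniform magnitude: both directions move equally,
# regardless of how sharp or flat the loss is along each of them.
print(np.linalg.svd(update, compute_uv=False))  # singular values are all 1
```

The equal step size in every singular direction is exactly the isotropy the article objects to: a direction with curvature 10 and a direction with curvature 0.01 get identical treatment.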
Mousse: A New Perspective
Mousse stands out by executing optimization in a whitened coordinate space, informed by Kronecker-factored statistics borrowed from Shampoo. This approach retains the structural benefits of spectral methods while adapting dynamically to the landscape's varying curvature. Essentially, Mousse reformulates the update as spectral steepest descent under an anisotropic trust region, computing the optimal step via the polar decomposition of the whitened gradient.
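That recipe can be sketched in a few lines. To be clear, this is a hypothetical rendering under stated assumptions, not Mousse's published algorithm: the Kronecker factors `A` and `B` are taken to be Shampoo-style row and column second moments of the gradient, whitening uses inverse matrix square roots, and the damping constant is arbitrary. Maximizing gradient alignment subject to a spectral-norm bound on the whitened step does yield the polar factor of the whitened gradient, which is the structure the article describes:

```python
import numpy as np

def whitened_polar_update(grad, A, B, eps=1e-8):
    """Sketch of an anisotropic spectral update (hypothetical rendering):
    1. whiten the gradient with inverse square roots of Kronecker-factored
       curvature statistics A (rows) and B (columns);
    2. take the polar factor of the whitened gradient, i.e. spectral
       steepest descent inside the whitened (isotropic) unit ball;
    3. map back, so the trust region is anisotropic in original coordinates."""
    def inv_sqrt(M):
        w, V = np.linalg.eigh(M)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    Wa, Wb = inv_sqrt(A), inv_sqrt(B)
    g_white = Wa @ grad @ Wb                       # whitened gradient
    u, _, vt = np.linalg.svd(g_white, full_matrices=False)
    return Wa @ (u @ vt) @ Wb                      # un-whitened polar factor

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))
A = G @ G.T + 1e-3 * np.eye(4)   # Shampoo-style row statistic (assumed)
B = G.T @ G + 1e-3 * np.eye(3)   # Shampoo-style column statistic (assumed)
step = whitened_polar_update(G, A, B)
print(step.shape)  # (4, 3)
```

When `A` and `B` are identity matrices, the whitening is a no-op and the update collapses back to Muon's isotropic polar factor, which is why Mousse can be read as a curvature-aware generalization rather than a replacement.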
The Results Speak
In practical terms, Mousse is achieving impressive outcomes. On language models ranging from 160 million to 800 million parameters, it consistently outperforms Muon, reducing training steps by approximately 12% while maintaining similar computational demands.
Rethinking Optimization
Should we abandon isotropic constraints entirely? Mousse suggests we need a more nuanced approach than rigid isotropy allows. By addressing the heavy-tailed, ill-conditioned landscapes of deep learning, Mousse offers a compelling case for rethinking how we approach optimization in neural networks. It's about time we embrace these insights to drive forward the next wave of AI advancements.
Key Terms Explained
Compute: The processing power needed to train and run AI models.
Deep Learning: A subset of machine learning that uses neural networks with many layers (hence 'deep') to learn complex patterns from large amounts of data.
Optimization: The process of finding the best set of model parameters by minimizing a loss function.
Training: The process of teaching an AI model by exposing it to data and adjusting its parameters to minimize errors.