MONA Optimizer: The New Frontier in Language Model Training

By Nadia Osei, May 27, 2026

MONA introduces a novel take on optimization by merging orthogonalization with curvature-aware acceleration, surpassing Muon and AdamW in performance.

In the intricate dance of training large language models, optimizers take center stage. The latest to make waves is MONA, an advanced optimizer that combines Muon's matrix orthogonalization with a curvature-aware acceleration strategy. This might just be the breakthrough we've been waiting for in the battle against sharp local minima.

Understanding the MONA Difference

MONA's secret sauce lies in its ability to circumvent the traps that first-order methods often fall into. By integrating an acceleration term derived from the exponential moving average of gradient differences, MONA sidesteps the stagnation that often plagues optimizers like Muon and AdamW. This isn't just theoretical mumbo jumbo. A detailed convergence analysis underscores MONA's prowess in maintaining Muon's spectral-norm regularization while deftly escaping sharp minima.
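To make the idea concrete, here is a minimal NumPy sketch of how an exponential moving average of gradient differences could be folded into a Muon-style orthogonalized update. It is an illustration under stated assumptions, not MONA's published update rule: the function names (`orthogonalize`, `sketch_step`), the hyperparameters (`beta`, `beta_diff`, `alpha`), and the way the acceleration term is added to the momentum are all hypothetical. The Newton-Schulz coefficients are the ones commonly used in public Muon implementations.

```python
import numpy as np

def orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration
    (coefficients as used in public Muon implementations)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Dividing by the Frobenius norm keeps every singular value at or below 1.
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def sketch_step(W, grad, state, lr=0.02, beta=0.95, beta_diff=0.9, alpha=1.0):
    """One hypothetical update for a 2-D weight matrix W.

    ASSUMPTION: the acceleration term is an EMA of successive gradient
    differences, added to the momentum before orthogonalization. The
    article does not give the exact formula.
    """
    zeros = np.zeros_like(grad)
    prev = state.get("prev_grad")
    diff = grad - prev if prev is not None else zeros
    # Muon-style momentum buffer.
    state["momentum"] = beta * state.get("momentum", zeros) + grad
    # EMA of gradient differences: a cheap proxy for curvature along the trajectory.
    state["diff_ema"] = beta_diff * state.get("diff_ema", zeros) + (1 - beta_diff) * diff
    direction = orthogonalize(state["momentum"] + alpha * state["diff_ema"])
    state["prev_grad"] = grad.copy()
    return W - lr * direction

# Usage: two steps on a toy 2x3 weight matrix.
G = np.array([[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]])
state = {}
W = np.ones((2, 3))
W = sketch_step(W, G, state)
W = sketch_step(W, 2.0 * G, state)
```

Because the direction is orthogonalized after the two terms are combined, its singular values stay close to 1 however large the acceleration term grows. That is one plausible way to keep Muon's spectral-norm behavior intact while still reacting to changes in the gradient.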

Performance: Numbers That Speak Volumes

Empirical results paint a compelling picture. MONA isn't just another name in the optimizer game. It's outperforming both Muon and AdamW across varying scales of Mixture-of-Experts pretraining, from models with 1 billion to a whopping 68 billion parameters. The largest of these models was trained on an impressive 1 trillion tokens. These aren't just numbers. They signify a leap in how efficiently and effectively we can train expansive models.

Breaking New Ground in Benchmarks

On benchmarks, MONA doesn't shy away. Its application in supervised fine-tuning on the MOE-68B-A3B model shows state-of-the-art (SOTA) performance across the capabilities that matter: general tasks, mathematical reasoning, and code generation. The question isn't whether MONA can deliver. It's how soon it will redefine the benchmark standards entirely.

But why should we care? Simple. The world of AI isn't just about building models. It's about optimizing them to perform beyond our expectations. Slapping a model on a GPU rental isn't a convergence thesis. MONA's integration of orthogonalization with curvature-aware tactics is a step toward truly intelligent AI systems. As developers and researchers, we need to ask: are we ready for the next chapter in AI training?
