Unraveling the Mystery of Grokking in Machine Learning
Grokking is a sudden leap in understanding that occurs after a model has memorized its data. Recent research uncovers a scaling law for this delay.
The phenomenon of grokking in machine learning is akin to that 'aha' moment, where a model suddenly generalizes long after it has perfectly memorized its training data. But what's behind this delay, and can we predict it?
Weight Decay and Grokking
Researchers have long observed that weight decay plays a pivotal role in grokking. But until recently, no quantitative theory explained how long grokking takes to occur after memorization. The breakthrough? A first-principles theory suggesting a norm-driven representational phase transition in regularized training dynamics. In simpler terms, a model first finds a high-norm solution that memorizes the data, then drifts to a lower-norm, more structured representation that generalizes.
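The norm-driven picture can be illustrated with a toy calculation. Under weight decay alone, each SGD step multiplies the weights by (1 - eta * lam), so the weight norm contracts geometrically from the high-norm memorizing solution toward the lower-norm one. The norms and hyperparameters below are hypothetical values chosen for illustration, not figures from the paper:

```python
import math

# Assumed hyperparameters (illustrative only).
eta, lam = 1e-2, 1e-1   # learning rate and weight decay

norm = 10.0             # hypothetical weight norm at memorization
target = 5.0            # hypothetical norm of the generalizing solution

# Count steps until weight decay alone contracts the norm to the target.
steps = 0
while norm > target:
    norm *= (1 - eta * lam)   # pure weight-decay contraction per step
    steps += 1

# Geometric contraction means the step count grows only logarithmically
# with the norm ratio: roughly log(10 / 5) / (eta * lam) steps.
print(steps)
```

The logarithmic dependence on the norm ratio, divided by eta * lam, is exactly the shape of the scaling law the article describes next.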
The Scaling Law Revealed
Here's where it gets interesting. The delay between memorization and grokking isn't random. It's governed by a specific scaling law: T_grok - T_mem = Theta((1 / gamma_eff) * log(||theta_mem||^2 / ||theta_post||^2)). The effective contraction rate of the optimizer, denoted as gamma_eff, is key. For stochastic gradient descent (SGD), gamma_eff equals eta * lambda, while for AdamW, it's at least eta * lambda. This scaling law emerges from a mix of Lyapunov contraction arguments and regularized first-order optimization dynamics.
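The scaling law is simple enough to evaluate directly. The helper below is a sketch, not code from the study; the function name and the input values are made up for illustration, and the result is only meaningful up to the constant hidden in the Theta notation:

```python
import math

def predicted_grok_delay(eta, lam, norm_mem_sq, norm_post_sq):
    """Predicted delay (in steps, up to a constant factor) between
    memorization and grokking, per the scaling law:
        T_grok - T_mem = Theta((1 / gamma_eff) * log(||theta_mem||^2 / ||theta_post||^2))
    For SGD, the effective contraction rate is gamma_eff = eta * lam."""
    gamma_eff = eta * lam
    return (1.0 / gamma_eff) * math.log(norm_mem_sq / norm_post_sq)

# Illustrative (made-up) values: halving the squared norm with
# eta = 1e-3 and lam = 1e-2 predicts a delay of ~69,315 steps.
delay = predicted_grok_delay(eta=1e-3, lam=1e-2,
                             norm_mem_sq=200.0, norm_post_sq=100.0)
```

Note how the formula encodes both predicted dependencies: the delay shrinks as eta or lam grows, and grows only logarithmically in the norm ratio.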
Why Grokking Matters
The study's findings across 293 training runs on tasks like modular addition, multiplication, and sparse parity confirm the key predictions: the grokking delay scales inversely with weight decay and learning rate, and depends logarithmically on the norm ratio, with R² > 0.97. Surprisingly, not all optimizers are created equal: while AdamW groks reliably, SGD falters under specific hyperparameter settings.
Why should we care about this arcane-sounding delay? Because understanding grokking could revolutionize how we approach training models. If we can predict when a model will truly generalize, we can optimize training processes, saving time and computational resources.
The Road Ahead
These insights provide a foundation for further exploration. Will future research refine these scaling laws, or uncover new variables that influence grokking? As models and tasks grow increasingly complex, such foundational knowledge will be invaluable. Are we on the cusp of a new era in understanding training dynamics?
Key Terms Explained
Gradient descent: The fundamental optimization algorithm used to train neural networks.
Learning rate: A hyperparameter that controls how much the model's weights change in response to each update.
Machine learning: A branch of AI where systems learn patterns from data instead of following explicitly programmed rules.
Optimization: The process of finding the best set of model parameters by minimizing a loss function.