FOGO: Tackling Forgetting in Machine Learning One Gradient at a Time

Forgetting isn't just a hiccup in continual learning. It's a broader optimization hurdle that pops up every time a model trains. Here's the thing: the dominant mini-batch gradients often overshadow those rare but key update directions, leading to short-term forgetting at every step. If those forgotten bits of knowledge aren't revisited, they compound into what's known in the field as long-term forgetting. It's the classical pitfall of continual learning.

Meet FOGO

Enter FOGO, a scalable optimizer that's tackling this forgetfulness head-on. How? By detecting and resolving gradient interference. FOGO spectrally orthogonalizes its momentum updates, which prevents the dominant directions from hogging the optimization process. Think of it this way: it's like giving each direction its fair shot rather than letting the loudest voice in the room dominate.
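To make "spectrally orthogonalizes" concrete, here is a minimal NumPy sketch of the general idea: take the momentum matrix, keep its singular directions, and set every nonzero singular value to 1. This is an illustration under assumptions, not FOGO's actual implementation; the function name `spectral_orthogonalize` and the toy matrices are hypothetical, and a real optimizer would likely use a cheaper approximation than a full SVD.

```python
import numpy as np

def spectral_orthogonalize(momentum: np.ndarray) -> np.ndarray:
    """Return a matrix with the same singular directions as `momentum`
    but with every nonzero singular value set to 1 (hypothetical helper)."""
    u, s, vt = np.linalg.svd(momentum, full_matrices=False)
    keep = s > 1e-12 * s.max()  # drop numerically zero directions
    return u[:, keep] @ vt[keep, :]

rng = np.random.default_rng(0)
# Toy momentum: one loud direction plus one quiet, "rare" direction.
dominant = 10.0 * np.outer(rng.standard_normal(4), rng.standard_normal(3))
rare = 0.1 * np.outer(rng.standard_normal(4), rng.standard_normal(3))
momentum = dominant + rare

update = spectral_orthogonalize(momentum)
singular_before = np.linalg.svd(momentum, compute_uv=False)
singular_after = np.linalg.svd(update, compute_uv=False)
print("before:", singular_before)  # very unequal magnitudes
print("after: ", singular_after)   # surviving directions all have weight 1
```

After the transform, the loud and the quiet direction contribute equally to the update, which is the "fair shot" described above.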

FOGO also stores past directions in a compact codebook memory. The codebook is built on random projection, which approximately preserves pairwise distances even after the directions are compressed into a low-dimensional space. In simple terms, it keeps a neat, compressed record of useful directions without storing any training data. Whenever there's a clash between the current update and stored directions, FOGO steps in with a lightweight orthogonal correction.
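The two ingredients in that paragraph can be sketched separately in NumPy. The first part shows that a Gaussian random projection roughly preserves distances, and that a conflict (a negative inner product) is still visible in the compressed space. The second part shows a standard orthogonal correction that removes the conflicting component of an update. Everything here is an assumption for illustration: the sizes, the names `projection` and `orthogonal_correction`, and the use of a negative inner product as the conflict test. How FOGO maps a compressed code back to a parameter-space correction is not described here, so the correction below simply takes a full-dimensional direction as input.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, code_dim = 2048, 512  # hypothetical sizes

# Gaussian random projection, scaled so squared norms are preserved in expectation.
projection = rng.standard_normal((code_dim, dim)) / np.sqrt(code_dim)

# Part 1: distances survive the compression approximately.
a, b = rng.standard_normal(dim), rng.standard_normal(dim)
full_distance = np.linalg.norm(a - b)
code_distance = np.linalg.norm(projection @ a - projection @ b)
relative_error = abs(code_distance - full_distance) / full_distance

# Part 2: remove the part of an update that opposes a stored direction.
def orthogonal_correction(update: np.ndarray, stored_direction: np.ndarray) -> np.ndarray:
    unit = stored_direction / np.linalg.norm(stored_direction)
    overlap = update @ unit
    if overlap >= 0:  # no conflict (assumed test): leave the update alone
        return update
    return update - overlap * unit  # project out the conflicting component

stored = rng.standard_normal(dim)
update = rng.standard_normal(dim) - 0.5 * stored  # deliberately opposes `stored`

# The conflict is detectable from the compressed codes alone.
conflict_in_code_space = (projection @ update) @ (projection @ stored) < 0
corrected = orthogonal_correction(update, stored)

print("relative distance error:", relative_error)
print("conflict seen in code space:", conflict_in_code_space)
print("overlap after correction:", corrected @ stored)
```

The corrected update keeps everything that does not fight the stored direction, which is why the fix can stay lightweight: it is one inner product and one vector subtraction per conflicting direction.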

Why Should We Care?

Here's why this matters for everyone, not just researchers. Across various applications, be it class-imbalanced classification, continual visual learning, fine-tuning LLaVA-7B, or pretraining GPT-2, FOGO consistently outshines the well-known Adam and Muon optimizers. It achieves better convergence and retains knowledge more effectively. If you've ever trained a model, you know how frustrating it is to lose ground on tasks a model supposedly learned. FOGO seems to offer a way out of that trap.

Now, let's consider the broader implications. In a world where models are expected to learn continuously across shifting domains and classes, retention becomes critical. Can FOGO be the answer we've been looking for? Honestly, I think it's a step in the right direction. By addressing the root cause of forgetting, this approach could redefine how we think about optimization in machine learning.

FOGO's technique of resolving gradient conflicts through orthogonal corrections isn't just clever, it's necessary. With minimal overhead and no need for data storage, it promises an efficient, scalable solution. But here's the provocative question: Will this become the new standard, or is it just another fleeting trend in the relentless pursuit of optimization?
