The Complex Art of Forgetting: AI Models and Their Memory Quirks
AI models forget different data points during fine-tuning, revealing architectural quirks. These findings could reshape how we approach training techniques.
In AI, understanding what models forget as they learn isn't just an academic exercise. It has real implications for how we design training regimens and optimize performance. Recent findings shed light on the complex forgetting patterns of AI models, and the results may surprise you.
Architectures That Forget Differently
When it comes to forgetting, not all AI models are created equal. Take ResNet-18 and DeiT-Small, for instance. These architectures process and, importantly, forget information in distinct ways. Research on their performance with a retinal OCT dataset and a bird species dataset (CUB-200-2011) has shown that the overlap of forgotten samples between these two architectures is quite low: Jaccard overlap scores were a mere 0.34 and 0.15 on the respective datasets. Clearly, different architectures forget fundamentally different samples.
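The Jaccard overlap used here is simply the size of the intersection of the two forgotten-sample sets divided by the size of their union. A minimal sketch (the sample IDs below are made up for illustration):

```python
def jaccard_overlap(forgotten_a: set, forgotten_b: set) -> float:
    """Jaccard similarity between two sets of forgotten sample IDs."""
    if not forgotten_a and not forgotten_b:
        return 1.0  # both empty: identical by convention
    return len(forgotten_a & forgotten_b) / len(forgotten_a | forgotten_b)

# Hypothetical forgotten-sample IDs from two architectures
resnet_forgot = {3, 7, 12, 19, 42, 57}
deit_forgot = {7, 19, 88, 101, 202, 330}

print(jaccard_overlap(resnet_forgot, deit_forgot))  # 2 shared / 10 unique -> 0.2
```

A score near 0 means the two models stumble on almost entirely different examples, which is exactly what the 0.15 result suggests.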
Structured Versus Stochastic Forgetting
It's intriguing to note that Vision Transformers (ViTs) like DeiT-Small forget in a more structured manner than CNNs like ResNet-18. With a mean R-squared value of 0.74 for ViTs versus 0.52 for CNNs, the predictability of forgetting in ViTs is notably higher. But here's the kicker: across training runs with different random seeds, per-sample forgetting is largely stochastic. The correlation between which samples are forgotten in different runs is almost nonexistent, with Spearman's rho hovering around 0.01.
The Nature of Sample Difficulty
We often assume that if a model repeatedly forgets specific samples, those samples must be inherently 'difficult'. Yet, the stochastic nature of forgetting challenges this assumption. If sample difficulty isn't intrinsic, what factors are truly at play? Could it be the dataset balance, the architecture, or something more elusive?
Implications for Curriculum Design
Forgetfulness patterns extend beyond individual samples to class-level data. Visually similar species are consistently forgotten more than distinctive ones, suggesting a semantic dimension to forgetting. This insight could guide curriculum design or data pruning, though there's a catch. Even when samples were ordered by difficulty using their loss after initial training, the fitted decay constants of their retention offered little predictive power. Static scheduling methods, like spaced repetition based solely on these constants, fail to outperform random sampling.
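To make the decay-constant idea concrete: one common way to get such a constant is to model a sample's retention as exponential decay, R(t) ≈ exp(-t / tau), and fit tau by log-linear least squares. This is an illustrative sketch of that modeling assumption, not the specific procedure from the research:

```python
import math

def fit_decay_constant(epochs, retention):
    """Fit tau in R(t) ~= exp(-t / tau) via least squares on log R.

    Assumes retention values lie in (0, 1]; log R is then linear in t
    with slope -1/tau.
    """
    ys = [math.log(r) for r in retention]
    n = len(epochs)
    mx, my = sum(epochs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(epochs, ys)) / sum(
        (x - mx) ** 2 for x in epochs
    )
    return -1.0 / slope

# Synthetic retention curve with a true tau of 5 epochs
ts = [1, 2, 3, 4, 5, 6]
rs = [math.exp(-t / 5) for t in ts]
print(fit_decay_constant(ts, rs))  # recovers ~5.0
```

The finding above is the sobering part: scheduling reviews of samples purely from constants like these, fixed after initial training, did no better than sampling at random.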
Why It Matters
If different architectures forget largely disjoint sets of samples, ensembles that combine them can cover each other's blind spots. Understanding the quirks of forgetting lets us exploit that architectural diversity, and might just lead to more reliable AI systems. But here's the big question: are we at a point where training strategies need a significant rethink? The data suggests that sticking with static methods may be limiting our potential.
As AI continues its relentless march forward, understanding how these models forget might just be the key to unlocking their full potential.
Key Terms Explained
Compute: The processing power needed to train and run AI models.
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
Sampling: The process of selecting the next token from the model's predicted probability distribution during text generation.
Training: The process of teaching an AI model by exposing it to data and adjusting its parameters to minimize errors.