The Surprising Instability of AI Memory in Image Classifiers
Fine-tuning image classifiers reveals unstable forgetting patterns across different architectures. Forgetting isn't as predictable as you might think.
Fine-tuning pretrained image classifiers has become a standard move in the AI toolkit. But have you ever wondered which samples these models forget, and whether the forgetting patterns are stable? Spoiler alert: they're not. Recent experiments on ResNet-18 and DeiT-Small expose some curious dynamics that debunk the notion of predictable forgetting.
Forget About Consistency
The study tracked per-sample correctness during fine-tuning on two datasets: a highly imbalanced retinal OCT set with seven classes and CUB-200-2011, featuring 200 bird species. The findings are as intriguing as they are unsettling. For starters, ResNet-18 and DeiT-Small forget different samples entirely. The Jaccard overlap of the top 10 percent most-forgotten samples is a shocking 0.34 on the OCT dataset and even lower, 0.15, on CUB-200. If you've been banking on model consistency, think again.
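To make that overlap figure concrete, here is a minimal sketch (not the study's code) of how the Jaccard overlap of the top 10 percent most-forgotten samples could be computed from two models' per-sample forgetting counts; the counts below are simulated placeholders.

```python
import numpy as np

def jaccard_overlap(forgets_a, forgets_b, top_frac=0.10):
    """Jaccard overlap of the most-forgotten samples from two models.

    forgets_a, forgets_b: per-sample forgetting counts over the same dataset,
    e.g. how many times each sample flips from correct to incorrect during
    fine-tuning.
    """
    k = int(len(forgets_a) * top_frac)
    top_a = set(np.argsort(forgets_a)[-k:])  # indices of model A's top 10%
    top_b = set(np.argsort(forgets_b)[-k:])  # indices of model B's top 10%
    return len(top_a & top_b) / len(top_a | top_b)

# Simulated stand-ins for ResNet-18 and DeiT-Small forgetting counts.
rng = np.random.default_rng(0)
resnet_forgets = rng.integers(0, 10, size=5000)
deit_forgets = rng.integers(0, 10, size=5000)
print(f"Jaccard overlap: {jaccard_overlap(resnet_forgets, deit_forgets):.2f}")
```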
The structured forgetting pattern observed in Vision Transformers (ViT), with an R-squared value of 0.74, offers a stark contrast to the more haphazard forgetting of CNNs, which scored an R-squared of 0.52. Clearly, not all neural architectures are created equal in their retention capabilities.
The Myth of Intrinsic Difficulty
One might assume that sample difficulty is an intrinsic property. Not so fast. The study found that per-sample forgetting across random seeds is nearly stochastic, with a Spearman correlation of about 0.01. This randomness challenges existing assumptions and raises a critical question: if difficulty isn't intrinsic, what is it?
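If you want to probe this on your own fine-tuning runs, a minimal sketch of the cross-seed check might look like the following; the forgetting counts are simulated, and the analysis choice (rank correlation of per-sample counts) is an assumption about how such a comparison would be set up.

```python
import numpy as np
from scipy.stats import spearmanr

# Simulated per-sample forgetting counts from two runs that differ only
# in the random seed (same model, same dataset).
rng = np.random.default_rng(1)
forgets_seed0 = rng.poisson(2.0, size=5000)
forgets_seed1 = rng.poisson(2.0, size=5000)

# A Spearman correlation near zero means the ranking of "hard to retain"
# samples barely carries over from one seed to the next.
rho, pval = spearmanr(forgets_seed0, forgets_seed1)
print(f"Spearman rho across seeds: {rho:.3f} (p = {pval:.2g})")
```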
Class-level forgetting, however, tells a more coherent story. Visually similar species are forgotten more easily than their distinctive peers, offering a glimpse into the semantic underpinnings of AI errors. Interestingly, a sample's loss after a head warmup can predict its long-term decay constant, with correlations ranging from 0.30 to 0.50. That's something, but it's far from the whole story.
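One plausible way to operationalize that relationship is to fit a per-sample decay rate from the epoch-by-epoch correctness record and correlate it with the loss measured right after the head warmup. The sketch below assumes that framing; the fitting procedure, retention-curve smoothing, and variable names are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def fit_decay_constant(correct_per_epoch):
    """Fit a retention curve p(t) ~ exp(-lam * t) to one sample's history.

    correct_per_epoch: 0/1 correctness of this sample at each fine-tuning
    epoch. Returns lam (larger means the sample is forgotten faster).
    """
    t = np.arange(len(correct_per_epoch))
    # Smooth the 0/1 record into a running retention estimate, fit in log space.
    p = np.clip(np.cumsum(correct_per_epoch) / (t + 1), 1e-3, 1.0)
    slope, _intercept = np.polyfit(t, np.log(p), 1)
    return -slope

# Simulated inputs: correctness matrix (n_samples x n_epochs) and the
# per-sample loss recorded after the linear-head warmup phase.
rng = np.random.default_rng(2)
correctness = rng.integers(0, 2, size=(1000, 30))
warmup_loss = rng.gamma(2.0, 1.0, size=1000)

decay = np.array([fit_decay_constant(row) for row in correctness])
rho, _ = spearmanr(warmup_loss, decay)
print(f"Spearman(warmup loss, decay constant) = {rho:.2f}")
```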
The False Promise of Static Schedules
You'd think that, armed with this knowledge, we could engineer a better sampling strategy. Yet when researchers built a spaced repetition sampler on top of these decay constants, it failed to outperform random sampling. The implication? Static scheduling can't capitalize on per-sample signals that shift from run to run.
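For intuition, here is a minimal sketch of what a static, decay-weighted sampler could look like in PyTorch; the idea (revisit fast-decaying samples more often, spaced-repetition style) follows the article, but the specific weighting scheme is our own simplification, not the study's implementation.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Simulated per-sample decay constants estimated once, after warmup.
rng = np.random.default_rng(3)
decay = rng.gamma(2.0, 0.5, size=5000)

# Spaced-repetition intuition: faster decay -> shorter review interval,
# so give fast-decaying samples proportionally more weight. Because the
# weights are fixed up front, this is exactly the kind of static schedule
# the study found to be no better than random sampling.
weights = torch.as_tensor(decay / decay.sum(), dtype=torch.double)
sampler = WeightedRandomSampler(weights=weights, num_samples=len(weights),
                                replacement=True)

# Pass the sampler to a DataLoader, e.g.
# DataLoader(train_dataset, batch_size=64, sampler=sampler)
print(list(sampler)[:10])  # first few sampled indices
```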
These findings suggest that architecturally diverse ensembles could offer complementary retention coverage. But the real kicker is this: curriculum or pruning methods built on per-sample difficulty may not generalize across runs, because the set of "hard" samples itself changes from one run to the next.
Key Terms Explained
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
GPU: Graphics Processing Unit, the hardware accelerator typically used to train and fine-tune neural networks.
Sampling: The process of selecting which training examples the model sees, and how often, during training.