The Dilemma of Training AI on Synthetic Data
Training AI models on synthetic data can lead to a drift from the target distribution. This article delves into the reasons and implications of this issue.
Machine learning models increasingly rely on synthetic data for training and fine-tuning. But there's a hitch: recursively training on such data can significantly degrade performance across a range of tasks. A recent paper, published in Japanese, shows that this degradation often manifests as a progressive drift away from the intended target distribution.
Unveiling the Drift
Focusing on score-based diffusion models, the researchers provide a theoretical framework for analyzing this drift. What the English-language press missed: the work is presented as the first to establish a lower bound on the divergence between the learned and target distributions, even for standard diffusion models. The findings are both groundbreaking and concerning.
Consider a typical pipeline where each training round incorporates synthetic data alongside fresh samples from the target distribution. In such scenarios, the study offers upper and lower bounds on the accumulated divergence, essentially quantifying the drift from the target.
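To make that pipeline concrete, here is a minimal toy sketch in Python. It is not the paper's method: it swaps the diffusion model for a one-dimensional Gaussian fit, assumes the target is a standard normal, and the names `simulate_retraining` and `fresh_fraction` are hypothetical. It only illustrates the loop of mixing fresh and synthetic samples, refitting, and measuring divergence from the target.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """KL(q || p) between two 1-D Gaussians, in nats."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
            - 0.5)

def simulate_retraining(fresh_fraction, n_rounds=20, n_samples=500, seed=0):
    """Toy stand-in for recursive training (assumption: Gaussian fit, not diffusion).

    Each round draws `fresh_fraction` of the data from the target N(0, 1)
    and the rest from the previous round's fitted model, then refits.
    Returns the KL divergence from the target after every round.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # generation 0 starts at the target
    divergences = []
    for _ in range(n_rounds):
        n_fresh = int(round(fresh_fraction * n_samples))
        fresh = rng.normal(0.0, 1.0, size=n_fresh)                     # real data
        synthetic = rng.normal(mu, sigma, size=n_samples - n_fresh)    # model output
        data = np.concatenate([fresh, synthetic])
        mu, sigma = data.mean(), data.std()                            # "retrain"
        divergences.append(gaussian_kl(mu, sigma))
    return divergences

if __name__ == "__main__":
    for frac in (0.0, 0.5):
        final = simulate_retraining(frac)[-1]
        print(f"fresh_fraction={frac}: KL after 20 rounds = {final:.5f}")
```

Running the script with different values of `fresh_fraction` gives a rough feel for how the share of fresh data affects the accumulated divergence in this toy setting; the numbers say nothing about real diffusion models.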
Why Does This Matter?
So why should we care about this drift? The study characterizes different drift regimes according to two factors: the score estimation error at each generation and the proportion of fresh data used in each round. Crucially, in some settings, the divergence accumulated after multiple retraining rounds can be expressed as a discounted sum of the score estimation errors from each generation, so errors from earlier rounds still count, but with less weight than recent ones. It's a glaring reminder of how dependent our models are on data quality and distribution alignment.
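A short Python sketch can show what a discounted sum of per-generation errors looks like. The exact form in the paper is not given here, so this assumes the simplest version: a single constant discount factor applied once per round. The function name `discounted_error_sum` and the parameter `discount` are hypothetical.

```python
def discounted_error_sum(errors, discount):
    """Accumulate per-generation errors with geometric discounting.

    Assumed form (not taken from the paper):
        D_N = sum_{k=1..N} discount**(N - k) * errors[k-1]
    so the most recent error has weight 1 and older errors fade.
    """
    total = 0.0
    for err in errors:
        # Older contributions shrink by `discount` each new round.
        total = discount * total + err
    return total

if __name__ == "__main__":
    errors = [1.0, 1.0, 1.0]
    print(discounted_error_sum(errors, 0.5))  # 0.25 + 0.5 + 1.0 = 1.75
    print(discounted_error_sum(errors, 1.0))  # no discounting: 3.0
```

Under this assumed form, a discount below 1 with a constant per-round error `e` keeps the total bounded by `e / (1 - discount)`, while a discount of 1 lets the errors pile up without limit.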
A Call for Better Practices
Empirical results on synthetic data and on images back up the theoretical claims. But here's the burning question: are we setting ourselves up for failure by relying too heavily on synthetic data? The results point to a clear need for training practices that balance synthetic and real data.
Western coverage has largely overlooked this, but it's time we paid attention. This is a wake-up call for AI practitioners to reassess the balance of data sources in their training regimes. As synthetic data becomes more prevalent, it's imperative to develop methods that mitigate drift and maintain alignment with target distributions.
Key Terms Explained
Attention: A mechanism that lets neural networks focus on the most relevant parts of their input when producing output.
Benchmark: A standardized test used to measure and compare AI model performance.
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
Machine learning: A branch of AI where systems learn patterns from data instead of following explicitly programmed rules.