The Dilemma of Training AI on Synthetic Data

By Rina Shimizu | June 1, 2026

Training AI models on synthetic data can cause them to drift away from the target distribution. This article looks at why that happens and what it implies.

Machine learning models increasingly rely on synthetic data for training and fine-tuning. But there's a hitch: recursively training on such data can significantly degrade performance across a range of tasks. A paper published in Japanese shows that this degradation often manifests as a progressive drift away from the intended target distribution.

Unveiling the Drift

Focusing on score-based diffusion models, the researchers provide a theoretical framework for analyzing this drift. What the English-language press missed: this is the first time a lower bound has been established on the divergence between the learned and target distributions, even for standard diffusion models. The findings are both groundbreaking and concerning.

Consider a typical pipeline where each training round incorporates synthetic data alongside fresh samples from the target distribution. In such scenarios, the study offers upper and lower bounds on the accumulated divergence, essentially quantifying the drift from the target.
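The retraining loop described above can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions and not the paper's setup: a one-dimensional Gaussian fitted by maximum likelihood stands in for the diffusion model, and the KL divergence from the target to the fitted model stands in for the divergence the study bounds. The names `simulate_retraining`, `kl_gaussian`, and `fresh_fraction` are hypothetical and chosen only for illustration.

```python
import math
import random


def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) between two one-dimensional Gaussians."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


def simulate_retraining(rounds, n_samples, fresh_fraction, seed=0):
    """Refit a Gaussian each round on a mix of fresh and synthetic samples.

    Assumption: the target is a standard normal, and "training" is a
    plain mean/variance fit. Returns KL(target || model) after each round.
    """
    rng = random.Random(seed)
    target_mu, target_sigma = 0.0, 1.0
    mu, sigma = target_mu, target_sigma  # generation 0 matches the target
    n_fresh = round(fresh_fraction * n_samples)
    divergences = []
    for _ in range(rounds):
        # Fresh samples come from the target distribution.
        fresh = [rng.gauss(target_mu, target_sigma) for _ in range(n_fresh)]
        # Synthetic samples come from the previous generation's model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_fresh)]
        data = fresh + synthetic
        # "Retrain": maximum-likelihood estimates of mean and std.
        mu = sum(data) / len(data)
        sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))
        divergences.append(kl_gaussian(target_mu, target_sigma, mu, sigma))
    return divergences


if __name__ == "__main__":
    for frac in (0.0, 0.2, 0.8):
        final = simulate_retraining(rounds=50, n_samples=200, fresh_fraction=frac)[-1]
        print(f"fresh_fraction={frac}: divergence after 50 rounds = {final:.4f}")
```

Running the script with different values of `fresh_fraction` gives a rough feel for how the share of fresh data affects accumulated drift in this toy setting. It says nothing quantitative about real diffusion models.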

Why Does This Matter?

So why should we care about this drift? The study characterizes different drift regimes according to two quantities: the score estimation errors and the proportion of fresh data used in each round. Crucially, in some settings, the divergence accumulated after multiple retraining rounds can be expressed as a discounted sum of the score estimation errors at each generation. It's a stark reminder of how dependent our models are on data quality and distribution alignment.
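To make the "discounted sum" idea concrete, here is a small sketch of how such a quantity could be computed. The article does not give the paper's exact discount factor or constants, so `discount` is treated as a free parameter in [0, 1], and the function name `accumulated_divergence` is hypothetical. The sketch only shows the general shape: errors from older generations are weighted less than recent ones.

```python
def accumulated_divergence(score_errors, discount):
    """Discounted sum of per-generation score estimation errors.

    Computes sum_k discount**(n - k) * score_errors[k - 1] for k = 1..n,
    where score_errors is ordered from the oldest generation to the newest.
    Assumption: `discount` is a free parameter; how it depends on the
    fresh-data proportion is not specified in this article.
    """
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must lie in [0, 1]")
    total = 0.0
    for err in score_errors:
        # Each new round shrinks the inherited error, then adds its own.
        total = discount * total + err
    return total


if __name__ == "__main__":
    errors = [0.1] * 20  # hypothetical constant per-round error
    for d in (0.0, 0.5, 0.9, 1.0):
        print(f"discount={d}: accumulated = {accumulated_divergence(errors, d):.3f}")
```

With a discount below 1, a constant per-round error gives a total that stays bounded. With a discount of 1, the total grows linearly with the number of rounds. Which case applies in practice depends on details of the paper that this article does not cover.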

A Call for Better Practices

The empirical results, on both synthetic data and images, back up the theoretical claims. But here's the burning question: are we setting ourselves up for failure by relying too heavily on synthetic data? The results point to a clear need for training practices that balance synthetic and real data.

Western coverage has largely overlooked this, but it's time we paid attention. This is a wake-up call for AI practitioners to reassess the balance of data sources in their training regimes. As synthetic data becomes more prevalent, it's imperative to develop methods that mitigate drift and maintain alignment with target distributions.
