Rethinking Synthetic Data: Overcoming Training Hurdles with MTS

By Rina Shimizu, June 2, 2026

While synthetic data are increasingly used in neural network training, mismatches with real data reduce effectiveness. A new study identifies challenges in the Meta-learning for Training-data Selection (MTS) process and suggests solutions to optimize results.

Synthetic data are transforming neural network training, but using them indiscriminately often backfires because their distribution differs from that of real-world data. This is where Meta-learning for Training-data Selection (MTS) comes into play. MTS aims to optimize data weights through bi-level optimization. In practice, however, its performance often falls short of what the approach promises.
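As a rough illustration of what bi-level optimization over data weights usually looks like, the sketch below gives a generic formulation. It is not taken from the paper: the symbols (model parameters \theta, selector parameters \phi, per-example loss \ell_i, weights w_i, and a validation loss on held-out real data) are assumptions chosen for illustration.

```latex
\begin{aligned}
\min_{\phi}\quad & \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(\phi)\bigr) \\
\text{s.t.}\quad & \theta^{*}(\phi) = \arg\min_{\theta} \sum_{i=1}^{N} w_i(\phi)\,\ell_i(\theta),
\qquad w_i(\phi) \ge 0,\quad \sum_{i=1}^{N} w_i(\phi) = 1
\end{aligned}
```

The inner problem trains the model on the weighted training set; the outer problem adjusts the weights so that the trained model does well on the held-out data. The constraint that the weights sum to one corresponds to the normalized data weights discussed in the article.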

Challenges in MTS Training

The paper, published in Japanese, identifies two primary obstacles to training MTS successfully. First, the gradient signal-to-noise ratio (GSNR) is poor, which complicates the optimization process. Second, the features MTS relies on are not informative enough: they correlate poorly with data quality. Together, these factors make it difficult to harness the full potential of synthetic data in enhancing neural networks.

Notably, the researchers conducted a mathematical analysis of the training dynamics that arise when data weights are normalized. The analysis shows a link between varied data quality and subpar GSNR, which is key to understanding why MTS doesn't always perform as expected.
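For reference, GSNR is commonly defined per parameter as the squared mean of the per-sample gradient divided by its variance. The notation below is an assumption for illustration; the article does not reproduce the paper's exact definition.

```latex
\mathrm{GSNR}(\theta_j)
  = \frac{\bigl(\mathbb{E}_{x}[\,g_j(x)\,]\bigr)^{2}}{\operatorname{Var}_{x}[\,g_j(x)\,]},
\qquad
g_j(x) = \frac{\partial \ell(x;\theta)}{\partial \theta_j}
```

A low value means that the variation between samples swamps the average gradient direction, so each update is driven mostly by noise.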

Proposed Solutions and Experimentation

Crucially, the study suggests a straightforward yet effective remedy: increasing the batch size. This adjustment appears to improve the GSNR, providing a more stable foundation for optimization. Additionally, the researchers propose using a set of informative features that can better capture the position and dynamics of training data within their distributions.
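Why a larger batch helps can be seen in a toy simulation. The sketch below is not the paper's experiment: it assumes a single parameter whose per-sample gradients are independent draws around a fixed true gradient, and the names `TRUE_GRADIENT`, `NOISE_STD`, `gsnr`, and `minibatch_gradients` are hypothetical. Under that assumption, averaging B samples divides the gradient variance by B, so the GSNR of the mini-batch gradient grows roughly in proportion to B.

```python
import random
import statistics


def gsnr(gradients):
    """GSNR of one parameter: squared mean gradient over gradient variance."""
    mean = statistics.fmean(gradients)
    return mean ** 2 / statistics.variance(gradients)


def minibatch_gradients(per_sample, batch_size):
    """Average consecutive groups of `batch_size` per-sample gradients."""
    n_batches = len(per_sample) // batch_size
    return [
        statistics.fmean(per_sample[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    ]


# Hypothetical toy setup: a weak true gradient buried in per-sample noise.
rng = random.Random(0)
TRUE_GRADIENT = 0.5
NOISE_STD = 2.0
per_sample = [rng.gauss(TRUE_GRADIENT, NOISE_STD) for _ in range(65536)]

# Estimate the GSNR of the mini-batch gradient for two batch sizes.
gsnr_by_batch_size = {
    b: gsnr(minibatch_gradients(per_sample, b)) for b in (8, 64)
}
for b, value in gsnr_by_batch_size.items():
    print(f"batch size {b:>2}: GSNR ~ {value:.3f}")
```

In this idealized setting, going from a batch of 8 to a batch of 64 should raise the GSNR by a factor close to 8. Real training gradients are neither independent nor identically distributed, so the gain in practice may be smaller.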

Experiments across four distinct benchmarks support these changes. With them in place, the researchers report an average gain of 5.49% over training without selection and an improvement of 2.89% over the strongest existing baseline.

Why It Matters

So why should anyone care about these findings? The increasing reliance on synthetic data means that even small improvements can lead to significant advancements in AI performance. What the English-language press missed is how these methodological tweaks could set a new standard for data selection in training.

Are we witnessing the beginning of a new era in training methodologies? If these improvements are scalable, they could redefine how we approach neural network training. As AI systems become more integral to critical applications, the importance of using the best possible data can't be overstated.
