The Overlap Dilemma: Rethinking SFT and GRPO in AI Training
Ablation study shows that less data overlap between Supervised Fine-Tuning and Group Relative Policy Optimization stages boosts AI performance. It's a call to rethink training strategies.
In the evolving field of AI, the usual formula of Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO) isn't always the golden ticket to success. Recent research sheds new light on the significance of data overlap between these stages and its impact on model performance.
Understanding the Experiment
Researchers conducted a controlled ablation study focusing on the Qwen3-8B model, particularly its application to Lean 4 autoformalization. They examined six distinct training scenarios, each varying in the degree of overlap between SFT and GRPO data. These scenarios included a base model, SFT-only, GRPO-only, and three configurations with 0 percent, 30 percent, and 100 percent data overlap between SFT and GRPO.
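The overlap conditions described above can be sketched with a small helper that builds SFT and GRPO training sets sharing a chosen fraction of examples. This is an illustrative reconstruction, not the authors' actual pipeline; the function name and sizes are hypothetical.

```python
import random

def make_overlap_splits(examples, overlap_fraction, sft_size, grpo_size, seed=0):
    """Illustrative split: build SFT and GRPO training sets that share a
    chosen fraction of examples (0.0 = disjoint, 1.0 = identical)."""
    rng = random.Random(seed)
    pool = examples[:]
    rng.shuffle(pool)

    n_shared = int(round(overlap_fraction * grpo_size))
    sft_set = pool[:sft_size]
    # Shared examples come from the SFT set; the rest are unseen by SFT.
    shared = sft_set[:n_shared]
    fresh = pool[sft_size:sft_size + (grpo_size - n_shared)]
    grpo_set = shared + fresh
    return sft_set, grpo_set

# Example: 30 percent overlap between a 40-item SFT set and a 40-item GRPO set.
sft, grpo = make_overlap_splits(list(range(100)), 0.30, 40, 40)
print(len(set(sft) & set(grpo)))  # 12 shared examples
```

Setting `overlap_fraction` to 0.0, 0.3, and 1.0 reproduces the three overlap conditions studied, with the SFT-only and GRPO-only baselines simply skipping one stage.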
Crucially, the study's findings reveal that keeping the datasets fully separate (0 percent overlap) yields superior results at no additional computational cost. This approach consistently outperforms full overlap across benchmarks such as Gaokao-Formal and PutnamBench, where both compile-pass and semantic-pass accuracy were evaluated using a Large Language Model (LLM) judge.
Why Data Overlap Matters
The ablation study reveals an essential insight: lower overlap between the SFT and GRPO stages is directly associated with higher compilation and semantic accuracy. Specifically, at 0 percent overlap, GRPO achieves a 10.4 percentage point improvement in semantic accuracy over SFT alone on the Gaokao benchmark. In stark contrast, at full overlap, performance stagnates, rendering the GRPO phase essentially redundant.
This raises an important question: why invest time and resources into GRPO if full overlap nullifies its benefits? The findings suggest a need to rethink conventional training paradigms and highlight the potential of treating data overlap as a critical post-training hyperparameter.
The Broader Implications
Beyond the numbers, this research challenges the status quo in AI training methodologies. It calls for a reevaluation of the widely accepted practice of data sharing between SFT and GRPO stages. The dual-metric evaluation, which considers both compile and semantic accuracy, uncovers substantial gaps, over 30 percentage points in some cases, that would remain invisible under a compile-only benchmark approach.
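The dual-metric idea can be made concrete with a short sketch. The results below are hypothetical numbers chosen for illustration (in the study, semantic correctness is judged by an LLM, not a boolean flag): a proof statement can compile yet fail to capture the intended meaning, and only tracking both rates exposes that gap.

```python
def pass_rates(results):
    """Compute compile-pass and semantic-pass rates from per-example results.
    Each result is a (compiles, semantically_correct) pair; a formalization
    can compile yet be semantically wrong, which compile-only scoring hides."""
    n = len(results)
    compile_rate = sum(1 for c, _ in results if c) / n
    semantic_rate = sum(1 for c, s in results if c and s) / n
    return compile_rate, semantic_rate

# Hypothetical batch: 8 of 10 statements compile, but only 5 are faithful.
results = [(True, True)] * 5 + [(True, False)] * 3 + [(False, False)] * 2
c, s = pass_rates(results)
print(f"compile {c:.0%}, semantic {s:.0%}, gap {c - s:.0%}")  # compile 80%, semantic 50%, gap 30%
```

A compile-only benchmark would report 80 percent here and miss the 30-point gap entirely, which is exactly the blind spot the dual-metric evaluation uncovers.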
The paper's key contribution lies in its pioneering exploration of SFT-GRPO data overlap as a training hyperparameter. It's a fresh perspective that may lead to more efficient and effective AI models. As AI continues to integrate deeper into critical applications, optimizing every aspect of model training becomes essential. With data overlap now under scrutiny, the path forward may require more innovative, nuanced strategies.
The study's rigorous approach to evaluating training overlap provides a compelling argument for change. As AI researchers and practitioners weigh these findings, one question lingers: will we see a shift in training practices, or will tradition continue to hold its ground?
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
Hyperparameter: A setting you choose before training begins, as opposed to parameters the model learns during training.