The Delicate Dance of On-Policy Distillation: Promise...

It's no secret that the quest to refine large language models is a challenging endeavor. Two techniques that have recently captured attention are on-policy distillation (OPD) and on-policy self-distillation (OPSD). These methods, in theory, offer dense supervision on trajectories sampled from the model's own policy. Yet, the results have been a mixed bag, with some promising signs tempered by reports of instability.

The Need for Precision in Teacher Choice

On-policy distillation, particularly when applied to tasks like mathematical reasoning, faces a unique hurdle: the sensitivity to the choice of teacher and the formulation of losses. What exactly does this mean? If the teacher model or the loss function isn't chosen with precision, the entire process can falter. This sensitivity underscores the broader issue of alignment, a topic that's been at the forefront of machine learning discussions.

: can we consistently find teachers that push models in the right direction without introducing new biases? This isn't merely a technical detail, but a fundamental challenge that could shape the future trajectory of language models.

OPSD's Limitations and Potential

On-policy self-distillation, or OPSD, offers its own set of challenges. While effective in certain contexts, it struggles when deprived of instance-specific privileged information (PI) during testing. However, when PI represents a common latent rule, magic happens. For instance, a system prompt or an alignment preference can serve as a shared rule that OPSD harnesses effectively.

But what about when PI is instance-specific? Here, OPSD stumbles, unable to effectively aggregate PI-conditioned teachers. This limitation suggests that OPSD, while promising, might require more nuanced applications to truly shine.

Mitigating Failures: The Path Forward

Identifying the failure mechanisms is half the battle. From distribution mismatches and optimization instability to OPSD-specific limitations, these issues are hurdles, not roadblocks. The good news? Solutions are on the horizon. Stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students offer potential pathways to mitigate these challenges.

So, why should readers care? Because the success of these methods could redefine how we think about model training and alignment. With refined models, the possibilities expand, offering more reliable AI systems across applications.

The question that remains is whether researchers and practitioners can bridge the gap between potential and implementation., that persistent issues in AI are eventually overcome with concerted effort.

The Delicate Dance of On-Policy Distillation: Promise and Pitfalls

The Need for Precision in Teacher Choice

OPSD's Limitations and Potential

Mitigating Failures: The Path Forward

Key Terms Explained