Rethinking Self-Distillation with Skill-Conditioned Models

Training large language models (LLMs) often involves self-distillation techniques to improve reasoning. Traditional methods rely heavily on trusted privileged information, such as reference answers or verified reasoning traces. But what if this information could come from a less conventional source?

Breaking Away from the Norm

Enter Skill-Conditioned Gated Self-Distillation (SGSD). This approach redefines how we perceive on-policy self-distillation by shifting the focus from mere imitation to hypothesis validation. Instead of trusting only reference answers or verified traces, SGSD taps into a skill bank. This bank contains experience-derived skills that, while compact and reusable, might also be misleading.

Here's how it works: SGSD retrieves skill-mistake pairs from the skill bank and uses them to build a multi-teacher pool that guides the student model. Each teacher evaluates the student's output, and the verifier plays the key role: it validates each teacher's stance, whether that stance supports a success or suppresses a failure, so the update either reinforces a good outcome or reverses a bad one. The result is a method that distills informative disagreements while ignoring noise.
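To make the gating idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the names `TeacherVote`, `gated_signal`, and `noise_floor`, the use of a log-probability gap as the teacher's "stance", and the rule of zeroing out contradicted stances are all assumptions chosen for illustration. The sketch keeps a teacher's signal only when the verifier's verdict backs it up, and drops disagreements too small to be informative.

```python
from dataclasses import dataclass


@dataclass
class TeacherVote:
    """One skill-conditioned teacher's view of a student response (hypothetical structure)."""
    skill_id: str
    teacher_logprob: float  # log-prob of the response under the skill-conditioned teacher
    student_logprob: float  # log-prob of the same response under the plain student


def gated_signal(votes, verifier_correct, noise_floor=0.1):
    """Return a per-skill weight, keeping only stances the verifier confirms.

    Assumed convention: stance > 0 means the teacher endorses the response
    more than the student does; stance < 0 means it disfavors the response.
    """
    # The verifier says which direction is trustworthy for this response.
    target = 1.0 if verifier_correct else -1.0
    weights = {}
    for vote in votes:
        stance = vote.teacher_logprob - vote.student_logprob
        if abs(stance) < noise_floor:
            # Teacher and student nearly agree: treat the gap as noise.
            weights[vote.skill_id] = 0.0
        elif stance * target > 0:
            # Teacher supports a success, or suppresses a failure: keep it.
            weights[vote.skill_id] = stance
        else:
            # Teacher's stance contradicts the verifier: gate it out.
            weights[vote.skill_id] = 0.0
    return weights


votes = [
    TeacherVote("endorses", teacher_logprob=-1.0, student_logprob=-2.0),
    TeacherVote("disfavors", teacher_logprob=-3.0, student_logprob=-2.0),
    TeacherVote("indifferent", teacher_logprob=-2.05, student_logprob=-2.0),
]
on_success = gated_signal(votes, verifier_correct=True)
on_failure = gated_signal(votes, verifier_correct=False)
```

In this toy setup, a verified-correct response keeps only the endorsing teacher's positive weight, while a verified-incorrect response keeps only the disfavoring teacher's negative weight. A possibly misleading skill therefore contributes nothing unless the verifier confirms its stance.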

Performance Metrics and Comparisons

Strip away the marketing and you get the numbers: SGSD consistently outperforms Group Relative Policy Optimization (GRPO) and remains competitive with answer-conditioned On-Policy Self-Distillation (OPSD), even under weaker assumptions about the privileged information. On benchmarks like AIME24, AIME25, and HMMT25, SGSD boosts performance on models like Qwen3-1.7B by 6.2% over GRPO and 1.7% over OPSD on average. That's not a minor achievement.

Why should we care? Because the training method matters, not just the parameter count. SGSD's approach of validating skills rather than replicating them could reshape how LLMs are trained. It's not just about feeding in more data or adding parameters, but about optimizing the learning process.

The Bigger Picture

The reality is that models need to learn from imperfect information sources. SGSD's methodology could lead to more flexible and resilient AI systems. In a field where efficiency is often prioritized, SGSD offers a fresh perspective. Could this be the future of LLM training?

SGSD's open-source code, available on GitHub, invites the community to explore and innovate further. As AI continues to evolve, methods like SGSD could pave the way for smarter and more adaptive technologies. The reported numbers suggest a promising direction.
