The Illusion of Warm-Start Superiority in Vision-Language Models

In the world of AI, vision-language models (VLMs) have generated plenty of excitement, particularly around two-stage post-training strategies. The theory is simple: a Stage-1 warm-start, usually through supervised fine-tuning (SFT) or on-policy distillation (OPD), followed by Stage-2 reinforcement learning (RL), should enhance model performance. But does the hype hold? Let's dissect this claim with a closer look at the Qwen2.5-VL-7B model.

Stage-1's Limited Impact

Stage-1's supposed magic is under scrutiny. In a study using a 72B VLM as the teacher for OPD, warm-starts barely moved the needle on Geometry3K's internal validation set. Results clustered tightly between 53% and 54%, showing that Stage-1 hardly alters the in-domain outcome.

Out of domain, the picture is less flattering. An over-trained SFT variant dropped 9.5 points on MathVista, a decline that an early-stopped SFT variant reversed with a 2.1-point gain. This suggests that over-training can be detrimental, but a well-timed stop can offset the losses. Yet this doesn't elevate Stage-1 to a breakthrough.
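To make the idea of a "well-timed stop" concrete, here is a minimal sketch of patience-based early stopping over a list of checkpoint scores. The function name `pick_checkpoint`, the `patience` parameter, and the example scores are all hypothetical illustrations; the study's actual stopping criterion is not described here.

```python
def pick_checkpoint(val_scores, patience=2):
    """Return the index of the checkpoint to keep.

    val_scores: held-out scores, one per saved checkpoint, in training order.
    patience:   how many checkpoints without improvement we tolerate
                before giving up (an assumed, illustrative rule).
    """
    best_idx = 0
    best_score = float("-inf")
    for i, score in enumerate(val_scores):
        if score > best_score:
            # New best checkpoint: remember it and keep training.
            best_score = score
            best_idx = i
        elif i - best_idx >= patience:
            # No improvement for `patience` checkpoints: stop here.
            break
    return best_idx


if __name__ == "__main__":
    # Made-up held-out scores that rise, then degrade as SFT over-trains.
    scores = [60.0, 62.0, 61.0, 55.0, 52.0]
    print("keep checkpoint", pick_checkpoint(scores))
```

The point of the sketch is only that the stopping decision should be driven by a held-out benchmark rather than the training loss, since the over-training damage described above shows up out of domain.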

The Entropy Illusion

Here's where it gets interesting: OPD enters RL with notably higher policy entropy than SFT does. Entropy, a measure of the uncertainty in the policy's output distribution, supposedly gives OPD an edge. The higher entropy is visible in the trajectories: it translates into more diverse candidate answers and a slightly higher pass rate at 16 attempts (pass@16), a 2.0 to 5.2 point lead over SFT. But don't get too excited. This advantage dissipates after RL.
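For readers who want to see what "policy entropy" means operationally, the sketch below computes the Shannon entropy of a next-token distribution from raw logits and averages it over the steps of a response. The function names and the toy logits are assumptions for illustration; how the study itself measured entropy is not specified here.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_logits):
    """Average per-token entropy over all decoding steps of a response."""
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)


if __name__ == "__main__":
    # Toy 4-token vocabulary: a confident policy vs. an undecided one.
    peaked = [[8.0, 0.0, 0.0, 0.0], [0.0, 9.0, 0.0, 0.0]]
    flat = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.1, 0.0]]
    print("peaked:", mean_policy_entropy(peaked))
    print("flat:  ", mean_policy_entropy(flat))
```

A policy with flatter distributions scores higher, which is why a higher-entropy start tends to produce more varied samples across 16 attempts.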

Why should you care? Because the supposed OPD benefits are less substantial than they appear. After RL, the endpoint pass rates converge to within 1.1 points of one another. On MathVista, six models landed within 1.2 points of each other.
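The pass rates discussed here can be computed in more than one way. One common choice, shown below as a sketch, is the standard unbiased pass@k estimator: given n sampled answers of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the study used this exact estimator is an assumption, and the function names and counts are illustrative only.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without
    replacement from n generated answers (c of them correct), is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k wrong answers exist, so any k samples include a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Made-up counts: correct answers out of 16 samples on four problems.
    counts = [0, 1, 8, 16]
    print("pass@1 :", benchmark_pass_at_k(counts, n=16, k=1))
    print("pass@16:", benchmark_pass_at_k(counts, n=16, k=16))
```

With k equal to n, a problem counts as solved if any single sample is correct, which is why a more diverse (higher-entropy) policy can lead at pass@16 even when its single-attempt accuracy is no better.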

What Does This Mean?

The takeaway is clear: Stage-1's contribution is tightly tied to the entropy regime, and its benefits are localized and minor. OPD's supposed superiority as an RL warm-start doesn't hold water.

So the next time you encounter claims of revolutionary advancements through two-stage post-training, ask yourself: where's the empirical evidence? This analysis shows that while Stage-1 might play a role, it's not the transformative leap it's often made out to be.
