The Illusion of Warm-Start Superiority in Vision-Language Models
Investigating two-stage post-training for VLMs reveals Stage-1's limited impact. Despite entropy gains, OPD's RL warm-start advantages are localized and uncertain.
world of AI, vision-language models (VLMs) have been buzzing with excitement, particularly around two-stage post-training strategies. The theory is simple: a Stage-1 warm-start, usually through supervised fine-tuning (SFT) or on-policy distillation (OPD), followed by Stage-2 reinforcement learning (RL), should enhance model performance. But does the hype hold? Let's dissect this claim with a closer look at the Qwen2.5-VL-7B model.
Stage-1's Limited Impact
Stage-1's supposed magic is under scrutiny. In a study using a 72B VLM as a teacher for OPD, warm-starts barely moved the needle on Geometry3K's internal validation. Here, results clustered tightly between 53% and 54%, showing that Stage-1 hardly alters the in-domain outcome. If you thought slapping a model on a GPU rental was a convergence thesis, think again.
an over-trained SFT variant witnessed a significant -9.5 point drop on the MathVista dataset, a decline reversed by an early-stopped SFT that gained 2.1 points back. This suggests that over-training can be detrimental, but a well-timed stop can offset the losses. Yet, this doesn't elevate Stage-1 to a breakthrough.
The Entropy Illusion
Here's where it gets interesting: OPD enters RL with notably higher policy entropy than SFT starts. Entropy, or the measure of uncertainty, supposedly gives OPD an edge. This higher entropy is visible in the trajectories, suggesting more diverse answer possibilities and a slightly higher pass rate at 16 attempts, a 2.0 to 5.2 point lead over SFT. But don't get too excited. this advantage dissipates after RL.
Why should you care? Because the supposed OPD benefits are less substantial than they appear. After RL, the endpoint pass rates converge within 1.1 points. On MathVista, six models showed variations within 1.2 points. So, if the AI can hold a wallet, who writes the risk model?
What Does This Mean?
The takeaway is clear: Stage-1's contribution is tightly tied to the entropy regime, with localized and minor benefits. OPD's supposed superiority as an RL warm-start doesn't hold water. The intersection is real. Ninety percent of the projects aren't.
So, when next you encounter claims of revolutionary advancements through two-stage post-training, ask yourself: where's the empirical evidence? This analysis shows that while Stage-1 might play a role, it's not the transformative leap it's often made out to be.
Get AI news in your inbox
Daily digest of what matters in AI.
Key Terms Explained
A technique where a smaller 'student' model learns to mimic a larger 'teacher' model.
The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
Graphics Processing Unit.
A learning approach where an agent learns by interacting with an environment and receiving rewards or penalties.