Revolutionizing LLM Post-Training: The PEAR Approach

Post-training large language models (LLMs) is a complex affair. Traditionally, it involves an offline supervised fine-tuning (SFT) stage followed by online reinforcement learning (RL). Yet optimizing SFT in isolation often leads to suboptimal RL outcomes. That's where PEAR steps in: it reweights the SFT loss so that the resulting checkpoint is a better starting point for RL.

The Misalignment Challenge

Typically, SFT aims solely to maximize its own performance. But here's the rub: after RL training, models initialized from stronger SFT checkpoints can sometimes underperform those starting from weaker ones. Why? The answer lies in the data distribution mismatch between offline SFT and online RL stages. Offline SFT data and RL policy rollouts often diverge, causing friction in the learning process.

This misalignment means models can't fully capitalize on their SFT training during RL. In a field where precision is key, such inefficiencies can't be ignored. So, how do we bridge this gap?

Enter PEAR

PEAR, or Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting, offers a solution. It adjusts the SFT stage to better align with RL requirements. By using importance sampling to reweight the SFT loss, PEAR corrects for the gap between the offline data and the policy's own rollouts, giving a smoother transition between the stages. The method operates at three levels (token, block, and sequence), allowing fine-grained adjustments without hefty training overhead.

Visualize this: you've got a finely tuned engine ready for the racetrack, but it's set up for a different type of race. PEAR acts like a pit crew, recalibrating settings to match the track ahead. That's the kind of alignment we're talking about.
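To make the reweighting idea concrete, here is a minimal sketch of importance weights computed at the token, block, and sequence levels. It is an illustration under stated assumptions, not PEAR's actual formulation: the weight is assumed to be the ratio of the current policy's probability to the probability under whatever produced the SFT data, block-level and sequence-level weights are assumed to be products of token ratios, and the weights are clipped for stability. The names `importance_weights`, `reweighted_sft_loss`, `logp_data`, `block_size`, and `clip_max` are all hypothetical.

```python
import math
from typing import List


def importance_weights(
    logp_policy: List[float],
    logp_data: List[float],
    level: str = "token",
    block_size: int = 2,
    clip_max: float = 2.0,
) -> List[float]:
    """Return one weight per token (assumed form, not the paper's exact rule).

    logp_policy: per-token log-probs under the model being fine-tuned.
    logp_data:   per-token log-probs under the distribution that produced
                 the offline SFT data (assumed to be available).
    """
    if len(logp_policy) != len(logp_data):
        raise ValueError("log-prob lists must have the same length")
    if block_size < 1:
        raise ValueError("block_size must be at least 1")

    log_ratios = [p - q for p, q in zip(logp_policy, logp_data)]
    n = len(log_ratios)

    # The granularity decides how many tokens share a single weight.
    if level == "token":
        span = 1
    elif level == "block":
        span = block_size
    elif level == "sequence":
        span = max(n, 1)
    else:
        raise ValueError(f"unknown level: {level}")

    weights: List[float] = []
    for start in range(0, n, span):
        group = log_ratios[start:start + span]
        # Product of ratios over the group, computed in log space, then clipped.
        w = min(math.exp(sum(group)), clip_max)
        weights.extend([w] * len(group))
    return weights


def reweighted_sft_loss(logp_policy: List[float], weights: List[float]) -> float:
    """Weighted negative log-likelihood, averaged over tokens.

    In a real training loop the weights would be treated as constants
    (no gradient flows through them); this sketch only computes the value.
    """
    if not logp_policy:
        raise ValueError("sequence must not be empty")
    return sum(-w * lp for w, lp in zip(weights, logp_policy)) / len(logp_policy)


if __name__ == "__main__":
    policy = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.8)]
    data = [math.log(0.5), math.log(0.5), math.log(0.25), math.log(0.8)]
    for level in ("token", "block", "sequence"):
        w = importance_weights(policy, data, level=level)
        print(level, w, reweighted_sft_loss(policy, w))
```

The three levels trade variance for fidelity in this sketch: token-level weights react to each token on its own, sequence-level weights multiply every ratio together and so swing the most (hence the clipping), and block-level weights sit in between.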

Proven Gains

The real test of any method is in the results. PEAR consistently enhances post-RL performance compared to standard SFT. Controlled experiments on reasoning games and mathematical tasks, using Qwen 2.5, Qwen 3, and DeepSeek-distilled models, show PEAR's efficacy. On the AIME2025 benchmark, PEAR delivered performance gains of up to 14.6 percent. The chart tells the story. When you see such clear improvement, it's hard to argue against the method's merits.

So, what does this mean for the future of LLM training? It's simple: stop treating SFT and RL as isolated stages, and align them. The trend is clearer when you see it in action. By treating the two stages as one pipeline, PEAR points toward a more holistic approach to LLM post-training.

Will this become the norm? It's a strong possibility. As the demand for more efficient and effective LLMs grows, methods like PEAR could become indispensable. For now, it's a promising step forward in the ongoing quest for AI excellence.