Revisiting KL Divergence: A Deeper Look into RL and LLMs
A recent analysis of reinforcement learning for large language models highlights flaws in commonly used KL divergence estimators, and finds that configurations yielding unbiased gradients can stabilize training and improve performance.
Reinforcement learning (RL) has become a pivotal technique for enhancing the reasoning capabilities of large language models (LLMs). Yet a recent study indicates that current methods for incorporating KL divergence into RL objectives may be flawed. The paper, published in Japanese, argues that the popular practice of approximating the KL divergence with per-sample estimators introduces discrepancies that undermine the effectiveness of training.
The Challenge with KL Divergence
KL divergence functions as a regularization term in RL objectives, keeping the trained model close to a reference policy. However, computing this term exactly over the full distribution of sequences is computationally intractable, so various Monte Carlo estimators are employed instead. The study finds that these estimators, although widely used, may not behave as intended.
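The estimators in question are typically the well-known k1, k2, and k3 Monte Carlo estimators of KL divergence (the article does not name them, so this is an assumption about which estimators are meant). A minimal NumPy sketch over two small, hypothetical discrete distributions shows how they are computed and that k1 and k3 are unbiased in value:

```python
import numpy as np

# Two hypothetical discrete distributions: the policy pi and a reference pi_ref.
pi = np.array([0.5, 0.3, 0.2])
pi_ref = np.array([0.4, 0.4, 0.2])

# Exact KL(pi || pi_ref), available here only because the example is tiny.
kl_exact = np.sum(pi * np.log(pi / pi_ref))

# Per-sample estimators, evaluated at samples x ~ pi, with r(x) = pi_ref(x) / pi(x).
r = pi_ref / pi
k1 = -np.log(r)              # log pi(x) - log pi_ref(x): unbiased, can be negative
k2 = 0.5 * np.log(r) ** 2    # always non-negative, but biased in general
k3 = (r - 1) - np.log(r)     # non-negative AND unbiased

# Expectations under pi: k1 and k3 both recover the exact KL.
print(np.sum(pi * k1), np.sum(pi * k3), kl_exact)
```

The catch the article points to is not the *value* of these estimators but what happens when they are differentiated as a loss term, discussed next.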
The findings suggest that these estimators, when differentiated directly as loss terms, introduce gradient biases that can lead to training instabilities. The researchers ran experiments on models including Qwen2.5-7B, Llama-3.1-8B-Instruct, and Qwen3-4B-Instruct-2507, varying the KL configuration and measuring the impact on performance across different tasks.
Gradient Bias and Model Performance
The study presents a clear conclusion: configurations that yield unbiased gradients significantly improve model performance on both in-domain and out-of-domain tasks. This is not a minor technical detail; it is a key insight that could reshape how RL is applied to LLMs. In the reported benchmarks, models trained with unbiased configurations consistently outperform their counterparts.
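The article does not spell out which configurations the study found to be unbiased, but one widely used unbiased configuration is to fold the k1 log-ratio penalty into the (detached) advantage of the policy-gradient objective, rather than adding a differentiable KL term to the loss. A sketch with hypothetical per-token values:

```python
import numpy as np

# Hypothetical per-token quantities from one sampled response.
logp = np.array([-1.2, -0.7, -2.1])       # log pi_theta(token) under current policy
logp_ref = np.array([-1.0, -0.9, -1.8])   # log pi_ref(token) under reference policy
advantage = np.array([0.5, 0.5, 0.5])     # task advantage, already computed
beta = 0.05                               # KL penalty coefficient (illustrative)

# Unbiased configuration: treat the k1 penalty as part of the reward/advantage.
# In an autograd framework this term would be detached (stop-gradient), so the
# gradient flows only through the log-prob factor of the policy gradient.
kl_term = logp - logp_ref                 # k1 per token
shaped_adv = advantage - beta * kl_term   # no gradient flows through this

# Surrogate loss whose gradient is the unbiased KL-regularized policy gradient.
loss = -(shaped_adv * logp).mean()
print(loss)
```

Because the penalty enters through the sampled reward rather than through a differentiated estimator, the resulting gradient is the score-function estimator of the KL-regularized objective, which is unbiased by construction.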
Yet one must ask: why has this not been addressed sooner? Perhaps it is a case of inertia in the field, where established practices go unchallenged until substantial evidence demands change.
Implications for Off-Policy Settings
Interestingly, the study doesn't stop at on-policy settings. It extends the analysis to off-policy scenarios, where RL training often becomes unstable due to asynchronous rollout-and-update setups. Here the role of KL regularization is even more pronounced: the researchers find that proper KL configurations can indeed stabilize training, a key factor for models trained in such dynamic pipelines.
Western coverage has largely overlooked this nuanced aspect of RL in LLMs. By focusing on empirical observations, this study urges a reevaluation of how KL divergence is integrated into RL frameworks. It's a call to action for researchers and practitioners: reconsider your methods, for the current paradigms might not be as sound as once thought.