Rethinking Data Strategies for Training Small Language Models

In the ongoing quest to refine Small Language Models (SLMs) for reasoning tasks, the traditional SFT-then-RL pipeline looks due for a strategic revamp. A recent paper, published in Japanese, argues that the data strategy should be aligned with the distinct roles of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).

Aligning Data with Training Stages

Traditionally, SLMs are improved through a sequential pipeline in which Supervised Fine-Tuning is followed by Reinforcement Learning. The oversight lies in not considering what kind of data is fed to each stage. The paper argues that SFT should focus on acquiring new reasoning skills the model has not yet mastered, while RL should consolidate skills the model has only partially mastered.

Why is this distinction important? The proposed method organizes training data into stage-specific sets, so that each stage sees the samples it is best suited to learn from. For instance, hard samples in the SFT stage are transformed via a Bridge mechanism, which converts raw reasoning traces into supervision that SLMs can digest more easily. It’s a surgical approach to data handling that could redefine efficiency.
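To make the stage-specific split concrete, here is a minimal Python sketch. It assumes difficulty is measured by a per-sample pass rate (the fraction of sampled answers from the current model that are correct) and uses the cut-offs 0.0 and 1.0; the article does not say how the paper measures difficulty, so the `Sample` class, the `pass_rate` field, and the `route_by_stage` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    # Hypothetical difficulty signal: fraction of k sampled answers
    # from the current model that were judged correct.
    pass_rate: float


def route_by_stage(samples):
    """Split samples into SFT, RL, and already-mastered buckets."""
    sft, rl, mastered = [], [], []
    for s in samples:
        if s.pass_rate == 0.0:
            sft.append(s)        # never solved: new skill, needs supervision
        elif s.pass_rate < 1.0:
            rl.append(s)         # sometimes solved: consolidate with RL
        else:
            mastered.append(s)   # always solved: little training signal left
    return sft, rl, mastered


samples = [Sample("p1", 0.0), Sample("p2", 0.25), Sample("p3", 1.0)]
sft_set, rl_set, mastered_set = route_by_stage(samples)
print([s.prompt for s in sft_set], [s.prompt for s in rl_set])
```

The point of the sketch is only the routing idea: samples the model never solves go to SFT, samples it sometimes solves go to RL. The thresholds would need tuning in any real setup.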

Dealing with Hard Samples

Notably, the method doesn't stop there. For difficult samples that remain unresolved even during RL, a Critique Fine-Tuning process is employed. This involves converting failures into new diagnostic and repair traces, setting the stage for another round of SFT. It’s a classic case of turning failures into stepping stones.
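One way to picture the conversion of failures into training data is sketched below in Python. The record layout, the prompt wording, and the names `FailedRollout` and `to_critique_example` are assumptions made for illustration; the article does not describe the paper's actual format or where the diagnosis and repair text come from.

```python
from dataclasses import dataclass


@dataclass
class FailedRollout:
    prompt: str
    attempt: str    # the model's incorrect reasoning trace
    diagnosis: str  # where the attempt went wrong (source is an assumption)
    repair: str     # corrected reasoning and answer


def to_critique_example(f):
    """Format one failure as an (input, target) pair for the next SFT round."""
    model_input = (
        f"Problem:\n{f.prompt}\n\n"
        f"Attempted solution:\n{f.attempt}\n\n"
        "Identify the error and give a corrected solution."
    )
    target = f"Diagnosis:\n{f.diagnosis}\n\nCorrected solution:\n{f.repair}"
    return {"input": model_input, "target": target}


failure = FailedRollout(
    prompt="What is 17 * 6?",
    attempt="17 * 6 = 17 * 5 + 6 = 91",
    diagnosis="The last term should be 17, not 6.",
    repair="17 * 6 = 17 * 5 + 17 = 85 + 17 = 102",
)
example = to_critique_example(failure)
```

Each resulting pair teaches the model to spot and fix an error it actually made, which is what lets unresolved RL samples feed another round of SFT.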

What the English-language press missed: This approach isn't just about throwing more data at the problem. It's about smartly categorizing and using data to enhance learning at each stage of the pipeline. It’s a refinement that’s not just incremental, but potentially transformative for SLM training.

Performance Metrics that Matter

Experiments across two SLMs and five reasoning benchmarks show consistent improvements over existing SFT, distillation, and RL baselines. In a field where performance gains are often marginal, consistent gains across that many settings suggest a meaningful step forward.

Crucially, this framework underscores the importance of coordinating data difficulty across the SFT and RL stages. But why should we care? As AI systems become more integrated into decision-making processes, their ability to reason accurately and efficiently becomes increasingly important. The proposed strategy paves the way for models that don’t just learn better, but reason better.

The real question is: Will the rest of the field take note and adapt? As this approach gains traction, it could set a new standard for how we train language models, pushing the boundaries of what's possible in machine reasoning.
