Breaking the Bottleneck: Scaling RL with Synthetic Tasks

The bottleneck in reinforcement learning from verifiable rewards (RLVR) is less about the models themselves and more about the infrastructure required for training. Effective RLVR on language models depends on high-quality tasks, and those are resource-intensive to build: each one needs a sandboxed environment, a prompt, and a manually crafted reward function. The economics of scaling this by hand are daunting.

The Economics of Human Curation

Hand-curated tasks are expensive and don't scale to the levels required for effective RL training. As we push the boundaries of what's possible, the need for efficient task generation becomes evident. But can we rely on automatically generated task variants instead of human-authored ones? The substitution rate between these two options remains murky.

Researchers are testing a new approach using gate-filtered augmentations of a small set of hand-authored tasks as stand-ins for additional human curation. They measured the cost-adjusted trade rate, denoted as ρ_cost, between augmented tasks and their human-authored counterparts. Surprisingly, this rate ranges from 1.4x to 11.6x, depending on the cost ratio between human and augmented tasks. So, what does this mean for the industry?
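To make the idea of a cost-adjusted trade rate concrete, here is a minimal sketch in Python. The article does not give the formula behind ρ_cost, so the definition below is an assumption: the human-to-augmented cost ratio divided by the number of augmented tasks needed to match the training value of one human-authored task. The function name, the parameter names, and every number are hypothetical and are not taken from the research.

```python
def cost_adjusted_trade_rate(human_cost, augmented_cost, augmented_per_human):
    """Assumed definition of rho_cost, not the researchers' formula.

    human_cost:          cost of one human-authored task
    augmented_cost:      cost of one gate-filtered augmented task
    augmented_per_human: augmented tasks needed to match the training
                         value of one human-authored task (hypothetical)
    """
    if human_cost <= 0 or augmented_cost <= 0 or augmented_per_human <= 0:
        raise ValueError("all inputs must be positive")
    cost_ratio = human_cost / augmented_cost
    # Values above 1 mean a dollar spent on augmentation buys more
    # training value than a dollar spent on human curation.
    return cost_ratio / augmented_per_human


# Hypothetical inputs: the rate scales with the assumed cost ratio.
for human_cost in (10.0, 50.0, 100.0):
    rho = cost_adjusted_trade_rate(human_cost, augmented_cost=1.0,
                                   augmented_per_human=5.0)
    print(f"cost ratio {human_cost:.0f}:1 -> rho_cost = {rho:.1f}x")
```

Under this assumed definition, the rate depends on the cost ratio you plug in as much as on task quality. That is one plausible reading of why the reported figure is a range rather than a single number.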

Augmentation's Role in Scaling

The findings suggest that replacing some human-authored tasks with augmented ones preserves generalization across a diverse range of benchmarks, including code, instruction following, reasoning, and multi-turn agentic function-calling. This means we might be able to maintain training quality while cutting costs significantly.

Here's what curation actually costs at volume: human authoring isn't just expensive, it's unsustainable for large-scale RL. The unit economics break down once you consider the sheer number of tasks required. So, should the industry lean heavily into synthetic augmentations?
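The unit-economics argument can be sketched as simple arithmetic. The Python snippet below compares the bill for a task pool authored entirely by humans against one where most tasks are augmented. The function name, the per-task prices, and the pool size are all hypothetical placeholders, and the sketch measures cost only; it says nothing about the training quality of either pool.

```python
def curation_cost(num_tasks, human_fraction, human_cost, augmented_cost):
    """Total cost of a task pool with a given share of human-authored tasks.

    All prices are hypothetical per-task figures, not reported values.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be between 0 and 1")
    num_human = round(num_tasks * human_fraction)
    num_augmented = num_tasks - num_human
    return num_human * human_cost + num_augmented * augmented_cost


# Hypothetical pool of 100,000 tasks at $50 per human task, $1 per augmented task.
all_human = curation_cost(100_000, 1.0, human_cost=50.0, augmented_cost=1.0)
mostly_augmented = curation_cost(100_000, 0.1, human_cost=50.0, augmented_cost=1.0)
print(f"all human: ${all_human:,.0f}")
print(f"10% human: ${mostly_augmented:,.0f}")
```

With these placeholder prices, the gap between the two totals widens linearly as the pool grows, which is the sense in which hand curation stops scaling.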

A Way Forward

While some purists might argue against the quality of synthetic tasks, the numbers paint a different picture. If the cost-adjusted trade rate ρ_cost holds, augmentation isn't just a feasible option, it's a necessary one. Follow the GPU supply chain, and you'll see that the real bottleneck isn't the model. It's the infrastructure.

As the demand for more intelligent and versatile models grows, the need for innovative ways to train them becomes more pressing. If the industry doesn't adapt, it risks falling behind. So, will synthetic tasks be the silver bullet that solves our scaling issues? It's too early to say, but the data is promising.
