Bebop: Supercharging RL with Multi-Token Prediction

By Callum Bryce | June 11, 2026

Bebop, a new approach in reinforcement learning, tackles the rollout bottleneck with Multi-Token Prediction. The payoff: up to 1.8x faster async RL training.

JUST IN: Reinforcement learning just got a major upgrade. Meet Bebop, a new game plan for integrating Multi-Token Prediction (MTP) into large-scale RL pipelines. Let's break it down.

The Rollout Bottleneck

Reinforcement learning, the powerhouse behind large language models, hits a snag at the rollout stage, where the model generates the responses it learns from. It's the slowest part of the pipeline. Think of it as a traffic jam in a high-speed chase. MTP, which drafts several tokens ahead for the main model to accept or reject, was supposed to speed things up, but its acceptance rates have been a letdown. They nosedive during training, so the speedup ends up smaller than expected. What's the deal?

Bebop's got answers. It turns out that model entropy, the measure of uncertainty in predictions, throws a wrench in the works. As entropy rises in RL, MTP acceptance rates fall. It's a negative relationship, and Bebop aims to break it.
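Here's a minimal Python sketch of why entropy hurts. It is not Bebop's code. It assumes a greedy draft token is accepted with probability equal to the target model's probability for that token, and it uses made-up logits, with temperature standing in for rising entropy. The function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def greedy_acceptance(target_probs, draft_probs):
    """Chance the target keeps the draft's single most likely token (assumed rule)."""
    draft_token = max(range(len(draft_probs)), key=lambda i: draft_probs[i])
    return target_probs[draft_token]

LOGITS = [4.0, 2.0, 1.0, 0.5]  # hypothetical next-token logits

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(LOGITS, temperature)
    # Best case for the draft: it matches the target distribution exactly.
    print(f"T={temperature}: entropy={entropy(probs):.3f}, "
          f"greedy acceptance={greedy_acceptance(probs, probs):.3f}")
```

Even with a draft that matches the target perfectly, greedy acceptance in this toy setup drops as the distribution flattens, because no single token carries most of the probability anymore.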

Probabilistic Rejection Sampling to the Rescue

Probabilistic rejection sampling is the first fix. It handles the entropy challenge better than old-school greedy draft sampling. And that's not all. Bebop also trains the MTP module with a total variation (TV) loss, ditching the standard cross-entropy or KL objectives. The new objective optimizes acceptance rates directly, and we're talking about a 10% boost, with acceptance hitting up to 95%. That's massive.
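Why would a TV objective target acceptance directly? A rough Python sketch, assuming the standard speculative-sampling rule where a draft token x sampled from the draft distribution q is kept with probability min(1, p(x) / q(x)) under the target distribution p. Under that assumption, expected acceptance works out to exactly 1 minus the total variation distance between p and q. The distributions and function names below are made up for illustration, and the exact form of Bebop's loss isn't spelled out here.

```python
def greedy_acceptance(target_probs, draft_probs):
    """Acceptance when the draft always proposes its single most likely token."""
    draft_token = max(range(len(draft_probs)), key=lambda i: draft_probs[i])
    return target_probs[draft_token]

def rejection_acceptance(target_probs, draft_probs):
    """Expected acceptance when x ~ q is kept with probability min(1, p(x) / q(x))."""
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))

def total_variation(target_probs, draft_probs):
    """Total variation distance: half the L1 gap between the two distributions."""
    return 0.5 * sum(abs(p - q) for p, q in zip(target_probs, draft_probs))

TARGET = [0.4, 0.3, 0.2, 0.1]  # hypothetical high-entropy target distribution
DRAFT = [0.5, 0.2, 0.2, 0.1]   # hypothetical draft distribution

print("greedy acceptance:   ", greedy_acceptance(TARGET, DRAFT))
print("rejection acceptance:", rejection_acceptance(TARGET, DRAFT))
print("1 - TV distance:     ", 1 - total_variation(TARGET, DRAFT))
```

For these toy numbers, rejection sampling accepts about 90% of draft tokens versus 40% for the greedy draft, and the rejection figure matches 1 minus TV. Shrinking TV is the same thing as raising acceptance, which cross-entropy and KL only do indirectly.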

With Bebop's method, you get up to 25% more throughput in reasoning, coding, and agentic tasks. It's like upgrading from a bicycle to a motorbike.
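How does acceptance turn into throughput? A back-of-the-envelope Python sketch using the standard speculative-decoding estimate, not a reproduction of Bebop's measurements. It assumes each draft token is accepted independently with the same rate, ignores the cost of running the draft itself, and uses hypothetical draft lengths, since the number of drafted tokens isn't given here.

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per target-model pass.

    Assumes independent per-token acceptance. The count includes the one
    token the target model always contributes after verification.
    """
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)

# Hypothetical draft lengths and acceptance rates, for illustration only.
for draft_len in (1, 3):
    for rate in (0.6, 0.8, 0.95):
        tokens = expected_tokens_per_step(rate, draft_len)
        print(f"draft_len={draft_len}, acceptance={rate}: {tokens:.2f} tokens/step")
```

The takeaway from the toy numbers: the longer the draft, the more every point of acceptance is worth, so a sagging acceptance rate eats straight into rollout speed.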

Pre-RL MTP Training: The Secret Sauce

Here's where it gets even better. Bebop trains the MTP module before RL starts, using an end-to-end TV loss together with rejection sampling, and that keeps acceptance rates stable throughout RL. Forget about costly online MTP updates. This method, confirmed by experiments, delivers up to 1.8x acceleration in async RL training for models like Qwen3.5, Qwen3.6, and Qwen3.7.

And just like that, the leaderboard shifts. We're looking at a new era in RL training. But here's a question: Why wasn't this done sooner? The labs are scrambling to keep up.

Bebop isn't just a tweak, it's a leap forward. If you're in the game of training large language models, this is your new secret weapon. Get on board or get left behind.
