Redefining Rewards in Long-Form Generation with Tournament-GRPO

Reinforcement learning for long-form generation keeps hitting the same snag: there are no reliable reference answers and no effective automatic metrics, so scores are hard to calibrate across complex responses. Enter Tournament-GRPO, a framework that aims to replace the existing rubric-based methods, which depend heavily on pointwise LLM-as-a-judge scoring.

The Tournament Revolution

The absolute scores used in rubric-based methods struggle to discriminate among similar rollouts and often saturate during optimization. Tournament-GRPO takes a different path: it turns rubric-guided judgments into relative rewards through multi-round tournaments within groups of rollouts for the same query. Candidates are compared against one another, the tournament outcomes are accumulated, and the totals are normalized into group-wise rewards for GRPO training. This shift from absolute to relative rewards is the key contribution.
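To make the pipeline concrete, here is a minimal sketch of the idea, not the paper's implementation. The pairing scheme (random pairs each round), the win-rate accumulation, the mean/standard-deviation normalization, and the names tournament_rewards and length_judge are all assumptions made for illustration. In practice the judge would be a rubric-guided LLM comparison; here a toy judge that prefers longer text stands in for it.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def length_judge(a: str, b: str) -> float:
    """Toy stand-in for a rubric-guided LLM judge (hypothetical).

    Returns 1.0 if `a` wins, 0.0 if `b` wins, 0.5 for a tie.
    """
    if len(a) == len(b):
        return 0.5
    return 1.0 if len(a) > len(b) else 0.0


def tournament_rewards(
    rollouts: Sequence[str],
    judge: Callable[[str, str], float],
    rounds: int = 3,
    seed: int = 0,
    eps: float = 1e-6,
) -> List[float]:
    """Return one group-normalized reward per same-query rollout."""
    rng = random.Random(seed)
    n = len(rollouts)
    wins = [0.0] * n    # accumulated match outcomes
    played = [0] * n    # matches played (differs when n is odd)

    for _ in range(rounds):
        order = list(range(n))
        rng.shuffle(order)
        # Pair neighbours in the shuffled order; with odd n one rollout sits out.
        for a, b in zip(order[0::2], order[1::2]):
            outcome = judge(rollouts[a], rollouts[b])
            wins[a] += outcome
            wins[b] += 1.0 - outcome
            played[a] += 1
            played[b] += 1

    # Win rate per rollout; a rollout that never played gets a neutral 0.5.
    rates = [w / p if p else 0.5 for w, p in zip(wins, played)]

    # Normalize within the group (zero mean, roughly unit variance).
    mu, sigma = mean(rates), pstdev(rates)
    return [(r - mu) / (sigma + eps) for r in rates]
```

Because the rewards come from head-to-head outcomes, two rollouts that would both earn the same near-maximal pointwise score can still be separated. If every match in a group is a tie, all rewards collapse to zero, which is the same degenerate case pointwise scoring hits when it saturates.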

Performance on the Deep Research Bench

Why should you care? Because Tournament-GRPO isn't just a theoretical improvement. In experiments on the Deep Research Bench, it outperforms existing reward-design baselines, improving the overall score by 4.52 points over the strongest baseline. That's not a minor upgrade. The gain suggests that shifting to a tournament-based reward mechanism can make training on long-form tasks more effective.

Beyond the Numbers

The ablation study adds detail beyond the headline result. It shows a promising effectiveness-efficiency trade-off and how the design of the tournament influences training dynamics. This is an important insight, suggesting that the nature of these comparisons matters significantly. Will this tournament approach become the new standard for long-form generation reward systems? It certainly seems poised to be.
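The efficiency side of that trade-off is easy to see with some back-of-the-envelope arithmetic on judge calls. The article does not say which tournament formats the paper ablates, so the two formats below (a full round-robin and a fixed number of paired rounds) and both function names are hypothetical, chosen only to show how quickly comparison counts can diverge.

```python
def round_robin_comparisons(group_size: int) -> int:
    """Judge calls if every rollout meets every other rollout once."""
    return group_size * (group_size - 1) // 2


def paired_round_comparisons(group_size: int, rounds: int) -> int:
    """Judge calls if each round pairs rollouts off (one sits out when odd)."""
    return (group_size // 2) * rounds


# For a group of 8 rollouts:
print(round_robin_comparisons(8))       # 28
print(paired_round_comparisons(8, 3))   # 12
```

For comparison, pointwise scoring of the same group would take one judge call per rollout, so 8. A tournament buys finer discrimination at the cost of extra judge calls, and the format determines how many.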

So, what's still missing? The results are impressive, but how the method performs across varied real-world applications remains to be tested. The paper's key contribution is clear: a novel approach that could reset the playing field for reinforcement learning in open-ended tasks. Broader adoption and testing will reveal its true impact. Code and data are available for those eager to experiment further.
