Beam Search: When More Isn't Better for LLMs
Beam search might seem like the golden ticket for LLM reasoning, but when is wider too wide? New insights on the balance between beam width and output quality.
JUST IN: Wider beam search, often seen as the holy grail for large language model (LLM) reasoning, isn't always the answer. Sure, expanding your search might seem like the logical step to boost performance, but when do you hit the point of diminishing returns? That's the question researchers are dissecting, and the findings might surprise you.
The Bias in Beam Width
Sources confirm: stretching your beam width too far can actually degrade the quality of your output. Yeah, you heard that right. The promise of wider search can backfire thanks to a pesky little thing called overestimation bias. When your scorer's outputs are noisy, taking the max over a larger candidate pool increasingly rewards lucky noise rather than true quality, and that bias grows as the pool expands. There's a threshold beam width, call it the critical width, beyond which your LLM's performance takes a nosedive.
What's behind this phenomenon? It's all about the signal-to-noise ratio. The critical width, denoted $\chi$, is governed by this ratio: it grows exponentially with $(\Delta/\sigma)^2$, where $\Delta$ is the quality advantage of correct outputs over incorrect ones and $\sigma$ is the noise level of the scorer.
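To see why the critical width scales this way, a back-of-the-envelope sketch helps. Assuming i.i.d. Gaussian scorer noise (our simplifying assumption here, not necessarily the paper's exact model), the expected maximum of $n$ independent $\mathcal{N}(0, \sigma)$ draws grows like $\sigma\sqrt{2\ln n}$, and it catches up with the quality gap $\Delta$ once $n \approx \exp((\Delta/\sigma)^2/2)$, which is exponential in $(\Delta/\sigma)^2$:

```python
import math
import random

def expected_max_noise(sigma, n, trials=20000):
    """Monte Carlo estimate of E[max of n i.i.d. N(0, sigma) draws]:
    the overestimation bias a noisy scorer injects into a pool of n candidates."""
    return sum(max(random.gauss(0.0, sigma) for _ in range(n))
               for _ in range(trials)) / trials

def heuristic_critical_width(delta, sigma):
    """Pool size at which sigma * sqrt(2 * ln n) catches up to delta,
    i.e. where the bias of the noisy max swamps the true quality gap."""
    return math.exp((delta / sigma) ** 2 / 2.0)

# The bias grows with pool size even though the noise is zero-mean...
print(expected_max_noise(1.0, 2), expected_max_noise(1.0, 16))
# ...and the break-even width explodes as signal-to-noise improves.
print(heuristic_critical_width(0.5, 1.0), heuristic_critical_width(2.0, 1.0))
```

At $\Delta/\sigma = 0.5$ the heuristic break-even width is barely above one candidate, while at $\Delta/\sigma = 2$ it is several times larger, matching the exponential scaling the researchers describe.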
Testing the Theory
Researchers didn't just theorize. They put the idea to the test with three 7-billion-parameter models across ten domains of the MR-BEN benchmark. The results are wild. Perplexity scoring, with its high noise levels, showed that widening beyond a beam width of one isn't just useless, it's counterproductive. On the flip side, PRM scoring, which boasts lower noise, finds its sweet spot at a beam width of four or higher, delivering gains of up to 8.9 percentage points. Same model, same algorithm, different scorers, and bam, your ideal beam width lands on opposite ends of the spectrum.
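The qualitative pattern is easy to reproduce in a toy simulation (this is an illustrative model of ours, not the paper's experimental setup): suppose the correct continuation sits at a Geometric(p) rank in the generator's ordering, so a wider beam is more likely to contain it (coverage), but the noisy scorer must then pick it out from more distractors (selection). Low noise lets coverage win; high noise lets the distractors win.

```python
import random

def accuracy(width, p=0.3, delta=1.0, sigma=1.0, trials=20000):
    """Toy beam-search accuracy: the beam of `width` candidates contains the
    correct answer with prob 1 - (1 - p)**width (coverage).  If present, it
    wins only when its noisy score delta + N(0, sigma) beats the best of the
    width - 1 distractors' noisy scores (selection)."""
    wins = 0
    for _ in range(trials):
        # coverage: is the correct candidate inside the beam at all?
        if random.random() >= 1 - (1 - p) ** width:
            continue
        correct = delta + random.gauss(0.0, sigma)
        best_wrong = max((random.gauss(0.0, sigma) for _ in range(width - 1)),
                         default=float("-inf"))
        wins += correct > best_wrong
    return wins / trials

for sigma, label in [(0.3, "low-noise scorer (PRM-like)"),
                     (3.0, "high-noise scorer (perplexity-like)")]:
    print(label, {w: round(accuracy(w, sigma=sigma), 3) for w in (1, 2, 4, 8, 16)})
```

Under these assumptions the low-noise scorer keeps improving as the beam widens, while the high-noise scorer peaks at a width of one or two and then falls off, mirroring the perplexity-versus-PRM split in the experiments.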
Why This Matters
So, what's the takeaway here? It's simple but bold: not all beam widths are created equal. The choice of scorer is the hidden hand guiding your beam width decision. It's not merely about wider being better. It's about how much scorer noise you're willing to tolerate. And just like that, the leaderboard shifts.
Here's a thought: Why are we so hooked on expanding horizons without considering the noise that comes with it? This analysis is clear. The real challenge is choosing the right beam width based on the scorer's signal-to-noise ratio. Don't just follow the crowd. Think critically about what's affecting your model's output quality. The labs are scrambling to recalibrate their approaches. The question is, will you?
Key Terms Explained
Beam search: A decoding strategy that keeps track of multiple candidate sequences at each step instead of just picking the single best option.
Bias: In AI, bias has two meanings: a systematic statistical error in an estimate (the sense used here, as in overestimation bias), and unwanted skew in a model's behavior.
Large language model (LLM): An AI model with billions of parameters, trained on massive text datasets, that understands and generates human language.