Mixed Batching Faces a Bandwidth Battle

JUST IN: Mixed batching (MB), the go-to strategy for large language model (LLM) inference that packs prefill and decode tokens into the same batch, might not be the golden ticket it once was. New data has thrown a wrench into the works. If you're running high-bandwidth GPUs, you're likely in the clear. But on bandwidth-constrained GPUs, exclusive batching (EB), which runs prefill and decode in separate phases, is ready to steal the spotlight.

Breaking Down the Bottleneck

Here's the kicker: on the H200, with a whopping 4.8 TB/s of memory bandwidth, mixed batching only gets bogged down when decode tokens exceed 80% of the batch. Sounds decent, right? But hold up. On the RTX PRO 6000, where bandwidth drops to 1.792 TB/s, that threshold falls to a mere 20%. With roughly 2.7x less bandwidth, mixed batching starts to struggle at a quarter of the decode share.
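To make those two data points concrete, here is a minimal Python sketch that treats the reported thresholds as a lookup table. The names `GPUS`, `decode_share`, and `mixed_batching_degrades` are hypothetical, and the sketch does not try to interpolate for other hardware. Only the bandwidth figures and the 80% and 20% thresholds come from the numbers above.

```python
# Figures reported in the article: memory bandwidth (TB/s) and the share of
# decode tokens in a batch above which mixed batching gets bogged down.
GPUS = {
    "H200": {"bandwidth_tbps": 4.8, "decode_share_threshold": 0.80},
    "RTX PRO 6000": {"bandwidth_tbps": 1.792, "decode_share_threshold": 0.20},
}


def decode_share(prefill_tokens: int, decode_tokens: int) -> float:
    """Fraction of the batch's tokens that are decode tokens."""
    total = prefill_tokens + decode_tokens
    if total == 0:
        raise ValueError("batch is empty")
    return decode_tokens / total


def mixed_batching_degrades(gpu: str, prefill_tokens: int, decode_tokens: int) -> bool:
    """True when the decode share exceeds the reported threshold for this GPU."""
    share = decode_share(prefill_tokens, decode_tokens)
    return share > GPUS[gpu]["decode_share_threshold"]


if __name__ == "__main__":
    # A batch with 1536 prefill tokens and 512 decode tokens (25% decode).
    for gpu in GPUS:
        print(gpu, mixed_batching_degrades(gpu, 1536, 512))
```

The same 25%-decode batch sits comfortably under the H200 threshold but over the RTX PRO 6000 one, which is the whole story in miniature.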

So what's the deal? It all comes down to GPU memory bandwidth, model size, and your workload's exact composition. The researchers behind the data have derived a closed-form condition that pinpoints when exclusive batching overtakes mixed batching, along with optimal phase-switching thresholds.

EB's Surge on Low Bandwidth

Optimized exclusive batching can crank up throughput by a massive 41.9% on bandwidth-challenged GPUs. That's no small feat. Meanwhile, mixed batching continues to shine on high-bandwidth hardware, especially with larger models.

But who wants to switch manually? Enter EB+, a hybrid scheduler that dynamically toggles between exclusive and mixed batching with no manual intervention. Under shifting traffic conditions, it consistently delivers top or near-top throughput, beating mixed batching by up to 36.4%.
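The article does not spell out how EB+ decides when to switch, so the following is only a toy illustration of the general idea of toggling modes as traffic shifts, not the EB+ policy itself. It assumes the switch is driven by the decode share of each incoming batch compared against a fixed threshold. The class name `ToyHybridScheduler` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToyHybridScheduler:
    """Toy mode switcher; NOT the actual EB+ algorithm (assumed rule)."""

    decode_share_threshold: float  # e.g. 0.20, the RTX PRO 6000 figure
    mode: str = "MB"               # start in mixed batching

    def step(self, prefill_tokens: int, decode_tokens: int) -> str:
        """Pick a mode for the next batch and remember it."""
        total = prefill_tokens + decode_tokens
        if total > 0:
            share = decode_tokens / total
            # Assumed rule: go exclusive once decode tokens dominate
            # past the threshold, otherwise stay mixed.
            self.mode = "EB" if share > self.decode_share_threshold else "MB"
        # An empty step keeps whatever mode was last chosen.
        return self.mode


if __name__ == "__main__":
    # (prefill_tokens, decode_tokens) per scheduling step: traffic drifts
    # from prefill-heavy to decode-heavy.
    trace = [(1800, 200), (1000, 1000), (100, 1900)]
    scheduler = ToyHybridScheduler(decode_share_threshold=0.20)
    print([scheduler.step(p, d) for p, d in trace])
```

A real scheduler would presumably also weigh model size and use the derived phase-switching thresholds rather than a single fixed cutoff.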

The Future of Inference Scheduling

This changes the landscape. But here's the burning question: why stick with a one-size-fits-all strategy when EB could be the better fit on low-bandwidth gear? If you're not married to mixed batching, it's time to reevaluate. The rankings shift with these findings, so adapt or get left behind.

In a world where every byte counts, your choice of batching strategy could make or break your throughput. So, are you sticking with mixed batching, or is it time to give exclusive batching its due?
