Rethinking LLM Scheduling: Why Output Length Prediction Falls Short
In large language model inference, predicting a single output length for scheduling is flawed. A new approach using Tail Inflated Expectation could improve performance.
Scheduling inference in large language models (LLMs) often relies on predicting output lengths to manage requests effectively. The traditional approach, built on the shortest job first (SJF) principle, prioritizes requests with shorter predicted outputs to cut down on average wait times. But this method rests on a flawed assumption about how LLMs actually generate text.
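To make the baseline concrete, here is a minimal sketch of SJF scheduling over a batch of requests. The request format and predictor are assumptions for illustration; the function simply orders requests by a predicted output length, which is exactly where a bad point prediction can hurt.

```python
import heapq

def sjf_schedule(requests):
    """Order requests shortest-predicted-job-first.

    `requests` is a list of (request_id, predicted_length) pairs.
    The length predictor itself is out of scope here; SJF trusts
    whatever single number it produces.
    """
    heap = [(pred_len, req_id) for req_id, pred_len in requests]
    heapq.heapify(heap)
    order = []
    while heap:
        _, req_id = heapq.heappop(heap)
        order.append(req_id)
    return order

# Requests with predicted lengths 120, 30, and 500 tokens:
print(sjf_schedule([("a", 120), ("b", 30), ("c", 500)]))  # ['b', 'a', 'c']
```

If the predictor says 30 tokens but request "b" actually runs for 2,000, every request queued behind it pays the price.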
Why Single Predictions Fail
The reality is that predicting a single output length doesn't align with the stochastic nature of LLM decoding. Outputs aren't fixed; they're inherently uncertain, determined by when the end-of-sequence token happens to be sampled. A point prediction imposes certainty on a process that is anything but certain.
Empirically, output lengths follow a heavy-tailed distribution: actual outputs can run far longer than any single prediction. To accommodate this, the authors propose distribution-based prediction, specifically modeling output length with a log-t distribution, which is better aligned with how these models actually behave.
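The heavy tail is easy to see by sampling. Below is a small sketch of a log-t model: the log of the output length follows a Student's t distribution. The parameters (mu, sigma, nu) are illustrative values, not numbers from the paper; the point is the gap between the median and the 99th percentile.

```python
import math
import random

def sample_log_t(mu, sigma, nu, rng):
    """Draw one output length L with log L = mu + sigma * T,
    where T ~ Student's t with nu degrees of freedom.
    T is built from a standard normal over a chi-square(nu)."""
    z = rng.gauss(0, 1)
    chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(nu))
    t = z / math.sqrt(chi2 / nu)
    return math.exp(mu + sigma * t)

rng = random.Random(0)
samples = sorted(sample_log_t(mu=4.0, sigma=0.8, nu=3, rng=rng)
                 for _ in range(10_000))
median = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"median ~ {median:.0f} tokens, 99th percentile ~ {p99:.0f} tokens")
```

With these illustrative parameters the 99th percentile lands many multiples above the median, which is exactly the scenario where a single predicted length misleads the scheduler.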
The Tail Inflated Expectation Solution
Enter the Tail Inflated Expectation (TIE). Rather than sticking with a rigid length estimate, TIE adjusts for the risk of longer outputs by factoring in tail probabilities. It's a simple metric, but the reported gains are substantial: TIE reduces per-token latency by 2.31 times for online inference, and for offline data generation, throughput improves by 1.42 times. That's not a marginal gain; it's a significant leap in efficiency.
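The article doesn't reproduce the paper's exact formula, so the sketch below only captures the idea: instead of scoring a request by a plain expected length, give extra weight to probability mass in the tail, so requests that might run long are scheduled as if they were longer. The threshold and inflation factor are hypothetical knobs, not values from the paper.

```python
def tail_inflated_expectation(samples, threshold, inflate=2.0):
    """Illustrative TIE-style score (not the paper's exact formula):
    a weighted mean of sampled lengths, where samples beyond a tail
    threshold count `inflate` times as much."""
    weights = [inflate if s > threshold else 1.0 for s in samples]
    return sum(w * s for w, s in zip(weights, samples)) / sum(weights)

# A request whose sampled lengths are mostly short but occasionally long:
draws = [10, 10, 10, 100]
plain_mean = sum(draws) / len(draws)               # 32.5
tie_score = tail_inflated_expectation(draws, threshold=50)  # 46.0
print(plain_mean, tie_score)
```

Ranking requests by a score like this pushes tail-risky jobs later in the queue, which is what protects the short jobs behind them.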
But why should anyone care about these metrics? Because they fundamentally change how we can handle LLM scheduling. Faster processing means more efficient handling of requests, directly impacting user experience and resource allocation.
What's Next?
So, are single length predictions obsolete? For scheduling purposes, it certainly seems so. Adapting to the stochastic nature of LLM decoding with TIE could set a new standard for efficient inference scheduling.
As AI continues to evolve, it's clear that clinging to outdated methods won't suffice. This shift towards a distribution-based approach could redefine how we think about scheduling in AI systems. The question is, will the industry embrace this change or stick with the status quo?