Decoding the Limits: GPUs and Physical AI Inference
NVIDIA GPUs face bottlenecks in physical AI workloads, revealing that faster memory doesn't equal faster results. CUDA Graphs show promise but come with caveats.
Physical AI systems such as robots and autonomous vehicles run inference workloads that differ from cloud serving. They typically perform single-stream, batch-1 autoregressive decoding, where every robot or camera feed waits on the next token. This regime is usually described as memory-bandwidth-bound: each token requires streaming the model weights from GPU memory, so per-token latency should scale inversely with peak memory bandwidth. But there's more to it than just bandwidth.
GPU Performance Dissected
We took a deep dive into three 7B-to-8B-parameter grouped-query attention (GQA) transformers across four NVIDIA GPUs: the H100 SXM5, A100-80GB SXM4, L40S, and L4. Our tests spanned context lengths from 2048 to 16384 tokens, yielding 44 valid cells (model, GPU, and context-length combinations) under a bf16 scaled dot-product attention (SDPA) setup. What did we find? The achieved fraction of peak memory bandwidth declines as peak bandwidth rises. On a key test, Qwen-2.5-7B at ctx=2048, the L4 reaches about 81% of the speed its analytic memory floor allows. The H100 reaches only 27%.
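To make the idea of an analytic memory floor concrete, here is a minimal sketch of how such a floor could be estimated. It assumes the floor is simply the weight bytes read per token divided by peak bandwidth, ignoring KV-cache and activation traffic; the study's exact definition may differ. The parameter count (about 7.6 billion for Qwen-2.5-7B) and the peak bandwidth figures (300 GB/s for the L4, 3.35 TB/s for the H100 SXM5) are assumptions taken from public spec sheets, not from the measurements above.

```python
# Rough estimate of a batch-1 decode "memory floor": the time needed just to
# stream the model weights from GPU memory once per generated token.
# Assumption: KV-cache and activation traffic are ignored.

def memory_floor_ms(n_params: float, bytes_per_param: float, peak_bw_gb_s: float) -> float:
    """Lower bound on per-token latency, in milliseconds."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (peak_bw_gb_s * 1e9) * 1e3


def achieved_fraction(floor_ms: float, measured_ms: float) -> float:
    """Share of the memory-bound limit a measured step time actually reaches."""
    return floor_ms / measured_ms


N_PARAMS = 7.6e9   # assumed parameter count for Qwen-2.5-7B
BF16_BYTES = 2     # bf16 stores each weight in 2 bytes
PEAK_BW_GB_S = {"L4": 300.0, "H100 SXM5": 3350.0}  # assumed spec-sheet values

for gpu, bw in PEAK_BW_GB_S.items():
    floor = memory_floor_ms(N_PARAMS, BF16_BYTES, bw)
    print(f"{gpu}: floor ~ {floor:.2f} ms/token")
```

Under these assumptions the L4 floor comes out near 51 ms per token. Dividing a floor by a measured step time with `achieved_fraction` gives the kind of percentage quoted above; if the 62.32 ms L4 bf16 baseline cited further down refers to the same configuration (the article does not say), the result is close to the reported 81%.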
This data suggests that physical AI decoding remains memory-dominated, yet faster memory does not deliver proportional latency gains. Here's where CUDA Graphs come into play: they capture a sequence of kernel launches once and replay it as a single unit, cutting per-kernel launch cost. On the H100 at ctx=2048, CUDA Graphs sped up decode by 1.259x across ten fresh sessions. The same intervention on the L4 yielded only 1.028x. These findings point to a launch-side overhead that is visible on faster GPUs but largely hidden on slower, bandwidth-bound ones.
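The two speedups can be converted into the share of each decode step that CUDA Graphs eliminated. The sketch below uses only the figures quoted above; the helper name is illustrative, and reading the removed time as launch-side overhead is the article's interpretation rather than something the arithmetic proves.

```python
# A speedup s means the new step time is old / s, so the share of the
# original step time that disappeared is 1 - 1/s.

def fraction_removed(speedup: float) -> float:
    """Share of the original step time eliminated by a given speedup."""
    return 1.0 - 1.0 / speedup


SPEEDUPS = {"H100": 1.259, "L4": 1.028}  # CUDA Graphs speedups at ctx=2048

for gpu, s in SPEEDUPS.items():
    print(f"{gpu}: {fraction_removed(s):.1%} of step time removed")
```

This works out to roughly 20.6% of the step on the H100 against about 2.7% on the L4, which is why the same optimization looks significant on one GPU and negligible on the other.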
What’s the Real Impact?
So, what does this mean for deployment? Memory savings only matter if the runtime can realize them. On the L4, bf16 decode runs close to the memory floor at 62.32 ms/step, so moving from 16-bit to 4-bit weights should in principle cut weight traffic by 4x. Common quantized paths fall well short of that: bnb-nf4 takes 59.36 ms/step (about 1.05x faster) and AutoAWQ+Marlin 45.24 ms/step (about 1.38x). GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step, about 3.6x faster than bf16.
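A quick way to see how much of the theoretical gain each quantized path captures is to compare its speedup with the 4x ideal. The sketch below uses the step times quoted above; treating 4x as the ceiling is a simplifying assumption that ignores quantization scales, KV-cache reads, and any compute-side cost.

```python
# Compare each 4-bit path on the L4 against the bf16 baseline and against
# the ideal speedup from reading 4-bit instead of 16-bit weights.

BF16_MS = 62.32          # bf16 baseline, ms/step
IDEAL_SPEEDUP = 16 / 4   # assumed ceiling: 16-bit -> 4-bit weight traffic

QUANT_MS = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}


def speedup(baseline_ms: float, ms: float) -> float:
    """How many times faster a path is than the baseline."""
    return baseline_ms / ms


for name, ms in QUANT_MS.items():
    s = speedup(BF16_MS, ms)
    print(f"{name}: {s:.2f}x ({s / IDEAL_SPEEDUP:.0%} of the ideal {IDEAL_SPEEDUP:.0f}x)")
```

By this measure bnb-nf4 and AutoAWQ+Marlin capture roughly a quarter and a third of the ideal gain, while GPTQ+ExLlamaV2 captures about 90% of it.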
The takeaway? Faster memory alone won't solve your latency woes. It's the combination of memory savings and runtimes optimized to exploit them that moves the needle. Are we judging GPUs too harshly, or is it time to rethink our approach to physical AI inference?