Local MoE Inference: Closing the Gap with Cloud Performance

Local deployment of large Mixture-of-Experts (MoE) models has long struggled to match the service quality of cloud-scale deployments. Even under low-concurrency workloads, local setups fall short. The benchmarks point to four significant gaps.

The Gaps Exposed

First, local setups rely on capacity-reduced models, such as quantized or distilled versions, which give up some of the original model's quality. Second, they struggle to meet a 30-second time-to-first-token (TTFT) target once the context grows beyond 12K tokens. Third, decode throughput falls below the baseline, coming in under 20 tokens per second. Finally, concurrency is poor under mixed workloads.

A Solution in Hybrid Systems

Enter a new CPU-GPU hybrid system that promises to close these gaps without massive datacenter infrastructure. How? By meeting cloud-level service-level objectives (SLOs) on dual-socket commodity CPUs and consumer-grade GPUs. This isn't just about keeping up; it's about redefining the potential of local MoE inference.

The system combines several techniques. Stream-loading prefill raises prefill throughput to 1,200 tokens per second, enough to process a 32K-token prompt within 30 seconds. Distributed stream-loading with SmallEP expert parallelism pushes this to 1,800 tokens per second, handling 45K-token prompts on a pair of RTX 5090s. In addition, a dual-batch attention-MoE overlap scheme sustains concurrency, adding less than 15% latency while delivering a 50% throughput gain.
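The link between prefill throughput and the 30-second TTFT target is simple arithmetic, and a minimal sketch makes it easy to check. The function names here are illustrative, "K" is assumed to mean 1,000 tokens, and prefill time is treated as the only contributor to TTFT (queueing and the first decode step are ignored).

```python
TTFT_BUDGET_S = 30.0  # the 30-second TTFT target quoted in the article


def prefill_seconds(prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to prefill a prompt, assuming constant prefill throughput."""
    return prompt_tokens / prefill_tokens_per_s


def max_prompt_tokens(prefill_tokens_per_s: float,
                      budget_s: float = TTFT_BUDGET_S) -> int:
    """Largest prompt that still fits inside the TTFT budget."""
    return int(prefill_tokens_per_s * budget_s)


if __name__ == "__main__":
    # Throughput and prompt sizes are the figures reported in the article.
    cases = [
        ("stream-loading prefill", 1200, 32_000),
        ("distributed stream-loading + SmallEP", 1800, 45_000),
    ]
    for label, rate, prompt in cases:
        t = prefill_seconds(prompt, rate)
        print(f"{label}: {prompt} tokens in {t:.1f} s, "
              f"ceiling {max_prompt_tokens(rate)} tokens in {TTFT_BUDGET_S:.0f} s")
```

Under these assumptions both reported configurations finish prefill inside the budget with a few seconds to spare, which leaves room for the overheads the sketch ignores.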

Optimizing Every Component

Consider the AVX-512-optimized FP8 GEMV kernel. It enables native FP8 inference on the CPU and cuts CPU latency by 4-5x. Combined with fine-grained CPU parallelism, it delivers 28 tokens per second on INT4 DeepSeek-V3 and 21.5 tokens per second on DeepSeek-V3 at its original FP8 precision, both above the 20 tokens-per-second mark. This is more than an incremental improvement; it's a leap forward.
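To make "FP8 GEMV" concrete, here is a rough reference sketch of what such a kernel computes, not how the system implements it. The article does not say which FP8 variant or scaling granularity is used, so the E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7) and the single per-matrix scale are assumptions, and the names decode_e4m3 and fp8_gemv are hypothetical. A real kernel would use AVX-512 intrinsics in native code; this pure-Python version shows only the semantics.

```python
from typing import List, Sequence


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte into a Python float (assumed format)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return float("nan")  # E4M3 has no infinities; this bit pattern is NaN
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal values and zero
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


# A 256-entry lookup table turns each weight decode into one indexed load.
E4M3_LUT = [decode_e4m3(b) for b in range(256)]


def fp8_gemv(weights: Sequence[bytes], scale: float,
             x: Sequence[float]) -> List[float]:
    """Compute y = scale * (W @ x), with each row of W stored as E4M3 bytes."""
    out = []
    for row in weights:
        acc = 0.0
        for w_byte, xi in zip(row, x):  # iterating bytes yields ints 0..255
            acc += E4M3_LUT[w_byte] * xi
        out.append(scale * acc)
    return out


if __name__ == "__main__":
    # Two rows of two weights: [1.0, 2.0] and [-1.0, 0.0] in E4M3 encoding.
    W = [bytes([0x38, 0x40]), bytes([0xB8, 0x00])]
    print(fp8_gemv(W, scale=0.5, x=[3.0, 4.0]))
```

Keeping the weights as single bytes means each weight moves one byte from memory rather than two or four, which matters because a matrix-vector product is typically limited by memory bandwidth rather than arithmetic.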

So, what's the takeaway? This system delivers cloud-level quality of service for flagship MoE models on consumer platforms, maintaining original precision and inference quality. Cost-effective, high-quality access without the need for datacenter-level infrastructure? That's a big deal for local deployment.

Here's a question to ponder: If local systems can now rival cloud performance, how will this reshape the industry's dependence on cloud providers? It's a shift that could democratize access to powerful AI, making it more affordable and widespread.