ReMoE: Saving Memory in Mixture-of-Experts Models
ReMoE enhances Mixture-of-Experts model efficiency by improving token-wise expert reuse, reducing I/O overhead, and increasing throughput.
Fine-grained Mixture-of-Experts (MoE) models are valued for their ability to activate only a fraction of experts per token, optimizing computation while preserving model capacity. Yet their effectiveness is hampered when memory is tight. Experts that are not resident in the fast cache must be fetched from slower memory or storage, resulting in frequent cache evictions and high I/O costs.
Introducing ReMoE
ReMoE, a novel router fine-tuning framework, addresses these bottlenecks by enhancing token-wise expert reuse. It nudges the router to favor recently used experts, leading to more stable routing that aligns better with cache constraints. In essence, ReMoE boosts short-term expert reuse, cutting the number of expert fetches from slower storage without adding inference-time computation.
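To make the intuition concrete, here is a minimal toy simulation of why favoring recently used experts reduces cache misses. It is not ReMoE's method: ReMoE fine-tunes the router ahead of time, whereas this sketch simply adds a hypothetical `reuse_bonus` to random router scores as a stand-in for a router that has learned to prefer recent experts. The function name, the LRU cache policy, and all parameter values are assumptions made for illustration.

```python
import random
from collections import OrderedDict


def simulate_misses(num_tokens=200, num_experts=16, top_k=2,
                    cache_size=4, reuse_bonus=0.0, seed=0):
    """Count expert-cache misses for a toy router over a token stream."""
    rng = random.Random(seed)
    cache = OrderedDict()   # LRU cache: keys are expert ids held in fast memory
    previous = set()        # experts selected for the previous token
    misses = 0
    for _ in range(num_tokens):
        # Toy router scores: Gaussian noise stands in for real router logits.
        scores = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        # Hypothetical reuse bias (an assumption, not ReMoE's training
        # objective): favor the experts used for the previous token.
        for e in previous:
            scores[e] += reuse_bonus
        chosen = sorted(range(num_experts),
                        key=lambda e: scores[e], reverse=True)[:top_k]
        for e in chosen:
            if e in cache:
                cache.move_to_end(e)           # hit: mark as most recently used
            else:
                misses += 1                    # miss: expert must be fetched
                cache[e] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
        previous = set(chosen)
    return misses


if __name__ == "__main__":
    print("misses without reuse bias:", simulate_misses(reuse_bonus=0.0))
    print("misses with reuse bias:   ", simulate_misses(reuse_bonus=1.0))
```

In the extreme case of a very large bonus, the router keeps picking the same experts after the first token, so the only misses are the initial cold-cache fetches. A real router has to balance reuse against output quality, which is the trade-off the fine-tuning is meant to handle.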
Performance Gains
Empirical evaluations on DeepSeek and Qwen models back this up. The framework increases expert reuse by 26% while maintaining performance on downstream tasks. More importantly, real-system tests show an 8.4% increase in output throughput under vLLM GPU-CPU expert offloading. On a Jetson Orin NX running llama.cpp, it cuts time per output token (TPOT) by 43.6% to 49.8%, which corresponds to a 1.77x to 1.99x decode speedup across diverse workloads.
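The TPOT reductions and the decode speedups are two views of the same measurement: if the time per output token drops by a fraction r, decoding runs 1 / (1 - r) times faster. A small check of that arithmetic, where the helper name `decode_speedup` is purely illustrative:

```python
def decode_speedup(tpot_reduction):
    """Convert a fractional TPOT reduction (e.g. 0.436) into a speedup factor."""
    if not 0.0 <= tpot_reduction < 1.0:
        raise ValueError("tpot_reduction must be in [0, 1)")
    # New time per token is (1 - r) times the old one, so speed scales by 1 / (1 - r).
    return 1.0 / (1.0 - tpot_reduction)


if __name__ == "__main__":
    for r in (0.436, 0.498):
        print(f"{r:.1%} lower TPOT -> {decode_speedup(r):.2f}x decode speedup")
```

Plugging in the reported 43.6% and 49.8% reductions reproduces the quoted 1.77x and 1.99x figures.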
Why It Matters
For researchers and developers grappling with memory constraints, ReMoE offers a clear path to better performance without adding inference-time computation. If you're looking to eke out more efficiency from MoE models on memory-limited hardware, ReMoE might be the solution.
With its significant reductions in I/O overhead and increased throughput, the real question is: why aren't more teams adopting ReMoE? Checkpoints and usage guidelines are accessible at the project's GitHub repository, providing an easy entry point for those ready to take the leap.
Key Terms Explained
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
GPU: Graphics Processing Unit.
Inference: Running a trained model to make predictions on new data.
Llama: Meta's family of open-weight large language models.