MOSAIC Framework: Accelerating Mixture-of-Agents on Limited GPUs

Efficiency in Mixture-of-Agents (MoA) systems is often hampered by the inherent complexity of routing queries. These systems rely on multiple expert large language models (LLMs) to enhance reasoning accuracy. The challenge? Limited GPU resources can bottleneck performance. MOSAIC, a new scheduling framework, aims to solve this.

Beyond Traditional Scheduling

Traditional scheduling strategies struggle with the load imbalances caused by skill-based routing, where some experts receive far more queries than others. The wide variability in generation lengths between instruction-tuned and long-reasoning models makes the problem worse. MOSAIC's answer to this challenge is twofold.

First, it introduces an Integer Linear Program (ILP) based scheduler. Using offline-profiled costs, this scheduler optimizes both expert placement and per-worker prompt assignment. By replicating the heavier reasoning experts across workers and pinning lightweight ones to a single worker, MOSAIC keeps load balanced and resources in use.
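To make the placement idea concrete, here is a minimal sketch on a toy instance. It is not MOSAIC's actual ILP, which the article does not spell out: it brute-forces every feasible prompt-to-worker assignment instead of calling a solver, and it assumes the objective is to minimize the busiest worker's total time (the makespan). The expert names, costs, and the function name best_assignment are all hypothetical.

```python
from itertools import product


def best_assignment(prompts, cost, hosts, n_workers):
    """Exhaustively pick a worker for every prompt to minimize the makespan.

    prompts:   list of expert names, one entry per routed prompt
    cost:      expert -> profiled seconds per prompt (made-up numbers)
    hosts:     expert -> tuple of worker ids holding a copy of that expert
    Brute force stands in for an ILP solver; only viable for tiny inputs.
    """
    best_makespan, best_choice = float("inf"), None
    # A prompt may only run on a worker that hosts its expert.
    for choice in product(*(hosts[expert] for expert in prompts)):
        load = [0.0] * n_workers
        for expert, worker in zip(prompts, choice):
            load[worker] += cost[expert]
        makespan = max(load)
        if makespan < best_makespan:
            best_makespan, best_choice = makespan, choice
    return best_makespan, best_choice


# Hypothetical offline profile: the reasoning expert is 8x more expensive.
cost = {"reasoner": 8.0, "coder": 1.0, "math": 1.0}
prompts = ["reasoner"] * 3 + ["coder"] * 4 + ["math"] * 2

# Placement A: every expert pinned to a single worker.
pinned = {"reasoner": (0,), "coder": (0,), "math": (1,)}
# Placement B: the reasoning expert replicated, lightweight experts pinned.
replicated = {"reasoner": (0, 1), "coder": (0,), "math": (1,)}

pinned_makespan, _ = best_assignment(prompts, cost, pinned, n_workers=2)
repl_makespan, repl_choice = best_assignment(prompts, cost, replicated, n_workers=2)
print(pinned_makespan, repl_makespan)
```

In this toy case, replication lets the scheduler move two of the three expensive reasoning prompts to the otherwise lightly loaded worker, which shortens the makespan. A real deployment would hand the same decision variables to an ILP solver rather than enumerate them.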

Confidence-Aware Adaptive Aggregation

The second innovation is its confidence-aware adaptive aggregation. MOSAIC uses inter-expert agreement as a confidence signal: when the experts already reach consensus on a query, it bypasses the resource-heavy final aggregator LLM. This means less GPU idling and more throughput, which is essential in systems with limited GPU resources.
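The bypass logic can be sketched in a few lines. This is an illustrative assumption about how such a check might look, not MOSAIC's implementation: the agreement measure (share of experts giving the most common answer), the 0.75 threshold, the string normalization, and the run_aggregator callable are all hypothetical.

```python
from collections import Counter


def aggregate(answers, run_aggregator, threshold=0.75):
    """Return (final_answer, aggregator_was_called).

    answers:        list of answer strings, one per expert
    run_aggregator: callable standing in for the final aggregator LLM
    threshold:      assumed minimum agreement needed to skip the aggregator
    """
    # Normalize lightly so trivial formatting differences still count as agreement.
    normalized = [a.strip().lower() for a in answers]
    top_answer, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(normalized)
    if agreement >= threshold:
        # Consensus query: skip the expensive aggregator call entirely.
        return top_answer, False
    # Experts disagree: fall back to the aggregator LLM.
    return run_aggregator(answers), True


def fake_aggregator(answers):
    # Placeholder for a real LLM call.
    return "aggregated answer"


consensus = aggregate(["42", " 42", "42", "41"], fake_aggregator)
disputed = aggregate(["a", "b", "c"], fake_aggregator)
print(consensus, disputed)
```

Exact string matching is the simplest possible agreement test. Free-form answers would need a more tolerant comparison, and the threshold would have to be tuned against the accuracy budget.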

In a 4-GPU setup, MOSAIC reports speedups of up to 2.5x in the expert stage, 4.23x in the aggregator stage, and 1.7x to 2.3x end to end over conventional schedulers. Accuracy stays nearly intact, deviating by no more than 0.1 percentage points.

Why This Matters

Why should developers care? Because MOSAIC transforms how we handle MoA workloads on constrained hardware. With GPUs being a costly resource, maximizing their efficiency is critical. MOSAIC not only accelerates processing but also paves the way for more sophisticated MoA applications without the need for massive infrastructure investments.

In practice, MOSAIC avoids the GPU idling that traditional schedulers suffer from. It keeps the available hardware busy to sustain high throughput and low latency.

Are existing scheduling strategies obsolete in the face of MOSAIC's advancements? They just might be. As AI workloads grow more complex, frameworks like MOSAIC will be the key to maintaining efficiency without sacrificing accuracy.
