Reducing AI Memory Load with Identical Cache Reuse

By Rina Shimizu, March 17, 2026

ICaRus proposes a solution to memory challenges in multi-model AI systems by enabling shared KV caches, cutting P95 latency by up to 11.1x.

Multi-model inference is transforming agentic AI systems. But it's not without headaches, notably the memory burden from Key-Value (KV) caches when each model maintains its own set for identical prompts.

Memory Challenges in AI Systems

When models in a system generate their own KV caches, memory demands skyrocket. This often leads to evicting caches, causing time-consuming recomputation when these caches are needed again. The problem worsens with multiple models as they must recompute KV caches for the same prompt, amplifying overhead.
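To get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. It uses the standard KV cache size estimate (keys plus values, per layer, per KV head, per token). The model shape and the prompt length are hypothetical values chosen for illustration and do not come from the ICaRus paper; only the eight-model count echoes the workflow size cited in this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate KV cache size for one prompt on one model.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape and prompt length (assumptions, not from the paper).
per_model = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

num_models = 8                       # workflow size mentioned in the article
separate = per_model * num_models    # every model keeps its own cache
shared = per_model                   # one cache reused by all models

MIB = 1024 ** 2
print(f"per model:      {per_model / MIB:.0f} MiB")
print(f"8 separate:     {separate / MIB:.0f} MiB")
print(f"1 shared cache: {shared / MIB:.0f} MiB")
```

Under these assumed numbers, a single 4,096-token prompt costs 512 MiB per model, so eight independent caches need 4 GiB where one shared cache would need 512 MiB.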

The ICaRus Solution

ICaRus, a proposed architecture, aims to alleviate these burdens by letting multiple models share KV caches across all layers. The paper, published in Japanese, describes how ICaRus decomposes a decoder-only Transformer into a logical encoder and a logical decoder. Only the decoder is fine-tuned; the encoder stays frozen and is therefore common to every model. This reduces memory demand and enables cross-model KV cache reuse.
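The idea of a frozen shared part plus per-model fine-tuned parts can be illustrated with a toy sketch. This is not the ICaRus implementation. It assumes the frozen logical encoder is what produces the keys and values, and every name here (frozen_encoder, SharedKVCache, TaskModel, query_scale) is hypothetical. The point is only that when the cache-producing component is identical across models, one cache entry per prompt can serve all of them.

```python
import math
from dataclasses import dataclass, field


def frozen_encoder(tokens):
    """Toy stand-in for the frozen logical encoder (assumption).

    All models share these 'weights', so the keys and values for a given
    prompt are the same no matter which model asks for them.
    """
    keys = [t * 0.5 for t in tokens]
    values = [t + 1.0 for t in tokens]
    return keys, values


@dataclass
class SharedKVCache:
    """One cache shared by every model, keyed by the prompt tokens."""
    store: dict = field(default_factory=dict)
    encoder_calls: int = 0

    def get(self, tokens):
        key = tuple(tokens)
        if key not in self.store:
            # Cache miss: run the encoder once and keep the result.
            self.encoder_calls += 1
            self.store[key] = frozen_encoder(tokens)
        return self.store[key]


@dataclass
class TaskModel:
    """Toy model whose only fine-tuned parameter is a query scale."""
    name: str
    query_scale: float  # hypothetical stand-in for fine-tuned decoder weights

    def decode(self, tokens, cache):
        keys, values = cache.get(tokens)
        # Softmax attention over the shared keys with a model-specific query.
        weights = [math.exp(self.query_scale * k) for k in keys]
        total = sum(weights)
        return sum(w / total * v for w, v in zip(weights, values))


cache = SharedKVCache()
prompt = [1, 2, 3]
models = [TaskModel("summarizer", 0.0), TaskModel("coder", 1.0), TaskModel("planner", 2.0)]
outputs = {m.name: m.decode(prompt, cache) for m in models}

print(outputs)
print("encoder runs:", cache.encoder_calls)
```

In this sketch the three models return different outputs for the same prompt, yet the encoder runs only once. With a separate cache per model, the same prompt would be encoded three times.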

Efficiency and Performance Gains

In the reported benchmarks, ICaRus achieves up to 11.1x lower P95 latency and 3.8x higher throughput in workflows involving eight different models, while maintaining accuracy on par with task-specific fine-tuned models across various tasks. When combined with lightweight adapters such as LoRA, it can also parallelize KV cache generation, which improves efficiency further.

Why ICaRus Matters

Why should we care about ICaRus? Simply put, it's a major shift for efficiency and scalability in AI systems. In an industry where speed and memory efficiency are important, reducing latency and boosting throughput can't be ignored. But the real question is, how quickly will other companies adopt this architecture? Western coverage has largely overlooked this, yet the potential impact on the industry is significant.
