Do Transformers Need All Three Projections?
Exploring the impact of projection sharing in transformers, researchers find potential for reduced memory usage without significant performance loss. Could this be the breakthrough edge devices need?
Transformers have become a staple of AI, powering models across a wide range of tasks. But are all three of their attention projections (query, key, and value) truly necessary? The question looks different from Nairobi, where a model's efficiency can mean the difference between feasible and impractical deployment.
Breaking Down Projection Sharing
In a recent study, researchers took a closer look at the three projections in transformers: query, key, and value, known as QKV. They tested three different constraints: shared key-value (Q-K=V), shared query-key (Q=K-V), and a single projection (Q=K=V). The experiments spanned synthetic tasks, vision datasets like MNIST and CIFAR, and language models with up to 1.2 billion parameters.
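To make the three constraints concrete, here is a minimal single-head sketch in Python with NumPy. It is an illustration under assumptions, not the study's code: the helper names (make_projections, attention, count_params), the toy sizes, and the random initialization are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_projections(d_model, d_head, mode, rng):
    """Return (w_q, w_k, w_v) for one head under a sharing mode (hypothetical helper)."""
    def new():
        return rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
    if mode == "QKV":      # standard: three independent projections
        return new(), new(), new()
    if mode == "Q-K=V":    # keys and values share one matrix
        w_q, w_kv = new(), new()
        return w_q, w_kv, w_kv
    if mode == "Q=K-V":    # queries and keys share one matrix
        w_qk, w_v = new(), new()
        return w_qk, w_qk, w_v
    if mode == "Q=K=V":    # a single matrix plays all three roles
        w = new()
        return w, w, w
    raise ValueError(f"unknown mode: {mode}")

def attention(x, w_q, w_k, w_v):
    # Plain scaled dot-product attention; no masking, single head.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def count_params(w_q, w_k, w_v):
    # Count each distinct matrix once, so shared matrices are not double counted.
    unique = {id(w): w.size for w in (w_q, w_k, w_v)}
    return sum(unique.values())

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))  # 5 tokens, model width 16 (toy sizes)
param_counts = {}
for mode in ("QKV", "Q-K=V", "Q=K-V", "Q=K=V"):
    ws = make_projections(16, 8, mode, rng)
    out = attention(x, *ws)       # shape (5, 8) in every mode
    param_counts[mode] = count_params(*ws)
    print(mode, out.shape, param_counts[mode])
```

With these toy sizes each matrix holds 128 weights, so the standard setup uses 384 projection parameters, either two-matrix variant uses 256, and the single-projection variant uses 128. The output shape is the same in every mode; only the number of distinct matrices changes.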
The key finding was that the constrained variants performed on par with, and sometimes better than, the standard QKV setup. The standout was the Q-K=V variant: because keys and values are the same tensor, only one of them has to be cached, which cuts the KV cache in half at the cost of a 3.1% degradation in perplexity. That's a trade-off many would take, especially when memory is at a premium.
The Edge Deployment Advantage
Why should this matter to you? Picture deploying AI on edge devices with limited memory. The Q-K=V projection, combined with head-sharing techniques such as grouped-query attention (GQA) and multi-query attention (MQA), can achieve up to 96.9% cache reduction. That's not just an incremental improvement; it could be a big deal for practical on-device inference.
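The cache arithmetic is easy to sketch. The article does not say which configuration produced the 96.9% figure, so the numbers below are assumptions chosen for illustration: a hypothetical model with 32 attention heads, a GQA setup that keeps 2 key-value heads (a 16-fold reduction), and 16-bit cache entries. The function names are likewise hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len,
                   bytes_per_elem=2, shared_kv=False):
    """Approximate KV cache size for one sequence (hypothetical helper).

    A standard cache stores two tensors per token (keys and values).
    With K=V sharing, a single tensor serves both roles.
    """
    tensors_per_token = 1 if shared_kv else 2
    return (n_layers * n_kv_heads * d_head * seq_len
            * bytes_per_elem * tensors_per_token)

def reduction(baseline, reduced):
    # Fraction of the baseline cache that is saved.
    return 1.0 - reduced / baseline

# Assumed toy configuration, not taken from the study.
cfg = dict(n_layers=24, d_head=64, seq_len=4096)

baseline = kv_cache_bytes(n_kv_heads=32, **cfg)                     # standard multi-head
kv_shared = kv_cache_bytes(n_kv_heads=32, shared_kv=True, **cfg)    # K=V only
gqa_shared = kv_cache_bytes(n_kv_heads=2, shared_kv=True, **cfg)    # 16x fewer KV heads + K=V

print(f"K=V alone:       {reduction(baseline, kv_shared):.1%}")
print(f"K=V + 16x GQA:   {reduction(baseline, gqa_shared):.1%}")
```

Under these assumptions, sharing K and V alone saves 50%, and stacking it on a 16-fold cut in key-value heads saves 1 - 1/32, or about 96.9%. Other combinations of head count and sharing would give different totals.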
This isn't about replacing workers. It's about reach. With projection sharing, AI can become more accessible in regions where computational resources are scarce. The benefits aren't just theoretical. They're quantifiable and directly impact memory usage.
Quality Versus Complexity
But here's where it gets controversial. The study suggests that Q-K=V works because keys and values occupy similar representational spaces, keeping the low-rank nature of attention intact. Meanwhile, Q=K-V disrupts this balance, breaking the directionality of attention.
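One way to read the "directionality" remark, offered here as an interpretation rather than the study's own argument: when queries and keys come from the same projection, the pre-softmax score matrix becomes symmetric, so token i scores token j exactly as j scores i. The NumPy sketch below checks this on random toy data; all sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 16))    # 6 tokens, model width 16 (toy sizes)
w_q = rng.standard_normal((16, 8))  # query projection
w_k = rng.standard_normal((16, 8))  # separate key projection

# Standard setup: scores[i, j] and scores[j, i] generally differ.
separate = (x @ w_q) @ (x @ w_k).T

# Q=K setup: the same matrix projects queries and keys,
# so the scores form a Gram matrix, which is symmetric.
tied = (x @ w_q) @ (x @ w_q).T

def is_symmetric(m):
    return bool(np.allclose(m, m.T))

print("separate Q and K symmetric:", is_symmetric(separate))
print("tied Q=K symmetric:        ", is_symmetric(tied))
```

The symmetry holds for the raw scores. Row-wise softmax and causal masking are applied afterwards, so the final attention weights need not be exactly symmetric, but the model can no longer learn separate "who asks" and "who answers" roles for a pair of tokens.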
This raises an important question: Are we over-complicating transformers with three projections when one or two could suffice? The answer could reshape how we think about AI efficiency.
Silicon Valley designs it. The question is where it works. For those of us in emerging economies, these findings aren't just academic. They could set the stage for AI technologies that are both affordable and effective, without compromising too much on quality.
The farmer I spoke with put it simply: "If it costs less and does the job, why not use it?" That's the crux of the matter. In practice, reducing complexity could lead to broader AI adoption where it's needed most.