AutoMegaKernel Shakes Up CUDA with a Single-Launch Forward Pass
AutoMegaKernel compiles Llama-family models into one CUDA kernel, removing the need to hand-write CUDA for each model. It's not just speed; it's a system overhaul.
JUST IN: There's a new kid on the block in AI model compilation. Meet AutoMegaKernel (AMK), the latest innovation that's turning heads. Forget rewriting CUDA code for each model. AMK compiles HuggingFace Llama models into a single, persistent CUDA kernel that runs the entire forward pass. Why does that matter? Because it's a massive shift in efficiency and system design, not just raw speed.
The Power of AMK
Sources confirm: AMK isn't just about running faster. It's a full system built around a frozen schedule-IR validator that enforces deadlock-freedom and race-freedom. That's right, the system statically checks the graph and rejects an unsafe schedule before it ever launches. In more than 7,160 adversarial schedules tested, including 6,091 unsafe ones, there wasn't a single false accept. That's zero misses, folks.
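To make the idea concrete, here's a minimal Python sketch of what a static schedule check can look like. This is not AMK's validator or its IR. The schedule layout (`queues`, `waits`, `reads`, `writes`) and the name `validate_schedule` are made up for illustration. It assumes each worker runs its queue in order and a task can block until other tasks finish. A cycle in those orderings means deadlock. Two unordered tasks touching the same buffer, with at least one of them writing, means a race.

```python
from collections import defaultdict
from itertools import combinations


def validate_schedule(queues, waits, reads, writes):
    """Statically check a toy schedule (hypothetical format, not AMK's IR).

    queues: list of per-worker task lists, run in order; task ids are unique.
    waits:  dict task -> set of tasks it blocks on.
    reads / writes: dict task -> set of buffer names.
    Returns (ok, reason).
    """
    tasks = [t for q in queues for t in q]
    known = set(tasks)

    # Collect "must finish before" edges.
    succ = defaultdict(set)
    for q in queues:
        for earlier, later in zip(q, q[1:]):
            succ[earlier].add(later)      # program order on one worker
    for task, deps in waits.items():
        for dep in deps:
            if task not in known or dep not in known:
                return False, f"invalid: unknown task in wait {task} <- {dep}"
            succ[dep].add(task)           # explicit wait

    # Kahn's algorithm: if some task can never become ready, there is a cycle.
    indeg = {t: 0 for t in tasks}
    for a in tasks:
        for b in succ[a]:
            indeg[b] += 1
    ready = [t for t in tasks if indeg[t] == 0]
    order = []
    while ready:
        a = ready.pop()
        order.append(a)
        for b in succ[a]:
            indeg[b] -= 1
            if indeg[b] == 0:
                ready.append(b)
    if len(order) != len(tasks):
        return False, "deadlock: cyclic wait"

    # Happens-before closure, filled in reverse topological order.
    after = {t: set() for t in tasks}
    for a in reversed(order):
        for b in succ[a]:
            after[a] |= {b} | after[b]

    # Any conflicting pair must be ordered one way or the other.
    for a, b in combinations(tasks, 2):
        wa, wb = writes.get(a, set()), writes.get(b, set())
        ra, rb = reads.get(a, set()), reads.get(b, set())
        conflict = (wa & (rb | wb)) | (wb & ra)
        ordered = b in after[a] or a in after[b]
        if conflict and not ordered:
            return False, f"race: {a} and {b} on {sorted(conflict)}"
    return True, "ok"
```

Feed it a schedule where `norm` reads a buffer that `matmul` writes on another worker: with a wait in place it passes, and without one it is rejected as a race before anything would run.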
What's more, the same source code can be retargeted from sm_80 to sm_120, auto-generating correct megakernels for all supported models. And when it was put to the test on a real SmolLM2-135M checkpoint, it matched HuggingFace's greedy decode down to the token. This isn't just about doing things differently, it's about doing them better.
Performance That Speaks Volumes
Now, let's talk numbers. AMK's search-found int8 megakernel (W8A16: 8-bit weights, 16-bit activations) outperforms CUDA-graphed cuBLAS bf16 at batch-1 decode on NVIDIA's inference-class datacenter GPUs. We're talking up to 1.33x on the L4 and 1.25-1.27x on the current-gen L40S. Even the consumer RTX 5090 saw a 1.19-1.23x boost. This isn't just a win, it's a statement.
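For readers new to the W8A16 label, here's a small pure-Python sketch of the arithmetic it implies: weights stored as int8 with a scale, activations kept at higher precision. It assumes symmetric per-row quantization, which is one common choice and not necessarily what AMK does. `quantize_rows` and `w8a16_matvec` are hypothetical names, and plain Python floats stand in for 16-bit activations.

```python
def quantize_rows(weight):
    """Symmetric per-row int8 quantization, so that w is roughly q * scale."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp into the int8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a16_matvec(q_rows, scales, x):
    """y[i] = scale[i] * sum_j q[i][j] * x[j], with x left unquantized."""
    return [s * sum(q * xj for q, xj in zip(row, x))
            for row, s in zip(q_rows, scales)]


weight = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, -1.0]
q_rows, scales = quantize_rows(weight)
print(w8a16_matvec(q_rows, scales, x))  # close to the exact result [-1.75, 2.5]
```

The point of the format is that each weight costs one byte to read instead of two, at the price of a small rounding error per weight.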
But here's where it gets wild. The results don't sort cleanly by memory bandwidth. Yes, the L40S at 864 GB/s comes out ahead of the A10G at 600 GB/s, but bandwidth isn't the deciding factor. What is? GPU class: inference-class parts versus training-class parts. And that's a critical distinction.
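For a sense of what bandwidth alone would predict, here's a back-of-envelope sketch. It assumes batch-1 decode streams every weight from memory once per token and ignores everything else (activations, KV cache, synchronization, launch overhead). The 135M parameter count is only an illustrative stand-in, and `min_decode_ms` is a hypothetical helper, not part of AMK. The bandwidth figures are the ones quoted above.

```python
def min_decode_ms(num_params, bytes_per_weight, bandwidth_gb_s):
    """Lower bound on per-token time if weight traffic were the only cost."""
    weight_bytes = num_params * bytes_per_weight
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


PARAMS = 135e6  # illustrative model size (assumption)
for name, bw in [("L40S", 864), ("A10G", 600)]:
    bf16 = min_decode_ms(PARAMS, 2, bw)  # 2 bytes per bf16 weight
    int8 = min_decode_ms(PARAMS, 1, bw)  # 1 byte per int8 weight
    print(f"{name}: bf16 >= {bf16:.3f} ms, int8 >= {int8:.3f} ms per token")
```

On this naive model, int8 would be worth exactly 2x on every GPU. Measured gains in the 1.19x to 1.33x range suggest that costs beyond raw weight traffic matter.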
The Bottom Line
So, where does AMK fall short? It trails cuBLAS on high-bandwidth, training-class GPUs like the A100 and H100. The bottleneck? Cross-SM synchronization. But let's be honest, no one wins every battle.
And just like that, the leaderboard shifts. AMK has planted its flag firmly in the ground. The labs are scrambling to catch up. So here's the real question: How soon until others follow suit? In a world where every millisecond counts, AMK isn't just an option, it's a necessity.
Code and harness are available on GitHub for the curious. Because if you're not keeping up, you're falling behind.