Apple's New AI Models Challenge the DRAM Ceiling

Apple's third-gen AI models, developed with Google, sidestep DRAM limits, enabling massive on-device AI. But deployment hurdles remain.
Apple is shaking things up again. This time, it's by breaking the DRAM ceiling with its third-generation foundation AI models. Announced at WWDC26, these models do something previously thought impossible: they keep the model's full weight set in flash storage instead of DRAM. For enterprise architects, this means a fresh choice between established cloud-dependent models and, now, some seriously beefed-up on-device options.
New Architecture, New Possibilities
The AFM 3 family, developed with a little help from Google, spans five models. Two run on-device, while three sit comfortably inside Apple's Private Cloud Compute boundary. The on-device star here? AFM 3 Core Advanced, packing a whopping 20 billion parameters. Instead of cramming into the tight space of DRAM, these parameters live in NAND flash.
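To see why 20 billion parameters strain DRAM, a little back-of-the-envelope arithmetic helps. The sketch below is illustrative only: Apple hasn't disclosed the precision its weights are stored at, so the bit widths are assumptions, and `weight_footprint_gb` is a hypothetical helper, not anything from Apple's tooling.

```python
def weight_footprint_gb(num_params: int, bits_per_weight: float) -> float:
    """Raw storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 20_000_000_000  # the 20 billion parameters cited for AFM 3 Core Advanced

# Assumed precisions; the real storage format has not been published.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit weights: {weight_footprint_gb(PARAMS, bits):.1f} GB")
```

Even at an aggressive 4 bits per weight, the weights alone would come to roughly 10 GB under these assumptions, before counting the operating system, apps, or the model's own working memory.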
You might be thinking, "So what?" Well, remember the memory wall every local AI developer hits? Apple is jumping right over that with some pretty radical architecture shifts. According to Awni Hannun, a former Apple research scientist, a small model now predicts which experts to load from NAND to RAM. That's a clever workaround to the age-old problem of DRAM space.
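Here's a toy sketch of that idea: a small predictor picks experts, and only those get copied from flash into a limited DRAM buffer. Everything in it is an assumption made for illustration. `predict_experts` is a hash-based stand-in for a learned predictor, and `ExpertStore` uses plain dictionaries in place of real storage; none of it reflects Apple's actual implementation.

```python
import zlib
from dataclasses import dataclass, field


def predict_experts(prompt: str, num_experts: int, k: int) -> list[int]:
    """Hypothetical predictor: each prompt word votes for one expert via a checksum."""
    votes = [0] * num_experts
    for word in prompt.lower().split():
        votes[zlib.crc32(word.encode("utf-8")) % num_experts] += 1
    # Highest vote count first; ties broken by lower expert id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-votes[e], e))
    return sorted(ranked[:k])


@dataclass
class ExpertStore:
    """Toy model of weights living in flash with a small DRAM buffer in front."""
    flash: dict[int, bytes]   # expert id -> weights (stand-in for NAND)
    dram_budget: int          # maximum number of experts resident in DRAM
    dram: dict[int, bytes] = field(default_factory=dict)
    flash_reads: int = 0

    def load(self, expert_ids: list[int]) -> None:
        if len(expert_ids) > self.dram_budget:
            raise ValueError("requested experts exceed the DRAM budget")
        # Evict experts that are no longer needed.
        for eid in list(self.dram):
            if eid not in expert_ids:
                del self.dram[eid]
        # Read in only the experts that are not already resident.
        for eid in expert_ids:
            if eid not in self.dram:
                self.dram[eid] = self.flash[eid]
                self.flash_reads += 1


store = ExpertStore(flash={i: bytes(4) for i in range(8)}, dram_budget=2)
chosen = predict_experts("summarize this contract", num_experts=8, k=2)
store.load(chosen)
print(chosen, store.flash_reads)
```

The point of the design is that DRAM only ever holds the experts the predictor asked for, so the buffer size, not the full model size, sets the memory requirement.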
The Devil's in the Details
Apple's approach, dubbed Instruction-Following Pruning (IFP), stores the full weight set in flash, with DRAM acting as a buffer for the experts a given request needs. This isn't your typical Mixture of Experts (MoE) model, which selects different experts for every token generated. Instead, Apple's model routes once per prompt, loads the needed experts, and runs the show from there.
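Why does routing once matter when weights sit in flash? A rough way to see it is to count how many times an expert has to be read into DRAM under each scheme. The sketch below is a toy comparison under stated assumptions: the token routes are made up, the DRAM buffer is modeled as a simple LRU cache, and both function names are hypothetical. It says nothing about Apple's real numbers.

```python
from collections import OrderedDict


def loads_per_token_routing(token_routes, capacity: int) -> int:
    """Count flash reads when every token picks its own experts.

    DRAM is modeled as an LRU cache holding at most `capacity` experts.
    """
    resident = OrderedDict()
    loads = 0
    for experts in token_routes:
        for expert in experts:
            if expert in resident:
                resident.move_to_end(expert)      # mark as recently used
            else:
                loads += 1                        # miss: read from flash
                resident[expert] = True
                if len(resident) > capacity:
                    resident.popitem(last=False)  # evict least recently used
    return loads


def loads_prompt_routing(prompt_experts) -> int:
    """Count flash reads when experts are chosen once and reused for every token."""
    return len(set(prompt_experts))


# Made-up routes for four generated tokens, two experts each.
routes = [(0, 1), (2, 3), (0, 4), (1, 5)]
print(loads_per_token_routing(routes, capacity=2))  # 8
print(loads_prompt_routing((0, 1)))                 # 2
```

In this contrived case the per-token scheme goes back to flash for every expert it touches, while the route-once scheme pays the load cost a single time up front. How close real workloads come to either extreme is exactly the kind of detail Apple hasn't published.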
But Apple's left a few details out of the limelight. While we know the memory design and sparse activation mechanism, Apple hasn't said much about deployment constraints. Questions about energy usage, memory bandwidth, and thermal performance hang in the air like a cloud of unspoken doubt. Marco Abis, who’s knee-deep in developing Ziraph, points out the missing metrics that could make or break on-device performance.
What It Means for Enterprises
For regulated industries eyeing AI deployments, there's a new architectural decision to weigh: the DRAM wall for on-device agents just moved. Enterprises now have a 20-billion-parameter local option. But there's a catch. Apple hasn't said when a request offloads to the cloud or whether that routing is visible to developers. That's a compliance headache waiting to happen.
Then there's the Google Cloud dependency. AFM 3 Cloud Pro runs on Nvidia GPUs within Google Cloud, which may raise some eyebrows. It’s a private cloud, sure, but the dependency remains. Whether you like it or not, Google's got a hand in the server-side game.
So, here's the big question: Can Apple deliver these models at scale? With a technical report due this summer, maybe we'll finally get some answers. But until then, enterprise architects have yet another thing to chew on.
Key Terms Explained
Compute: The processing power needed to train and run AI models.
Mixture of Experts (MoE): An architecture where multiple specialized sub-networks (experts) share a model, but only a few activate for each input.
Nvidia: The dominant provider of AI hardware.
Parameter: A value the model learns during training, specifically the weights and biases in neural network layers.