An architecture where multiple specialized sub-networks (experts) share a model, but only a few activate for each input. A router decides which experts to use. This lets models be massive in total parameters while keeping compute per token manageable. Mixtral and reportedly GPT-4 use this approach.
Mixture of Experts (MoE) is an architecture where a model contains multiple specialized sub-networks ("experts") and a routing mechanism that selects which experts to activate for each input. Instead of running the entire model for every token, only a fraction of the parameters are active at any time. This lets you build models with enormous total parameter counts while keeping compute costs manageable.
GPT-4 is widely believed to be a MoE model, reportedly with 8 experts of about 220 billion parameters each (1.76 trillion total), with only 2 experts active per token. Mixtral from Mistral AI openly uses MoE: each layer has 8 expert feed-forward networks, and a router sends each token to 2 of them. Because the attention layers are shared across experts, the model has about 47B total parameters (not 8 × 7B) but uses only about 13B per token, making it much faster than a dense model of comparable size.
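The 47B-total / 13B-active arithmetic is easy to reproduce. A rough sketch in plain Python, using an assumed split of roughly 1.6B shared (attention, embeddings) and 5.64B per expert, which is consistent with the figures quoted above but is an illustrative breakdown, not Mistral's published one:

```python
# Back-of-the-envelope parameter math for a Mixtral-style MoE.
# The shared/per-expert split below is an assumption chosen to
# match the ~47B total / ~13B active figures, not official numbers.

SHARED = 1.6e9           # attention + embeddings, shared by all experts (assumed)
PER_EXPERT_FFN = 5.64e9  # feed-forward parameters owned by one expert (assumed)
NUM_EXPERTS = 8
ACTIVE_EXPERTS = 2

total = SHARED + NUM_EXPERTS * PER_EXPERT_FFN      # parameters stored: ~46.7B
active = SHARED + ACTIVE_EXPERTS * PER_EXPERT_FFN  # parameters used per token: ~12.9B

print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
```

The key point the numbers make: memory cost scales with `total`, but per-token compute scales with `active`.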
The routing decision is where the magic happens. A small gating network looks at each token and decides which experts should handle it. Different experts naturally specialize in different things — one might handle code, another might be better at reasoning, another at languages. The challenge is training the router well and ensuring all experts get used ("load balancing"). When it works, MoE gives you the quality of a huge model at the speed of a much smaller one.
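The gating step described above can be sketched in a few lines of plain Python. This is a toy illustration (all names are hypothetical, and scalars stand in for each expert's output vector): softmax the router's scores, keep the two largest, renormalize over that pair, and mix the chosen experts' outputs.

```python
import math

def softmax(logits):
    """Standard softmax over a list of scores."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(gate_logits, expert_outputs):
    """Pick the 2 highest-scoring experts and blend their outputs,
    weighted by their renormalized gate probabilities."""
    probs = softmax(gate_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)  # renormalize over the chosen pair
    mixed = sum(probs[i] / norm * expert_outputs[i] for i in top2)
    return top2, mixed

# Four experts; scalars stand in for each expert's output vector.
chosen, out = top2_route([2.0, 0.1, 1.5, -1.0], [10.0, 20.0, 30.0, 40.0])
```

In a real model the two unchosen experts are simply never run for this token, which is where the compute savings come from; the load-balancing loss mentioned above nudges the router so no expert is starved of traffic during training.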
"Mixtral uses a Mixture of Experts approach — it has 47B total parameters but only activates 13B per token, which is why it's nearly as good as Llama 2 70B but much faster."
Transformer: The neural network architecture behind virtually all modern AI language models.
Activation Function: A mathematical function applied to a neuron's output that introduces non-linearity into the network.
Adam: An optimization algorithm that combines the strengths of two other methods, AdaGrad and RMSProp.
AGI: Artificial General Intelligence.
AI Agent: An autonomous AI system that can perceive its environment, make decisions, and take actions to achieve goals.
AI Alignment: The research field focused on making sure AI systems do what humans actually want them to do.