An architecture where multiple specialized sub-networks (experts) share a model, but only a few activate for each input. A router decides which experts to use. This lets models be massive in total parameters while keeping compute per token manageable. Mixtral and reportedly GPT-4 use this approach.
Mixture of Experts (MoE) is an architecture where a model contains multiple specialized sub-networks ("experts") and a routing mechanism that selects which experts to activate for each input. Instead of running the entire model for every token, only a fraction of the parameters are active at any time. This lets you build models with enormous total parameter counts while keeping compute costs manageable.
GPT-4 is widely believed to be a MoE model, reportedly with 8 experts of about 220 billion parameters each (1.76 trillion total), with only 2 experts active per token. Mixtral from Mistral AI openly uses MoE: each layer has 8 expert feed-forward networks, and a router sends each token to 2 of them. Because the attention layers are shared across experts, the model has about 47B total parameters (not 8 × 7B) but uses only about 13B per token, making it much faster than a dense model of comparable size.
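The 47B-total / 13B-active arithmetic is easy to reproduce. A rough sketch in plain Python, using an assumed split of roughly 1.6B shared (attention, embeddings) and 5.64B per expert, which is consistent with the figures quoted above but is an illustrative breakdown, not Mistral's published one:

```python
# Back-of-the-envelope parameter math for a Mixtral-style MoE.
# The shared/per-expert split below is an assumption chosen to
# match the ~47B total / ~13B active figures, not official numbers.

SHARED = 1.6e9           # attention + embeddings, shared by all experts (assumed)
PER_EXPERT_FFN = 5.64e9  # feed-forward parameters owned by one expert (assumed)
NUM_EXPERTS = 8
ACTIVE_EXPERTS = 2

total = SHARED + NUM_EXPERTS * PER_EXPERT_FFN      # parameters stored: ~46.7B
active = SHARED + ACTIVE_EXPERTS * PER_EXPERT_FFN  # parameters used per token: ~12.9B

print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
```

The key point the numbers make: memory cost scales with `total`, but per-token compute scales with `active`.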
The routing decision is where the magic happens. A small gating network looks at each token and decides which experts should handle it. Different experts naturally specialize in different things — one might handle code, another might be better at reasoning, another at languages. The challenge is training the router well and ensuring all experts get used ("load balancing"). When it works, MoE gives you the quality of a huge model at the speed of a much smaller one.
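The gating step described above can be sketched in a few lines of plain Python. This is a toy illustration (all names are hypothetical, and scalars stand in for each expert's output vector): softmax the router's scores, keep the two largest, renormalize over that pair, and mix the chosen experts' outputs.

```python
import math

def softmax(logits):
    """Standard softmax over a list of scores."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(gate_logits, expert_outputs):
    """Pick the 2 highest-scoring experts and blend their outputs,
    weighted by their renormalized gate probabilities."""
    probs = softmax(gate_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)  # renormalize over the chosen pair
    mixed = sum(probs[i] / norm * expert_outputs[i] for i in top2)
    return top2, mixed

# Four experts; scalars stand in for each expert's output vector.
chosen, out = top2_route([2.0, 0.1, 1.5, -1.0], [10.0, 20.0, 30.0, 40.0])
```

In a real model the two unchosen experts are simply never run for this token, which is where the compute savings come from; the load-balancing loss mentioned above nudges the router so no expert is starved of traffic during training.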
"Mixtral uses a Mixture of Experts approach — it has 47B total parameters but only activates 13B per token, which is why it's nearly as good as Llama 2 70B but much faster."
Transformer: The neural network architecture behind virtually all modern AI language models.
Activation Function: A mathematical function applied to a neuron's output that introduces non-linearity into the network.
Adam: An optimization algorithm that combines the strengths of two other methods, AdaGrad and RMSProp.
AGI: Artificial General Intelligence.
AI Agent: An autonomous AI system that can perceive its environment, make decisions, and take actions to achieve goals.
AI Alignment: The research field focused on making sure AI systems do what humans actually want them to do.