A model that generates output one piece at a time, with each new piece depending on all the previous ones. GPT and other large language models work this way — they predict the next token based on everything that came before it. Great for text generation, but inherently sequential.
Autoregressive models generate output one piece at a time, where each new piece depends on everything that came before it. In language models, this means predicting one token, appending it to the sequence, then predicting the next token, and so on. It's like writing a sentence where each word choice constrains what comes next.
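The predict-append-repeat loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real language model: a toy bigram table (counting which word follows which in a tiny corpus) stands in for the neural network, but the autoregressive structure of the loop is the same.

```python
import random

# Toy corpus used to build a bigram "model": for each token,
# record which tokens have been observed to follow it.
corpus = "the cat sat on the mat the cat ran".split()
follows = {}
for prev, nxt in zip(corpus, corpus[1:]):
    follows.setdefault(prev, []).append(nxt)

def generate(start, max_tokens=6, seed=0):
    """Autoregressive loop: predict one token, append it, repeat."""
    random.seed(seed)
    tokens = [start]
    for _ in range(max_tokens):
        candidates = follows.get(tokens[-1])
        if not candidates:  # no known continuation: stop early
            break
        # "Predict" the next token given what came before, then append it.
        tokens.append(random.choice(candidates))
    return " ".join(tokens)

print(generate("the"))
```

A real model replaces the bigram lookup with a neural network that conditions on the entire sequence so far, but the generation loop is identical, which is why the sequential bottleneck discussed below applies to both.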
This approach is what makes chatbots feel like they're "thinking" as they type — they really are generating one token at a time. GPT, Claude, LLaMA, and virtually every modern language model works this way. The tradeoff is speed: because each token depends on all the previous ones, generation can't easily be parallelized. That's why responses take time to stream in.
The alternative approaches — like masked language models (BERT) or diffusion models — work differently. BERT predicts missing words in the middle of sentences (good for understanding, not great for generation). Diffusion models start with noise and refine it into output. But for open-ended text generation, autoregressive models still dominate because they naturally capture the left-to-right flow of language.
"Claude is an autoregressive model — it generates each word based on everything before it, which is why you see responses appear token by token."
The neural network architecture behind virtually all modern AI language models.
The fundamental task that language models are trained on: given a sequence of tokens, predict what comes next.
An AI model that understands and generates human language.
A mathematical function applied to a neuron's output that introduces non-linearity into the network.
An optimization algorithm that combines the strengths of two earlier methods, AdaGrad and RMSProp.
Artificial General Intelligence: a hypothetical AI system with human-level competence across a wide range of tasks.