The basic unit of text that language models work with. Not quite a word — tokens can be whole words, parts of words, or even single characters. 'Understanding' might be one token; 'un' + 'der' + 'standing' might be three. Most models process about 1.3 tokens per English word. Token limits define context windows.
A token is the basic unit of text that language models process. Models don't read characters or words — they read tokens, which are chunks of text determined by the model's tokenizer. Common words like "the" or "hello" are usually single tokens. Less common words get split: "tokenization" might become "token" + "ization." Numbers, punctuation, and code each have their own tokenization patterns.
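The splitting described above can be illustrated with a toy greedy longest-match subword tokenizer. This is a simplified sketch with a hypothetical hand-written vocabulary, not how production tokenizers work — real schemes such as BPE learn their vocabulary from training data — but it shows how "tokenization" can come apart into "token" + "ization":

```python
# Toy greedy longest-match subword tokenizer.
# The vocabulary here is hand-picked for illustration; real tokenizers
# (e.g. BPE) learn tens of thousands of pieces from a text corpus.
def tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        # Find the longest vocabulary entry matching at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Fall back to a single character if nothing matches.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"token", "ization", "the", "hello"}
print(tokenize("tokenization", vocab))  # ['token', 'ization']
print(tokenize("hello", vocab))         # ['hello']
```

Common words match the vocabulary whole and cost one token; rarer words fall apart into several pieces, which is exactly the pattern the body text describes.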
Token counts matter for practical reasons. Context windows are measured in tokens, not words. API pricing is per token. A rough rule of thumb: one token equals about 0.75 English words, or about 4 characters. So a 1,000-word document is roughly 1,333 tokens. Different models use different tokenizers, so the exact count varies. GPT and Claude use different tokenization schemes and will produce different token counts for the same text.
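The rules of thumb above translate into a one-line estimate. This is a sketch of the heuristic only — exact counts always depend on the specific model's tokenizer:

```python
# Rough token estimates using the rules of thumb from the text:
# ~0.75 English words per token, or ~4 characters per token.
# Heuristics only; the real count varies by tokenizer and by content
# (code, numbers, and non-English text tokenize differently).
def estimate_tokens_from_words(word_count):
    return round(word_count / 0.75)

def estimate_tokens_from_chars(char_count):
    return round(char_count / 4)

print(estimate_tokens_from_words(1000))  # ~1333 tokens for a 1,000-word document
```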
The tokenizer design significantly affects model performance. Models struggle more with text that requires many tokens per word — which is one reason LLMs historically performed worse on non-English languages (more tokens needed per word) and code (lots of unusual character sequences). Newer tokenizers are designed with better multilingual and code coverage. Understanding tokenization helps you write better prompts — keeping things concise saves tokens and money, and avoiding unusual formatting reduces tokenization artifacts.
"Our API call used 4,200 tokens — 3,800 for the prompt and 400 for the response. At $0.01 per 1K tokens, that's about 4 cents per request."
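The arithmetic in the quote can be reproduced directly. The $0.01-per-1K-token price is illustrative only; real API prices vary by model and often differ between prompt and response tokens:

```python
# Per-request cost from token counts, matching the example quote.
# price_per_1k is a placeholder rate, not any provider's actual pricing.
def request_cost(prompt_tokens, response_tokens, price_per_1k=0.01):
    total_tokens = prompt_tokens + response_tokens
    return total_tokens / 1000 * price_per_1k

cost = request_cost(3800, 400)
print(f"${cost:.3f} per request")  # $0.042 per request, about 4 cents
```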
The component that converts raw text into tokens that a language model can process.
The maximum amount of text a language model can process at once, measured in tokens.
An AI model that understands and generates human language.
A mathematical function applied to a neuron's output that introduces non-linearity into the network.
An optimization algorithm that combines the best parts of two other methods — AdaGrad and RMSProp.
Artificial General Intelligence — a hypothetical AI system with human-level capability across a wide range of cognitive tasks, rather than a single domain.