What They Are
A transformer is a type of neural network architecture introduced in the 2017 paper "Attention Is All You Need" by researchers at Google. The key innovation: instead of processing text one word at a time (like older RNN models), transformers process entire sequences at once using a mechanism called attention.
Attention lets every word in a sentence directly interact with every other word. When processing "The dog didn't cross the street because it was too wide," attention helps the model figure out that "it" refers to "street" (not "dog") because of the word "wide." This ability to capture long-range relationships is what makes transformers so powerful.
Why They Changed Everything
Before transformers, recurrent neural networks (RNNs) and LSTMs were the go-to for language tasks. They processed text sequentially — one word after another — which was slow and struggled with long sequences. By the time an RNN reached the end of a long paragraph, it had often "forgotten" the beginning.
Transformers solved both problems. They process all words in parallel (much faster, especially on GPUs), and attention lets every word access every other word regardless of distance. This parallelism also meant transformers could be scaled up dramatically — which is exactly what happened with GPT-3, GPT-4, and their competitors.
The architecture turned out to be remarkably versatile. Originally designed for translation, transformers now dominate language, vision, audio, code generation, protein folding, weather prediction, and more.
How They Work (Simplified)
1. Tokenization and Embedding. Input text gets split into tokens, and each token is converted into a numerical vector. Positional encodings are added so the model knows the order of tokens — attention by itself has no built-in sense of word order.
2. Self-attention. Each token creates three vectors: a Query (what am I looking for?), a Key (what do I represent?), and a Value (what information do I carry?). Each token's Query is compared against every other token's Key to calculate attention scores. High scores mean strong relevance. These scores determine how much each token contributes to the output.
3. Multi-head attention. The model runs multiple attention operations in parallel (the "heads"). Each head can learn to pay attention to different types of relationships — syntax, semantics, coreference, etc.
4. Feed-forward layers. After attention, each token's output passes through a small neural network applied independently at every position, which transforms the information that attention gathered.
5. Stack and repeat. These attention + feed-forward blocks are stacked many times. GPT-3 has 96 layers. More layers = more capacity to learn complex patterns.
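The steps above can be sketched in plain NumPy. This is a toy illustration, not a real model: the sequence length, embedding size, and random weights are made up for demonstration, and it uses a single attention head (step 3 would run several of these in parallel and concatenate the results).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8   # 4 tokens, 8-dimensional vectors (toy sizes)

# Step 1: token embeddings plus sinusoidal positional encodings
x = rng.normal(size=(seq_len, d_model))     # stand-in for learned embeddings
pos = np.arange(seq_len)[:, None]
dim = np.arange(d_model)[None, :]
angle = pos / (10000 ** (2 * (dim // 2) / d_model))
x = x + np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

# Step 2: self-attention. Learned projections make Q, K, V (random here).
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d_model)   # every token's Query vs every Key
attn = softmax(scores, axis=-1)       # each row sums to 1
out = attn @ V                        # weighted mix of Values

# Step 4: feed-forward layer, applied to each position independently
W1, W2 = rng.normal(size=(d_model, 32)), rng.normal(size=(32, d_model))
out = np.maximum(out @ W1, 0) @ W2    # ReLU MLP

print(out.shape)  # (4, 8): same shape out as in, so blocks can be stacked (step 5)
```

Because the block maps a (tokens, dimensions) array to another array of the same shape, stacking it 96 times, as in GPT-3, is just repeating this computation with different learned weights.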
Flavors of Transformers
Encoder-only (BERT): Processes input text and produces a contextual representation of it. Good for classification, search, and analysis. BERT reads the whole input at once, attending in both directions.
Decoder-only (GPT, Claude): Generates text one token at a time, left to right. Each token can only attend to tokens that came before it. This is what powers chatbots and text generation.
Encoder-decoder (T5, original transformer): Uses both. The encoder processes the input, the decoder generates the output. Good for translation and summarization.
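The decoder-only constraint ("each token can only attend to tokens that came before it") is enforced with a causal mask: attention scores for future positions are set to negative infinity before the softmax, so they get exactly zero weight. A minimal NumPy sketch, with toy sizes and random vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
seq_len, d = 5, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))

scores = Q @ K.T / np.sqrt(d)
# Causal mask: position i may only attend to positions j <= i,
# so we block out the strict upper triangle.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf
attn = softmax(scores, axis=-1)

print(np.round(attn, 2))
# Row 0 attends only to token 0; row 1 to tokens 0 and 1; and so on.
# The upper triangle of the attention matrix is exactly zero.
```

Encoder-only models like BERT simply skip this mask, which is why they can read the whole input bidirectionally.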
Key Examples
GPT-4: A decoder-only transformer with reportedly over a trillion parameters. Powers ChatGPT.
Claude: Anthropic's family of decoder-only transformers, trained with constitutional AI for safety.
BERT: Google's encoder-only transformer that transformed search. It understands search queries in context rather than matching keywords.
Vision Transformers (ViT): Applies the transformer architecture to images, splitting them into patches and processing them like text tokens.
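ViT's patch-splitting step can be sketched in a few lines of NumPy. The image and patch sizes here are toy values (real ViTs commonly use 224x224 images with 16x16 patches, then project each flattened patch through a learned linear layer before adding positions):

```python
import numpy as np

# A toy 32x32 RGB "image" with made-up pixel values.
img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patch = 8

# Cut the image into non-overlapping 8x8 patches and flatten each one,
# turning the image into a sequence of patch "tokens".
h, w, c = img.shape
patches = (img.reshape(h // patch, patch, w // patch, patch, c)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, patch * patch * c))

print(patches.shape)  # (16, 192): 16 tokens, each a 192-dimensional vector
```

From here the transformer treats the 16 patch vectors exactly like the token embeddings in the text case.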
Where to Go Next
- → Large Language Models — transformers at scale
- → Embeddings — how text becomes vectors
- → Deep Learning — the broader field
- → How AI Models Are Trained — training these massive models