What Training Actually Means
An untrained neural network is just random numbers. Training is the process of adjusting those numbers (called weights or parameters) so the model produces useful outputs. A model with 70 billion parameters has 70 billion numbers that need to land in roughly the right range. That's what training does.
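To make this concrete, here is a minimal sketch of what "a model is just numbers" means: two small weight matrices filled with random values. The sizes are made up for illustration; real models are vastly larger.

```python
import random

# Toy illustration: a "model" is just arrays of numbers (weights),
# randomly initialized. These sizes are invented; real layers are far bigger.
layer_sizes = [(64, 256), (256, 64)]

weights = [
    [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]
    for rows, cols in layer_sizes
]

# Total parameter count is just the sum of the matrix sizes.
n_params = sum(rows * cols for rows, cols in layer_sizes)
print(n_params)  # 32768 parameters, all random -- training nudges every one
```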
For language models, training means showing the model trillions of words of text and having it predict what comes next, over and over, adjusting weights each time to get better at the prediction. After enough iterations, the model has internalized grammar, facts, reasoning patterns, and even some common sense.
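The prediction objective can be shown with something far simpler than a neural network. This sketch counts which word follows which in a tiny corpus and predicts the most frequent continuation; real models learn the same "what comes next" mapping with billions of weights instead of a count table.

```python
from collections import Counter, defaultdict

# Minimal sketch of "predict the next word" via bigram counts.
corpus = "the cat sat on the mat . the cat ate".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    # Return the most frequently observed continuation of `word`.
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' -- seen twice after "the", vs. "mat" once
```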
The Training Pipeline
Modern AI training has several stages, each critical:
1. Data collection. You need data — lots of it. For language models, this means scraping the web, licensing books, ingesting code repositories, and gathering conversation data. Common Crawl, Wikipedia, GitHub, and published books are typical sources. Quality varies wildly, which is why the next step matters.
2. Data cleaning and filtering. Raw internet data is full of spam, duplicates, toxic content, and gibberish. Teams spend months cleaning datasets — removing duplicates, filtering out low-quality text, balancing representation across topics and languages. Data quality often matters more than quantity. Garbage in, garbage out.
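Two of the passes described above can be sketched in a few lines: exact deduplication by hashing, and a crude quality filter. The thresholds here are arbitrary illustration values; production pipelines use far more sophisticated heuristics and fuzzy deduplication.

```python
import hashlib

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # exact duplicate
    "$$$ CLICK HERE !!! $$$",                        # spammy
    "ok",                                            # too short
    "Neural networks learn by adjusting weights.",
]

def looks_clean(text, min_words=4, min_alpha=0.6):
    # Crude filter: require a minimum length and a mostly-alphabetic body.
    if len(text.split()) < min_words:
        return False
    alpha = sum(c.isalpha() or c.isspace() for c in text)
    return alpha / len(text) >= min_alpha

seen, cleaned = set(), []
for doc in docs:
    digest = hashlib.sha256(doc.encode()).hexdigest()
    if digest in seen or not looks_clean(doc):
        continue  # drop duplicates and low-quality text
    seen.add(digest)
    cleaned.append(doc)

print(len(cleaned))  # 2 documents survive
```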
3. Tokenization. Text gets split into tokens — roughly word fragments. "unbelievable" might become "un" + "believ" + "able." This is how models process language: not as characters or words, but as tokens. Most models use 30,000-100,000 unique tokens.
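The "unbelievable" example can be reproduced with a greedy longest-match tokenizer over a hand-picked toy vocabulary. Real tokenizers (BPE, SentencePiece) learn their vocabularies from data; this vocab is invented purely to show how a word splits into fragments.

```python
# Toy vocabulary: a few subwords plus single characters as a fallback.
vocab = {"un", "believ", "able", "a", "b", "l", "e"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character becomes its own token
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```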
4. Pre-training. The main training phase. The model processes the entire dataset, predicting the next token at each step. When it's wrong, backpropagation computes how much each weight contributed to the error, and an optimizer nudges the weights accordingly. This runs on clusters of hundreds or thousands of GPUs for weeks or months. A single GPT-4-scale run reportedly used around 25,000 A100 GPUs for about 100 days.
5. Fine-tuning. After pre-training, the model knows language but doesn't know how to be helpful. Fine-tuning on instruction-response pairs teaches it to follow directions, have conversations, and refuse harmful requests.
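In practice, fine-tuning data is serialized into a fixed text template before training. The tag format below is invented for illustration; every real system defines its own chat template.

```python
# Sketch of turning instruction-response pairs into fine-tuning text.
# The <|user|>/<|assistant|> tags are hypothetical, not any real format.
pairs = [
    {"instruction": "Why is the sky blue?",
     "response": "Air scatters blue light more than red, so the sky looks blue."},
]

def format_example(pair):
    return (f"<|user|>\n{pair['instruction']}\n"
            f"<|assistant|>\n{pair['response']}<|end|>")

sample = format_example(pairs[0])
print(sample.startswith("<|user|>"))  # the model is trained on text like this
```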
6. RLHF or similar alignment. RLHF (reinforcement learning from human feedback) works like this: human raters compare model outputs and rank them by quality. That feedback trains a reward model, which is then used to fine-tune the system further. This step is what makes the difference between a model that can write and one that's actually useful.
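The heart of the reward-model step can be sketched with the standard pairwise (Bradley-Terry) loss: given a human preference between two outputs, the loss shrinks as the reward model scores the chosen output above the rejected one.

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    # -log(sigmoid(r_chosen - r_rejected)): small when the reward model
    # agrees with the human ranking, large when it disagrees.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# An undecided reward model pays more loss than one that ranks correctly.
print(pairwise_loss(0.0, 0.0) > pairwise_loss(2.0, -1.0))  # True
```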
The Hardware
Training modern AI models is a hardware problem as much as a software one. NVIDIA dominates with its A100 and H100 GPUs, which are designed for the matrix math that neural networks need. A single H100 costs around $30,000, and you need thousands of them.
The GPUs are connected by high-speed networking (InfiniBand or NVLink) that lets them share data during training. Model parallelism splits a single model's weights across multiple GPUs; data parallelism gives each GPU a full copy of the model and a different slice of each batch, then averages the gradients so every copy stays in sync. Getting this orchestration right is one of the hardest engineering challenges in AI.
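Data parallelism in miniature: each "worker" below stands in for a GPU holding a full copy of the weights. Each computes gradients on its own shard of the batch, and an all-reduce averages them so every copy applies the identical update.

```python
batch_grads = [
    [0.25, -0.5],   # gradients from worker 0's shard of the batch
    [0.75, -0.5],   # gradients from worker 1's shard of the batch
]

def all_reduce_mean(grads):
    # Average each gradient component across workers (the all-reduce step).
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

avg = all_reduce_mean(batch_grads)
print(avg)  # [0.5, -0.5] -- the same update is applied on every worker
```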
Energy consumption is substantial. Training GPT-4 reportedly consumed around 50 GWh of electricity — enough to power 4,500 US homes for a year. This has raised legitimate questions about the environmental cost of AI development.
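The homes comparison follows directly from the reported figure. A typical US home uses roughly 10,000-11,000 kWh per year, so 50 GWh spread over 4,500 homes lands in the right range:

```python
# Back-of-envelope check of the claim above.
gwh = 50
homes = 4500
kwh_per_home = gwh * 1_000_000 / homes  # 1 GWh = 1,000,000 kWh
print(round(kwh_per_home))  # 11111 kWh -- about one US home's annual use
```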
What Can Go Wrong
Training failures are expensive. A hardware fault after two months of training can corrupt the run. Teams use checkpointing — saving snapshots of the model periodically — so they don't have to start from zero. But even with checkpoints, debugging a failed training run on thousands of GPUs is brutal.
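The checkpointing pattern looks roughly like this: save a snapshot every few steps, then resume from the latest one after a crash. Real systems also checkpoint optimizer state and use sharded binary formats; JSON here just shows the shape of the idea.

```python
import json
import os
import tempfile

ckpt_dir = tempfile.mkdtemp()

def save_checkpoint(step, weights):
    # Zero-padded step number so lexicographic sort finds the latest file.
    path = os.path.join(ckpt_dir, f"step_{step:06d}.json")
    with open(path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)

def load_latest():
    files = sorted(os.listdir(ckpt_dir))
    with open(os.path.join(ckpt_dir, files[-1])) as f:
        return json.load(f)

weights = [0.0]
for step in range(1, 8):          # stand-in for the training loop
    weights = [w + 0.1 for w in weights]
    if step % 3 == 0:             # snapshot every 3 steps
        save_checkpoint(step, weights)

state = load_latest()             # after a crash: resume here, not from zero
print(state["step"])  # 6
```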
Other problems: the model might memorize training data instead of learning patterns (overfitting), develop biases present in the training data, or exhibit "training instabilities" where the loss suddenly spikes. Managing these issues is part science, part engineering, part dark art.
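A loss spike can at least be detected automatically, so the run can roll back to a checkpoint. This sketch flags any step whose loss jumps far above the recent average; the window size and 2x threshold are arbitrary illustration values.

```python
def find_spike(losses, window=3, factor=2.0):
    # Flag the first step whose loss exceeds `factor` times the
    # average of the previous `window` steps.
    for i in range(window, len(losses)):
        recent_avg = sum(losses[i - window:i]) / window
        if losses[i] > factor * recent_avg:
            return i
    return None

losses = [2.1, 1.9, 1.8, 1.7, 9.5, 1.6]
print(find_spike(losses))  # 4 -- the step where the loss blew up
```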
Where to Go Next
- → Fine-Tuning — adapting pre-trained models for specific tasks
- → RLHF — training with human feedback
- → AI Benchmarks — how we measure if training worked
- → Deep Learning — the technology being trained