What It Is
A pre-trained language model can complete text, but it doesn't know how to be useful. Ask it a question, and it might continue writing as if it's completing a textbook page. Or it might generate harmful content, because such content exists in its training data.
RLHF (Reinforcement Learning from Human Feedback) solves this by using human judgments to teach the model what "good" behavior looks like. Humans rate model outputs, those ratings train a reward model, and the reward model guides reinforcement learning that adjusts the language model's behavior. The result: a model that's helpful, follows instructions, and avoids harmful outputs.
Why It Matters
RLHF is the single most important technique for turning a raw language model into a usable assistant. Without it, you'd have a model that can generate text but doesn't understand the difference between a helpful answer and a harmful one. RLHF is what makes the model say "I can't help with that" when asked to do something dangerous, and "Here's a step-by-step solution" when asked for legitimate help.
It's also a key part of AI safety. RLHF is how we encode human values and preferences into AI systems. It's imperfect — human preferences are inconsistent, and reward models can be gamed — but it's the best alignment technique we have at scale.
How It Works
RLHF has three stages:
Step 1: Supervised fine-tuning. Start with a pre-trained model. Fine-tune it on high-quality instruction-response pairs written by humans. This teaches the model the basic format: receive a question, give a helpful answer.
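The SFT objective is ordinary next-token prediction, restricted to the human-written response. A minimal numeric sketch, assuming we already have per-token log-probabilities from the model; the function name and toy values are illustrative, not part of any real training framework:

```python
def sft_loss(token_logprobs: list[float], prompt_len: int) -> float:
    """Supervised fine-tuning loss for one example: the average negative
    log-likelihood the model assigns to the response tokens. Prompt tokens
    are masked out -- we only teach the model to produce the answer."""
    response_logprobs = token_logprobs[prompt_len:]
    return -sum(response_logprobs) / len(response_logprobs)

# Two prompt tokens (masked), two response tokens with logprob -1.0 each:
loss = sft_loss([-5.0, -5.0, -1.0, -1.0], prompt_len=2)  # -> 1.0
```

Lowering this loss means the model assigns higher probability to the demonstrated responses, which is exactly "receive a question, give a helpful answer."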
Step 2: Train a reward model. Generate multiple responses to the same prompt. Have humans rank them from best to worst. Use these rankings to train a separate model — the reward model — that predicts how good a response is. The reward model learns human preferences: what's helpful, what's harmful, what's well-written.
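Rankings are usually consumed as pairs (chosen vs. rejected), and the reward model is trained with a pairwise Bradley-Terry-style loss: it is penalized whenever it scores the rejected response above the chosen one. A minimal numeric sketch with scalar scores standing in for the reward model's outputs:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss for reward-model training: low when the reward model
    scores the human-preferred response higher than the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

correct_ranking = preference_loss(2.0, -1.0)   # small loss
wrong_ranking = preference_loss(-1.0, 2.0)     # large loss
```

Minimizing this loss over many labeled pairs is what turns raw human rankings into a scoring function.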
Step 3: Optimize with RL. Use the reward model as a scoring function. Generate responses with the language model, score them with the reward model, and update the language model to produce higher-scoring responses. The algorithm typically used is PPO (Proximal Policy Optimization). A KL divergence penalty prevents the model from changing too much and losing its general abilities.
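The quantity PPO actually maximizes is the reward model's score minus the KL penalty. A common form uses a per-token estimate of the KL divergence (the policy's log-probability minus the reference SFT model's, summed over the sampled response). A minimal sketch, assuming the log-probabilities are given; the function name and `beta` value are illustrative:

```python
def shaped_reward(rm_score: float,
                  logprobs_policy: list[float],
                  logprobs_reference: list[float],
                  beta: float = 0.1) -> float:
    """Reward optimized in the RL stage: the reward model's score minus a
    KL penalty that keeps the policy close to the reference (SFT) model."""
    # Per-token KL estimate on the sampled response:
    # log pi(token) - log pi_ref(token), summed over tokens.
    kl_estimate = sum(p - r for p, r in zip(logprobs_policy, logprobs_reference))
    return rm_score - beta * kl_estimate

# A policy that drifts from the reference pays a penalty,
# even when the reward model's score is the same:
undrifted = shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0])  # -> 1.0
drifted = shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0])    # -> 0.9
```

The `beta` coefficient controls the trade-off: too low and the model over-optimizes the reward model (reward hacking); too high and it barely moves from the SFT model.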
Alternatives and Variations
DPO (Direct Preference Optimization): Skips the separate reward model. Instead, it directly optimizes the language model using preference data. Simpler, cheaper, and increasingly popular. Many open-source models now use DPO instead of full RLHF.
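The DPO loss works directly on preference pairs: the policy's log-probability ratios against a frozen reference model play the role of implicit rewards, so no separate reward model or RL loop is needed. A minimal numeric sketch on one pair, with the four log-probabilities assumed given:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair. Inputs are log-probabilities of the
    full chosen/rejected responses under the policy and the frozen reference."""
    # Implicit rewards: how much the policy has shifted from the reference.
    implicit_chosen = beta * (pi_chosen - ref_chosen)
    implicit_rejected = beta * (pi_rejected - ref_rejected)
    margin = implicit_chosen - implicit_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favoring the chosen response -> lower loss than favoring the rejected one:
good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

Because this is just a supervised loss over preference pairs, DPO trains with a standard gradient-descent loop, which is where the cost savings come from.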
Constitutional AI (Anthropic): Reduces dependence on human labelers. The model evaluates its own outputs against a set of written principles ("constitution") and improves based on that self-assessment. Still uses some RLHF, but less human labor.
RLAIF (RL from AI Feedback): Uses another AI model instead of humans to generate the preference data. Cheaper and faster, but you're training on AI judgments, which have their own biases and limitations.
Limitations
RLHF isn't perfect. Human evaluators disagree. The reward model can be "gamed" — the language model might learn to produce outputs that score well but aren't actually good (reward hacking). And RLHF can make models overly cautious, refusing legitimate requests because they superficially resemble harmful ones.
It's also expensive. Collecting high-quality human preferences requires trained annotators, and the RL training loop is computationally intensive. This is one reason DPO has gained traction — similar results with less overhead.
Where to Go Next
- → Reinforcement Learning — the RL in RLHF
- → Fine-Tuning — the supervised step before RL
- → AI Safety — why alignment matters
- → Large Language Models — the models being aligned