Reinforcement Learning from Human Feedback. The technique that makes language models actually useful as assistants. Humans rank model outputs by quality, a reward model learns these preferences, and the language model is fine-tuned to maximize the reward. How ChatGPT, Claude, and others learned to be helpful.
RLHF (Reinforcement Learning from Human Feedback) is the training technique that turns a raw language model into a useful assistant. After pre-training on text prediction, the model generates responses that are technically fluent but not necessarily helpful or safe. RLHF uses human preferences to teach the model what "good" responses look like.
The process has three steps. First, human raters compare pairs of model responses and pick which one is better. Second, these preferences train a reward model: a separate model that predicts how highly a human would rate any given response. Third, the language model is fine-tuned with reinforcement learning (specifically PPO, Proximal Policy Optimization) to maximize the reward model's score, while a KL-divergence penalty keeps the fine-tuned model close to the original so it doesn't drift into degenerate text that merely games the reward. The result is a model that generates responses humans actually prefer.
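Steps one and two above can be sketched in a few lines. This is a toy illustration, not production code: the "reward model" here is just a linear scorer over hand-made feature vectors rather than a neural network, and the data is invented. What it does show faithfully is the standard Bradley-Terry preference loss, -log sigmoid(r_chosen - r_rejected), which pushes the reward of the human-preferred response above the rejected one.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    # Linear reward model: score = w . x (a stand-in for a neural network head).
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit weights so that reward(chosen) > reward(rejected).

    `pairs` is a list of (chosen_features, rejected_features) tuples,
    one per human preference judgment (step 1 of the pipeline).
    The loss per pair is -log sigmoid(reward(chosen) - reward(rejected)).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            grad_scale = sigmoid(margin) - 1.0  # d(-log sigmoid(m)) / dm
            for i in range(dim):
                w[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return w

# Invented toy data: feature 0 loosely encodes "helpfulness"; raters prefer it higher.
pairs = [([0.9, 0.1], [0.2, 0.8]), ([0.7, 0.3], [0.1, 0.5])]
w = train_reward_model(pairs, dim=2)

# After training, the reward model ranks each chosen response above its rejected one.
assert all(reward(w, c) > reward(w, r) for c, r in pairs)
```

In step three, a real system would generate responses with the language model, score them with this reward model, and update the model with PPO; the preference-fitting loss above is the same regardless of the scorer's architecture.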
RLHF is what made the difference between GPT-3 (impressive but unwieldy) and ChatGPT (useful and approachable). It's also imperfect. The model can learn to produce responses that look good to the reward model without actually being better — a problem called reward hacking. Alternatives and improvements include DPO (Direct Preference Optimization), which skips the reward model entirely, and Constitutional AI, which reduces the need for human labelers. But the core idea — learning from human preferences — remains central to making AI systems that people actually want to use.
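To make the DPO idea concrete: instead of training a reward model, DPO computes a loss directly from the log-probabilities the policy and a frozen reference model assign to the chosen and rejected responses. The sketch below shows the per-pair loss; the log-probability values are made up for illustration, and in practice they would come from summing token log-probs of real models.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is the total log-probability a model assigns to a response;
    the `ref_*` values come from a frozen copy of the model taken before
    fine-tuning. No separate reward model is needed: the human preference
    signal enters the loss directly.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy prefers the chosen response more strongly than the reference
# does, the margin is positive and the loss is small; if it prefers the
# rejected response, the loss grows.
loss_good = dpo_loss(-10.0, -14.0, -12.0, -12.0)
loss_bad = dpo_loss(-14.0, -10.0, -12.0, -12.0)
assert loss_good < loss_bad
```

The `beta` parameter plays the role of the KL penalty in PPO-based RLHF: it controls how far the policy is allowed to move away from the reference model while chasing the preference signal.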
"The base model wrote technically correct but robotic text. After RLHF training with human preferences, it became conversational and actually helpful."
Related terms:
Reinforcement Learning: A learning approach where an agent learns by interacting with an environment and receiving rewards or penalties.
Reward Model: A model trained to predict how helpful, harmless, and honest a response is, based on human preferences.
Instruction Tuning: Fine-tuning a language model on datasets of instructions paired with appropriate responses.
Activation Function: A mathematical function applied to a neuron's output that introduces non-linearity into the network.
Adam: An optimization algorithm that combines ideas from two earlier methods, AdaGrad and RMSProp.
AGI: Artificial General Intelligence.