What It Is
In supervised learning, you give the model correct answers. In reinforcement learning (RL), there are no correct answers. There's just an agent, an environment, and a reward signal.
The agent takes actions in the environment. After each action, it receives a reward (positive or negative). Over time, it learns which actions lead to higher rewards. It's learning by experience, not instruction.
Think about how you learned to ride a bike. Nobody gave you a dataset of correct pedaling patterns. You just tried stuff, fell over, adjusted, and gradually figured it out. That's reinforcement learning in a nutshell.
Why It Matters
RL shines in situations where the optimal strategy isn't known in advance. Games, robotics, resource allocation, traffic control — problems where you need to make sequences of decisions and the consequences of early decisions only become clear later.
But the biggest impact of RL today is actually in language models. RLHF (Reinforcement Learning from Human Feedback) is how ChatGPT and Claude were trained to be helpful and safe. It's the RL component that turns a raw text predictor into a useful assistant.
How It Works
Every RL system has these components:
Agent: The AI that makes decisions. It could be a game-playing program, a robot, or a language model.
Environment: Whatever the agent interacts with. A chess board, a simulated world, a conversation.
State: The current situation. In chess, it's the board position. In a video game, it's the current screen frame.
Action: What the agent can do. Move a piece, turn left, generate a word.
Reward: Feedback after each action. Win the game: big positive reward. Lose: big negative. Capture a piece: small positive. The reward function defines what "good" means.
Policy: The agent's strategy — a mapping from states to actions. Training refines the policy to maximize cumulative reward over time, not just immediate reward. Sometimes sacrificing short-term gain for long-term benefit is the optimal play.
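The components above can be sketched in a few dozen lines. The following is a minimal, illustrative example, not any particular library's API: a tiny one-dimensional corridor environment (states 0 through 4, reward only at the goal) and a tabular Q-learning agent whose policy is derived from a table of estimated action values. All names here (step, Q, alpha, gamma, epsilon) are assumptions made for the demo.

```python
import random

N_STATES = 5          # states 0..4; reaching state 4 ends the episode
ACTIONS = [-1, +1]    # move left or move right

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = max(0, min(N_STATES - 1, state + action))
    if next_state == N_STATES - 1:
        return next_state, 1.0, True   # reward only at the goal
    return next_state, 0.0, False

# The policy is derived from a Q-table: Q[state][action_index]
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration

random.seed(0)
for episode in range(200):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = 0 if Q[state][0] >= Q[state][1] else 1
        next_state, reward, done = step(state, ACTIONS[a])
        # Q-learning update: move Q toward reward + discounted future value
        target = reward + gamma * max(Q[next_state])
        Q[state][a] += alpha * (target - Q[state][a])
        state = next_state

# The greedy policy should now be "always move right" (action index 1)
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(policy[:4])  # → [1, 1, 1, 1]
```

Note how the discount factor gamma makes the agent value future reward, which is what lets the reward at the goal propagate backward to earlier states.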
The tricky part is the exploration vs. exploitation tradeoff. Should the agent keep doing what it knows works (exploitation), or try something new that might be better (exploration)? Too much exploitation and it gets stuck in mediocre strategies. Too much exploration and it never capitalizes on what it's learned.
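A two-armed bandit makes the tradeoff concrete. In this toy sketch (payout probabilities and all names are invented for the demo), a purely greedy agent locks onto the worse arm because it never gathers evidence about the other one, while a small exploration rate discovers the better arm:

```python
import random

def pull(arm, rng):
    """Arm 0 pays out with probability 0.3, arm 1 with probability 0.7."""
    return 1.0 if rng.random() < (0.3, 0.7)[arm] else 0.0

def run(epsilon, steps=5000, seed=1):
    rng = random.Random(seed)
    counts, values = [0, 0], [0.0, 0.0]   # per-arm pull counts and mean reward
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(2)                    # explore: try anything
        else:
            arm = 0 if values[0] >= values[1] else 1  # exploit current estimate
        r = pull(arm, rng)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean
        total += r
    return total / steps

# Pure exploitation never tries arm 1, so it earns ~0.3 per step;
# 10% exploration finds the better arm and earns close to 0.7.
print(run(epsilon=0.0), run(epsilon=0.1))
```

The greedy run is the "stuck in a mediocre strategy" failure mode from the paragraph above: its estimate for the untried arm stays at zero, so it never has a reason to switch.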
Key Examples
AlphaGo (2016): DeepMind's system beat Lee Sedol, one of the world's best Go players. Go has more possible positions than atoms in the universe, so brute-force search won't work. AlphaGo used RL to discover strategies that surprised even expert players.
AlphaZero: Took AlphaGo further. Starting with nothing but the rules of chess, Go, and shogi, it played millions of games against itself and became superhuman at all three within hours.
Robotics: RL teaches robots to walk, grasp objects, and navigate environments. Boston Dynamics and DeepMind use RL for locomotion. The advantage: robots can learn skills in simulation and transfer them to the real world.
ChatGPT: RLHF is what turned GPT from a text generator into a helpful assistant. Human evaluators rated responses, those ratings trained a reward model, and RL optimized the language model against that reward.
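The reward-model step can be sketched with a heavily simplified toy. Here responses are stand-in feature vectors (hypothetical, made up for the demo), and a linear "reward model" is fit to pairwise preferences with the standard logistic (Bradley-Terry) loss, where the probability that the chosen response beats the rejected one is sigmoid(r_chosen - r_rejected). Real RLHF uses a neural reward model over text, but the fitting logic has this shape:

```python
import math

def reward(w, x):
    """Linear reward model: score a response's feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each pair: (features of the chosen response, features of the rejected one).
# Pretend feature 0 means "answers the question" and feature 1 means "filler".
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.2], [0.2, 0.9]),
    ([0.9, 0.1], [0.1, 1.0]),
]

w = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    for chosen, rejected in pairs:
        p = sigmoid(reward(w, chosen) - reward(w, rejected))
        # Gradient step on -log(p): push the chosen response's score up
        for i in range(2):
            w[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])

# The fitted model now prefers the "answers the question" style
print(reward(w, [1.0, 0.0]) > reward(w, [0.0, 1.0]))  # → True
```

Once fitted, a reward model like this supplies the reward signal for the RL step: the language model generates responses, the reward model scores them, and the policy is updated to score higher.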
Where to Go Next
- → RLHF — RL applied to language model alignment
- → AI Agents — autonomous systems that use RL principles
- → Machine Learning — the broader field
- → AI Safety — making sure RL agents do what we want