Family 4 of 4
Reinforcement learning.
Sequential decisions with reward signals. Trading, robotics, control systems, and increasingly recommender systems. Rare in production, irreplaceable when it fits.
Thesis
Reinforcement learning is the family for problems where you make a sequence of decisions and can measure the outcome. Classification models predict labels; RL models act. The policy is what decides what to do at each step.
Most companies will never deploy a Family 4 system directly. But you should be able to recognize when one is the right answer.
The mental model
State, action, reward, policy. The agent observes the state of the world, picks an action, gets a reward (or penalty), updates its policy, repeats. Over millions of episodes, the policy improves at maximizing cumulative reward.
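
The observe-act-reward loop described above can be sketched in a few lines. This is a minimal illustration, not a real library API: `CorridorEnv`, `run_episode`, and `always_right` are hypothetical names invented here, the reward values are arbitrary, and the policy-update step is left out so that only the interaction loop is visible.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is simply the agent's position

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped to the corridor)
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length - 1, self.position + move))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # small penalty per step, bonus at the goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the cumulative (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the policy decides what to do
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    """A fixed, hand-written policy; a learning agent would adjust this over time."""
    return 1


episode_return = run_episode(CorridorEnv(length=5), always_right)
```

A learning algorithm would sit between episodes (or between steps) and change `policy` based on the rewards it has seen.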
The defining property: you cannot supervise the right answer. No “correct” trade or robot motion is known in advance, so there are no labels to train on. The environment supplies signal through outcomes.
The major sub-families
- Value-based. Q-learning, DQN. Learn the value of each state-action pair.
- Policy gradient (including actor-critic). PPO, SAC, TD3. Directly optimize the policy, usually with a learned value function as a critic.
- Model-based. MuZero, DreamerV3. Learn a model of the environment, plan inside it.
- Multi-agent. Multiple RL agents interact. Equilibrium dynamics matter.
- RLHF and RLVR. RL applied to language models. The connection back to Family 3.
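
As a concrete instance of the value-based item in the list above, here is a minimal sketch of tabular Q-learning on a toy five-cell corridor. The environment, the function names (`step`, `train_q_table`), and the hyperparameter values are assumptions chosen for illustration; DQN replaces the table with a neural network but keeps the same update target.

```python
import random

N_STATES = 5      # corridor positions 0..4; position 4 is the goal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """Hypothetical corridor dynamics: returns (next_state, reward, done)."""
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else -0.1), done


def train_q_table(episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Learn Q(state, action) with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 200:
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next-state value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            steps += 1
    return q


q_table = train_q_table()
# The greedy policy reads the best action for each non-goal state off the table.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

Policy-gradient methods skip the table and adjust the policy's parameters directly; that variant is not shown here.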
The hidden connection to Family 3
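RLHF is RL applied to a language model: the model is the policy, and a reward model trained on human preference rankings supplies the reward. Reasoning models (o-series, DeepSeek-R1) take this further: they are trained with RL using verifiable rewards (RLVR), meaning rewards that can be checked automatically, such as a correct final answer or passing tests, to learn how to think. The frontier of LLM training is the frontier of RL.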
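
To make "verifiable reward" concrete, here is a rough sketch of a reward function for arithmetic problems. It is an assumption-laden illustration: `verifiable_reward` is a hypothetical name, the last-number extraction is deliberately crude, and production RLVR pipelines use their own answer formats and checkers.

```python
import re


def verifiable_reward(completion: str, expected_answer: str) -> float:
    """Return 1.0 if the last number in the completion equals the expected answer, else 0.0.

    Assumption: the model's final answer is the last number it writes.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0  # no answer found, no reward
    return 1.0 if float(numbers[-1]) == float(expected_answer) else 0.0


good = verifiable_reward("17 + 25 = 42. The answer is 42", "42")
bad = verifiable_reward("I think the answer is 41", "42")
```

No human or learned reward model is in the loop: the check is a program, which is what makes the reward cheap to compute at scale.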
The decision rule
| If your problem has... | Family 4? |
|---|---|
| A sequence of decisions over time | Maybe |
| A clear, measurable reward signal | Yes (necessary) |
| A simulator or cheap ability to try things | Yes |
| Single-shot prediction | No (Family 1, 2, or 3) |
| No measurable outcome | No |
| Cannot afford millions of trial episodes | Not online RL (consider offline RL or imitation learning) |
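
The table can be read as a short checklist. The function below is one possible encoding of it, written as a sketch; `family_4_fit` and its parameter names are invented here, and the order of the checks is a judgment call rather than something the table prescribes.

```python
def family_4_fit(sequential_decisions: bool,
                 measurable_reward: bool,
                 cheap_trials: bool) -> str:
    """Rough screening of a problem against the decision table above.

    cheap_trials: True if a simulator exists or real-world trials are cheap.
    """
    if not sequential_decisions:
        return "no: single-shot prediction, use Family 1, 2, or 3"
    if not measurable_reward:
        return "no: no measurable outcome"
    if not cheap_trials:
        return "no online RL: consider offline RL or imitation learning"
    return "candidate: Family 4"


verdict = family_4_fit(sequential_decisions=True,
                       measurable_reward=True,
                       cheap_trials=False)
```

Passing all three checks makes a problem a candidate, not a commitment; the section that follows still applies.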
When NOT to use it
RL is the hardest family to deploy. Reward design is harder than it looks: agents reliably find ways to maximize a misspecified reward without doing the intended task (“reward hacking”). Sample efficiency is poor, so training typically needs millions of episodes, which in practice means a high-fidelity simulator. Most decision-making problems can be recast as classification (Family 1) or generation (Family 3) and shipped faster.
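
A toy illustration of reward hacking, under stated assumptions: the intended task is to reach the end of a corridor, but the reward was (mis)specified as "distance moved per step". The two policies are hand-written rather than learned, and all names (`rollout`, `go_to_goal`, `pace_back_and_forth`) are hypothetical; the point is only that the proxy reward ranks the useless behavior higher.

```python
GOAL = 4      # the intended task: reach position 4
HORIZON = 20  # maximum steps per episode


def rollout(policy):
    """Return (proxy_return, reached_goal) for one fixed-horizon episode."""
    position, proxy_return = 0, 0.0
    for t in range(HORIZON):
        action = policy(position, t)  # -1 = left, +1 = right
        new_position = max(0, min(GOAL, position + action))
        # Misspecified reward: pays for movement, not for reaching the goal.
        proxy_return += abs(new_position - position)
        position = new_position
        if position == GOAL:
            return proxy_return, True  # the episode ends at the goal
    return proxy_return, False


def go_to_goal(position, t):
    """The behavior the designer wanted."""
    return 1


def pace_back_and_forth(position, t):
    """The behavior the proxy reward actually favors."""
    return 1 if t % 2 == 0 else -1


intended = rollout(go_to_goal)          # (4.0, True)
hacked = rollout(pace_back_and_forth)   # (20.0, False)
```

An optimizer trained on this reward has every incentive to pace forever, since finishing the task ends the episode and cuts off further reward.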
Named exemplars
- Game-playing. AlphaGo, AlphaZero, MuZero (DeepMind).
- Quant trading firms. Sequential trading decisions, often hybrid RL plus classical.
- Robotics. Boston Dynamics, Tesla Optimus. Locomotion and manipulation.
- Algorithm discovery. AlphaTensor, AlphaDev.
- RLHF in chat models. ChatGPT, Claude, Gemini. RL is used during training, not at inference time.
- Datacenter cooling at Google. RL controls cooling set-points.
