Family 4 of 4

Reinforcement learning.

Sequential decisions with reward signals. Trading, robotics, control systems, and increasingly recommender systems. Rare in production, irreplaceable when it fits.

Thesis

Reinforcement learning is the family for problems where you make a sequence of decisions and can measure the outcome. Classification models predict labels; RL models act. The policy is what decides what to do at each step.

Most companies will never deploy a Family 4 system directly. But you should be able to recognize when one is the right answer.

The mental model

State, action, reward, policy. The agent observes the state of the world, picks an action, gets a reward (or penalty), updates its policy, repeats. Over millions of episodes, the policy improves at maximizing cumulative reward.

The defining property: you cannot label the right answer in advance. There is no “correct” trade or robot motion to supervise against. The environment supplies signal through outcomes.
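A minimal sketch of the loop described above, assuming a hypothetical five-cell corridor as the environment. `Corridor`, `run_episode`, and both policies are illustrative names, not part of any library. It shows only the observe-act-reward cycle; the policy-update step is deliberately left out.

```python
import random


class Corridor:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = step left, 1 = step right (clamped to the corridor)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small cost per step, bonus at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the cumulative reward."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the policy decides
        state, reward, done = env.step(action)  # the environment responds
        total += reward
        if done:
            break
    return total


def always_right(state):
    return 1


def random_policy(state):
    return random.choice([0, 1])
```

Calling `run_episode(Corridor(), always_right)` and `run_episode(Corridor(), random_policy)` side by side shows the gap in cumulative reward that a learning algorithm would try to close.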

The major sub-families

Value-based. Q-learning, DQN. Learn the value of each state-action pair.
Policy gradient and actor-critic. PPO, SAC, TD3. Optimize the policy directly, usually alongside a learned value estimate.
Model-based. MuZero, Dreamer-V3. Learn a model of the environment, plan inside it.
Multi-agent. Multiple RL agents interact. Equilibrium dynamics matter.
RLHF and RLVR. RL applied to language models. The connection back to Family 3.
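To make the value-based row concrete, here is a hedged sketch of tabular Q-learning on a hypothetical five-cell corridor. The environment, the function names, and the hyperparameters (`alpha`, `gamma`, `epsilon`, episode count) are assumptions chosen for illustration, not tuned or recommended values.

```python
import random

N_STATES = 5      # cells 0..4; reaching cell 4 ends the episode
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """Toy dynamics: move one cell, pay -0.1 per step, earn +1.0 at the goal."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else -0.1), done


def train_q_table(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # cap episode length
            # Epsilon-greedy: mostly exploit, sometimes explore.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def greedy_policy(q):
    """Best action per non-terminal state according to the learned table."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

On this toy problem, `greedy_policy(train_q_table())` should pick "right" in every cell. DQN replaces the table with a neural network but keeps the same update target.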

The hidden connection to Family 3

Worth knowing: RLHF (reinforcement learning from human feedback) is what made ChatGPT useful. The helpfulness of Family 3 is powered by Family 4.

Reasoning models (o-series, DeepSeek-R1) take this further: they are trained with RL using verifiable rewards (RLVR) to learn how to think. The frontier of LLM training is the frontier of RL.
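What "verifiable" means can be sketched in a few lines. This assumes a hypothetical convention where the model ends its output with an `Answer: ...` line and the task has a known ground-truth string. The function name and the format are illustrative; the source does not describe the reward used by any specific model.

```python
import re


def verifiable_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the last 'Answer: ...' line matches expected, else 0.0."""
    # Assumed output format: the final answer appears on a line as "Answer: <value>".
    matches = re.findall(r"Answer:\s*(.+)", completion)
    if not matches:
        return 0.0  # no parsable answer, no reward
    return 1.0 if matches[-1].strip() == expected.strip() else 0.0
```

The point of the design is that the check is programmatic: no human rater and no learned reward model sits between the output and the score.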

The decision rule

If your problem has...	Family 4?
A sequence of decisions over time	Maybe
A clear, measurable reward signal	Yes (necessary)
A simulator or cheap ability to try things	Yes
Single-shot prediction	No (Family 1, 2, or 3)
No measurable outcome	No
Cannot afford millions of trial episodes	Not online RL (consider offline RL or imitation learning)

When NOT to use it

RL is the hardest family to deploy. Reward design is harder than it looks. Agents reliably find ways to maximize a misspecified reward (“reward hacking”). Sample efficiency is poor: you typically need millions of episodes, which in practice means a high-fidelity simulator. Most decision-making problems can be re-cast as classification (Family 1) or generation (Family 3) and shipped faster.
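Reward hacking is easy to reproduce on a toy problem. The sketch below assumes a hypothetical 1-D track where the designer wants the agent to reach cell 4 but pays +1 for any movement. All names and numbers are illustrative.

```python
def rollout(policy, steps=20, goal=4):
    """Return (proxy_reward, reached_goal) for a policy on a 1-D track."""
    pos, proxy = 0, 0.0
    for t in range(steps):
        new_pos = min(max(pos + policy(pos, t), 0), goal)
        if new_pos != pos:
            proxy += 1.0  # misspecified reward: pays for movement, not progress
        pos = new_pos
        if pos == goal:
            return proxy, True
    return proxy, False


def go_right(pos, t):
    return 1  # the behavior the designer intended


def oscillate(pos, t):
    return 1 if t % 2 == 0 else -1  # shuffles back and forth forever


print(rollout(go_right))   # (4.0, True)
print(rollout(oscillate))  # (20.0, False)
```

The oscillating policy earns five times the proxy reward and never reaches the goal. An optimizer trained on this reward would be right to prefer it.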

Named exemplars

Game-playing. AlphaGo, AlphaZero, MuZero (DeepMind).
Quant trading firms. Sequential trading decisions, often hybrid RL plus classical.
Robotics. Boston Dynamics, Tesla Optimus. Locomotion and manipulation.
Algorithm discovery. AlphaTensor, AlphaDev.
RLHF in chat models. ChatGPT, Claude, Gemini. RL is used in training, not in deployment.
Datacenter cooling at Google. RL controls cooling set-points.

Hiring signal: RL engineers are scarce and expensive. If a startup pitches you “we use RL to optimize X” without a clear simulator and a verifiable reward, ask what their reward function looks like and how they measure reward hacking. The answer separates real RL teams from marketing.