Family 4 of 4

Reinforcement learning.

Sequential decisions with reward signals. Trading, robotics, control systems, increasingly recsys. Rare in production, irreplaceable when it fits.

Thesis

Reinforcement learning is the family for problems where you make a sequence of decisions and can measure the outcome. Classification models predict labels; RL models act. The policy is the learned rule that picks an action at each step.

Most companies will never deploy a Family 4 system directly. But you should be able to recognize when one is the right answer.

The mental model

State, action, reward, policy. The agent observes the state of the world, picks an action, gets a reward (or penalty), updates its policy, repeats. Over millions of episodes, the policy improves at maximizing cumulative reward.
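The loop above can be sketched in a few lines. This is a minimal illustration using tabular Q-learning, one of many RL algorithms, on a made-up five-cell corridor. The environment, the reward values, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are all assumptions chosen for readability, not recommendations.

```python
import random

N_STATES = 5          # hypothetical toy world: corridor cells 0..4, cell 4 is the goal
ACTIONS = (-1, +1)    # step left or step right

def step(state, action):
    """Environment: apply the action, return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.01   # small penalty per step, reward at the goal
    return next_state, reward, done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Policy: mostly greedy on current estimates, sometimes explore.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda i: q[state][i])
            next_state, reward, done = step(state, ACTIONS[a])
            # Update: move the estimate toward reward + discounted future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

def greedy_policy(q):
    """Read the learned policy out of the table: best action per non-goal cell."""
    return [ACTIONS[max((0, 1), key=lambda i: q[s][i])] for s in range(N_STATES - 1)]
```

Calling `greedy_policy(train())` returns one action per non-goal cell. Nothing in the code tells the agent that "right" is correct; that preference only emerges from the rewards it collects.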

The defining property: you cannot supervise the right answer. There is no “correct” trade or robot motion in advance. The environment supplies signal through outcomes.

The major sub-families

The hidden connection to Family 3

Worth knowing: RLHF (reinforcement learning from human feedback) is what made ChatGPT useful. The helpfulness of Family 3 is powered by Family 4.

Reasoning models (o-series, DeepSeek-R1) take this further: they are trained with RL using verifiable rewards (RLVR) to learn how to think. The frontier of LLM training is the frontier of RL.
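To make "verifiable reward" concrete: the reward comes from a program that checks the answer, not from a human rater or a learned preference model. The sketch below is an assumption about what such a checker could look like for arithmetic problems. The function names and the "last number in the text is the answer" convention are hypothetical and do not describe any lab's actual pipeline.

```python
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Take the last number in the completion as the answer (hypothetical convention)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the extracted answer equals the known result, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0   # no parseable answer earns no reward
    return 1.0 if float(answer) == float(expected) else 0.0
```

A checker like this only scores the final answer. The intermediate reasoning is left for the policy to discover, which is the sense in which the model learns how to think.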

The decision rule

If your problem has...                        Family 4?
A sequence of decisions over time             Maybe
A clear, measurable reward signal             Yes (necessary)
A simulator or cheap ability to try things    Yes
Single-shot prediction                        No (Family 1, 2, or 3)
No measurable outcome                         No
Cannot afford millions of trial episodes      No (use offline RL or imitation learning)
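The same rule can be written as a short checklist function. This is a sketch of one possible encoding: the `Problem` fields and the returned strings are hypothetical names for the rows above, and the order of the checks is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    sequential_decisions: bool   # a sequence of decisions over time
    measurable_reward: bool      # a clear, measurable reward signal
    cheap_trials: bool           # a simulator or cheap ability to try things

def family4_fit(p: Problem) -> str:
    """Apply the decision rule row by row and return a short verdict."""
    if not p.sequential_decisions:
        return "no: single-shot prediction, use Family 1, 2, or 3"
    if not p.measurable_reward:
        return "no: no measurable outcome to optimize"
    if not p.cheap_trials:
        return "no: consider offline RL or imitation learning"
    return "candidate for Family 4"
```

For example, `family4_fit(Problem(True, True, False))` points toward offline RL or imitation learning, matching the last row of the rule.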

When NOT to use it

RL is the hardest family to deploy. Reward design is harder than it looks: agents reliably find ways to maximize a misspecified reward (“reward hacking”). Sample efficiency is poor, so you need either millions of real episodes or a high-fidelity simulator to generate them. Most decision-making problems can be re-cast as classification (Family 1) or generation (Family 3) and shipped faster.
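Reward hacking is easy to reproduce in a toy setting. In the sketch below, the designer wants the agent to reach the finish line but pays a reward every time it lands on a checkpoint tile. The track, the horizon, and the reward are invented for illustration, and brute-force search over all plans stands in for a trained agent.

```python
from itertools import product

TRACK_END = 3      # finish line tile (hypothetical toy track: tiles 0..3)
CHECKPOINT = 1     # tile that pays out every time the agent lands on it
HORIZON = 6        # steps per episode

def rollout(plan):
    """Run a fixed sequence of moves; return (proxy_reward, finished)."""
    pos, proxy = 0, 0
    for move in plan:                      # move is +1 (forward) or -1 (back)
        pos = max(pos + move, 0)           # cannot move behind the start tile
        if pos == CHECKPOINT:
            proxy += 1                     # misspecified: pays on every visit
        if pos == TRACK_END:
            return proxy, True             # episode ends at the finish line
    return proxy, False

# Exhaustive search over all plans plays the role of a perfect reward maximizer.
plans = list(product((+1, -1), repeat=HORIZON))
best_plan = max(plans, key=lambda p: rollout(p)[0])
proxy_reward, finished = rollout(best_plan)
print(best_plan, proxy_reward, finished)
```

In this setup the reward-maximizing plan loops over the checkpoint and never finishes: it collects 3 checkpoint rewards, while no plan that reaches the finish line scores more than 2. The reward function is being optimized correctly; it simply does not describe what the designer wanted.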

Named exemplars

Hiring signal: RL engineers are scarce and expensive. If a startup pitches you “we use RL to optimize X” without a clear simulator and a verifiable reward, ask what their reward function looks like and how they measure reward hacking. The answer separates real RL teams from marketing.
Mario Deubler
