Family 2 of 4

Specialist deep learning.

One specialist model per modality. Vision, time series, speech, recommenders, anomalies. The family that quietly powers products you use every day.

Thesis

Family 2 is a collection of deep learning models, each engineered for a specific modality. CNNs for images. YOLO for object detection. Whisper for speech. TFT for time series. They beat LLMs and VLMs on their home turf because they were built for it: faster, cheaper, and usually more accurate at the modality-specific task.

Myth: “Once we have multimodal LLMs, we won’t need specialists.”

Reality: Specialists win wherever real-time, edge, or cost-per-inference matters. That is most of production.

The major sub-families

Computer vision

YOLO (object detection), U-Net and SAM 2 (segmentation), DINOv2 (universal vision features), RT-DETR and RF-DETR (modern transformer-based detection). YOLO ships real-time object detection on consumer hardware and runs fully on-device. RF-DETR reaches 54.7% mAP at under 5ms latency on COCO. Hosted vision-LLM calls take hundreds of milliseconds per image and send the image off-device.
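To make the latency gap concrete, here is a small back-of-the-envelope sketch. The 5 ms figure is the upper bound quoted above; the 300 ms hosted latency and the 30 fps camera are assumed illustrative values, not measurements, and the function names are hypothetical.

```python
def max_sequential_fps(latency_ms: float) -> float:
    """Frames per second achievable if frames are processed one at a time."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def fits_frame_budget(latency_ms: float, camera_fps: float) -> bool:
    """True if one inference finishes before the next camera frame arrives."""
    return latency_ms <= 1000.0 / camera_fps


ON_DEVICE_LATENCY_MS = 5.0   # upper bound quoted in the text for RF-DETR
HOSTED_LATENCY_MS = 300.0    # assumption: stand-in for "hundreds of milliseconds"
CAMERA_FPS = 30.0            # assumption: a typical video frame rate

on_device_fps = max_sequential_fps(ON_DEVICE_LATENCY_MS)
hosted_fps = max_sequential_fps(HOSTED_LATENCY_MS)
```

Under these assumptions the on-device detector keeps up with the camera and the hosted call does not, before network variance is even considered.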

Time series and forecasting

Classical baselines: ARIMA, Prophet, exponential smoothing. Modern neural: N-BEATS, TFT, TiDE. Foundation models for time series: TimesFM, Chronos, Lag-Llama, Moirai. Throwing GPT-4 at a forecast is a beginner mistake; temporal data has structure (autocorrelation, seasonality) that general-purpose text LLMs are not designed to exploit.
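The structure mentioned above can be measured directly. The sketch below computes sample autocorrelation and a seasonal-naive forecast on a made-up weekly series; the data and function names are illustrative assumptions, and a seasonal-naive forecast is only the simplest baseline, not any of the models named above.

```python
from statistics import fmean


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = fmean(series)
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0  # a constant series carries no correlation signal
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


def seasonal_naive_forecast(series, season_length, horizon):
    """Forecast by repeating the last fully observed season."""
    if len(series) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = series[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


# Made-up daily demand with a weekend spike, repeated for four weeks.
weekly_pattern = [10, 12, 14, 13, 15, 30, 28]
history = weekly_pattern * 4

acf_week = autocorrelation(history, 7)  # strong: same weekday, one week apart
acf_three = autocorrelation(history, 3)  # weak or negative: mismatched weekdays
forecast = seasonal_naive_forecast(history, season_length=7, horizon=3)
```

A high autocorrelation at the seasonal lag is the signal that forecasting models are built to exploit; if a candidate model cannot beat the seasonal-naive baseline, it is not earning its cost.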

Recommender systems

Two-tower retrieval returns top items in single-digit milliseconds across catalogs of 10M+ items. That cost profile is not what LLMs are built for. SASRec and BERT4Rec handle sequential recommendation. The new generative-recsys frontier is HSTU (Meta), which uses a transformer architecture without being an LLM.
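The scoring step of two-tower retrieval is simple enough to sketch. The example below assumes the user and item towers have already produced embeddings and ranks items by dot product with a brute-force scan; the item names and vectors are invented, and a production system at the catalog sizes above would typically use an approximate nearest-neighbor index rather than this loop.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(user_vec, item_vecs, k):
    """Return the ids of the k items whose embeddings score highest.

    item_vecs maps item_id -> embedding (output of the item tower).
    user_vec is the output of the user tower for one request.
    """
    scored = ((dot(user_vec, vec), item_id) for item_id, vec in item_vecs.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


# Hypothetical precomputed item embeddings (2-D only for readability).
item_embeddings = {
    "jazz_playlist": [0.9, 0.1],
    "metal_album": [0.1, 0.9],
    "lofi_mix": [0.7, 0.3],
}
user_embedding = [1.0, 0.0]  # hypothetical user-tower output

top_items = retrieve_top_k(user_embedding, item_embeddings, k=2)
```

Because item embeddings are computed offline, the per-request work is one user-tower pass plus a vector search, which is what keeps the latency low.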

Speech (ASR + TTS)

Whisper large-v3 and its Turbo variant democratized speech-to-text. Modern TTS: VITS, XTTS, F5-TTS (flow matching). A voice agent is a pipeline: ASR, then an LLM, then TTS.
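The voice-agent pipeline can be sketched as three swappable stages. Everything below is a hypothetical skeleton: the `VoiceAgent` class and the stub stages are stand-ins for illustration, and no real ASR, LLM, or TTS library is called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAgent:
    """Chains speech-to-text, a language model, and text-to-speech."""
    asr: Callable[[bytes], str]   # audio in  -> transcript
    llm: Callable[[str], str]     # transcript -> reply text
    tts: Callable[[str], bytes]   # reply text -> audio out

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.asr(audio_in)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)


# Stub stages so the sketch runs without any models installed.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(asr=fake_asr, llm=fake_llm, tts=fake_tts)
reply_audio = agent.respond(b"hello")
```

Keeping each stage behind a plain callable means a specialist ASR or TTS model can be swapped without touching the rest of the pipeline.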

Anomaly detection

Isolation Forest, autoencoders, statistical control charts. Used for fraud, intrusion, and manufacturing-defect detection. Often hybridized with Family 1 baselines.
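Of the techniques listed, the statistical control chart is the easiest to show in a few lines. The sketch below assumes a classic three-sigma rule estimated from a clean baseline window; the sensor readings and function names are made up for illustration.

```python
from statistics import fmean, stdev


def control_limits(baseline, n_sigma=3.0):
    """Lower and upper limits from a baseline assumed to be anomaly-free."""
    center = fmean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - n_sigma * spread, center + n_sigma * spread


def flag_anomalies(values, lower, upper):
    """Indices of readings that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Made-up readings from a process running normally.
baseline_readings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
lower, upper = control_limits(baseline_readings)

# New readings to monitor; two are far outside the normal band.
new_readings = [10.1, 9.9, 12.5, 10.0, 7.0]
anomalies = flag_anomalies(new_readings, lower, upper)
```

A chart like this makes a cheap first filter; learned detectors such as autoencoders can then be reserved for cases where a single threshold is not expressive enough.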

The decision rule

If your problem has...	Family 2?
Real-time latency requirement (<100ms)	Yes
Edge / on-device deployment (privacy, offline)	Yes
High volume (millions of inferences per day)	Yes
Modality-specific task (vision, speech, time, recsys)	Yes
Need to handle open-ended natural-language input	No (Family 3)
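The table can be read as a small rule set. The sketch below encodes it under stated assumptions: the 100 ms threshold comes from the table, the one-million-per-day cutoff is an assumed reading of "millions", and the handling of mixed cases is one possible interpretation. The `Problem` class and `recommend_family` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    latency_budget_ms: Optional[float] = None  # None means no hard requirement
    on_device: bool = False
    inferences_per_day: int = 0
    modality_specific: bool = False
    open_ended_language_input: bool = False


def recommend_family(p: Problem) -> str:
    """Map the decision-rule table onto a recommendation string."""
    family2_signals = [
        p.latency_budget_ms is not None and p.latency_budget_ms < 100,
        p.on_device,
        p.inferences_per_day >= 1_000_000,  # assumption: cutoff for "millions"
        p.modality_specific,
    ]
    if p.open_ended_language_input:
        # Assumption: mixed cases are served by chaining both families.
        return "Family 2 + Family 3 pipeline" if any(family2_signals) else "Family 3"
    return "Family 2" if any(family2_signals) else "No clear Family 2 signal"


defect_camera = Problem(latency_budget_ms=30, on_device=True, modality_specific=True)
support_chat = Problem(open_ended_language_input=True)
voice_agent = Problem(modality_specific=True, open_ended_language_input=True)
```

Here `defect_camera` maps to Family 2, `support_chat` to Family 3, and `voice_agent` to a combined pipeline.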

When NOT to use it

Family 2 specialists are narrow. A YOLO model trained to detect cars won’t find dogs. If your input is messy, multi-modal, or open-ended, you may need Family 3 or a Family 2 + Family 3 pipeline (a voice agent is a classic example: ASR feeds an LLM, which feeds TTS).

Named exemplars

Tesla Autopilot. Multiple specialist CV models running at ~30Hz on the in-vehicle compute.
Spotify recommendations. Two-tower retrieval plus ranking models. Family 2 on a billion-item catalog.
Amazon next-day forecasts. Tree of specialist models per warehouse and product category.
Apple Voice Memos transcription. Whisper-class on-device ASR.

Common trap

Teams replace working Family 2 systems with LLM pipelines and ship slower, more expensive products. The LLM is rarely more accurate on the original task. Audit before you migrate.