Family 2 of 4
Specialist deep learning.
One specialist model per modality. Vision, time series, speech, recommenders, anomalies. The family that quietly powers products you use every day.
Thesis
Family 2 is a collection of deep learning models, each engineered for a specific modality. CNNs for images. YOLO for object detection. Whisper for speech. Temporal Fusion Transformer (TFT) for time series. They beat LLMs and vision-language models (VLMs) on their home turf because they were built for it: faster, cheaper, and usually more accurate at the modality-specific task.
The major sub-families
Computer vision
YOLO (object detection), U-Net and SAM 2 (segmentation), DINOv2 (universal vision features), RT-DETR and RF-DETR (modern transformer-based detection). YOLO delivers real-time object detection on consumer hardware and runs fully on-device. RF-DETR reaches 54.7% mAP on COCO at under 5 ms latency. By contrast, hosted vision-LLM calls take hundreds of milliseconds per image and send the image off-device.
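The latency gap translates directly into a frame-rate ceiling. A minimal sketch of that arithmetic, assuming frames are processed one at a time. The function names are hypothetical, and the 300 ms hosted figure is an illustrative stand-in for "hundreds of milliseconds", not a measurement.

```python
def max_fps(latency_ms: float) -> float:
    """Upper bound on frames per second when each frame takes latency_ms."""
    return 1000.0 / latency_ms


def fits_budget(latency_ms: float, target_fps: float) -> bool:
    """True if one inference finishes within a single frame interval."""
    return latency_ms <= 1000.0 / target_fps


# 5 ms comes from the RF-DETR figure in the text; 300 ms is an assumed example.
on_device_ms = 5.0
hosted_ms = 300.0

for name, ms in [("on-device detector", on_device_ms), ("hosted vision-LLM", hosted_ms)]:
    ok = fits_budget(ms, target_fps=30)
    print(f"{name}: ceiling {max_fps(ms):.1f} fps, fits 30 fps budget: {ok}")
```

At 30 frames per second each frame has roughly 33 ms, so a 5 ms model leaves headroom while a network round trip in the hundreds of milliseconds does not.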
Time series and forecasting
Classical baselines: ARIMA, Prophet, exponential smoothing. Modern neural: N-BEATS, TFT, TiDE. Time-series foundation models: TimesFM, Chronos, Lag-Llama, Moirai. Throwing GPT-4 at a forecast is a beginner mistake; temporal data has structure (autocorrelation, seasonality) that general-purpose text LLMs are not designed to exploit.
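To make "autocorrelation" and "seasonality" concrete, here is a small sketch using only the standard library. The `weekly` series is invented toy data, and the seasonal-naive forecast is a deliberately simple baseline chosen for illustration, not one of the models named above.

```python
from statistics import fmean


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    mean = fmean(series)
    var = sum((x - mean) ** 2 for x in series)
    cov = sum(
        (series[t] - mean) * (series[t - lag] - mean)
        for t in range(lag, len(series))
    )
    return cov / var


def seasonal_naive(series, season, horizon):
    """Forecast by repeating the last observed season."""
    start = len(series) - season
    return [series[start + (h % season)] for h in range(horizon)]


# Toy data: four identical weeks with a weekend spike (hypothetical values).
weekly = [10, 12, 11, 13, 15, 30, 28] * 4

print(autocorrelation(weekly, 7))    # strong positive: same weekday lines up
print(autocorrelation(weekly, 3))    # negative: weekdays paired with weekend
print(seasonal_naive(weekly, 7, 3))  # repeats the first three days of the week
```

The lag-7 autocorrelation is the kind of structure that forecasting models encode explicitly, which is why even a trivial seasonal baseline is worth running before anything larger.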
Recommender systems
Two-tower retrieval returns top items in single-digit milliseconds across catalogs of 10M+ items. That cost profile is not what LLMs are built for. SASRec and BERT4Rec handle sequential recommendation. The new generative-recsys frontier is HSTU (Meta), which uses a transformer architecture but is not an LLM.
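The scoring step of two-tower retrieval can be sketched as a dot product between a user embedding and precomputed item embeddings. This is a toy illustration: the embeddings and item names are made up, the towers themselves (the networks that produce the vectors) are omitted, and the brute-force scan stands in for the approximate nearest-neighbor index a large catalog would typically need.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(user_vec, item_vecs, k):
    """Return the k item ids whose embeddings score highest against user_vec."""
    return heapq.nlargest(
        k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id])
    )


# Hypothetical item-tower outputs, computed offline and stored.
item_embeddings = {
    "jazz_album": [0.9, 0.1, 0.0],
    "rock_album": [0.1, 0.9, 0.0],
    "podcast": [0.0, 0.2, 0.9],
}

# Hypothetical user-tower output, computed at request time.
user_embedding = [0.8, 0.3, 0.1]

print(retrieve(user_embedding, item_embeddings, k=2))
```

Because item vectors are computed ahead of time, the request path only runs the user tower once and then does cheap vector comparisons, which is where the millisecond-scale cost profile comes from.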
Speech (ASR + TTS)
Whisper large-v3 and its Turbo variant democratized speech-to-text. Modern TTS: VITS, XTTS, F5-TTS (flow matching). A voice agent is a pipeline: ASR, then an LLM, then TTS.
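The pipeline shape of a voice agent can be sketched with three swappable stages. Everything here is hypothetical: `run_voice_turn` and the `fake_*` stubs are placeholders that only pass text through, standing in for a real ASR model, LLM, and TTS model.

```python
from typing import Callable


def run_voice_turn(
    audio: bytes,
    asr: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """One conversational turn: speech in, speech out."""
    transcript = asr(audio)   # specialist model: audio -> text
    reply = llm(transcript)   # language model: text -> text
    return tts(reply)         # specialist model: text -> audio


# Stub stages so the sketch runs without any models installed.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


print(run_voice_turn(b"hello", fake_asr, fake_llm, fake_tts))
```

Passing the stages in as callables keeps each one independently replaceable, so the ASR or TTS model can be upgraded without touching the rest of the turn logic.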
Anomaly detection
Isolation Forest, autoencoders, statistical control charts. Used for fraud, intrusion detection, and manufacturing defects. Often hybridized with Family 1 baselines.
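Of the three techniques, the statistical control chart is the simplest to show. A minimal sketch, assuming a three-sigma rule around the mean of a known-good baseline; the readings are invented, and the function names are hypothetical.

```python
from statistics import fmean, pstdev


def control_limits(baseline, n_sigma=3.0):
    """Lower and upper control limits from a known-good baseline."""
    mu = fmean(baseline)
    sigma = pstdev(baseline)
    return mu - n_sigma * sigma, mu + n_sigma * sigma


def flag_anomalies(values, limits):
    """Indices of values that fall outside the control limits."""
    lower, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical sensor readings: a stable baseline, then new data with one spike.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
new_readings = [101, 99, 140, 100]

limits = control_limits(baseline)
print(limits)
print(flag_anomalies(new_readings, limits))  # index of the spike
```

A check like this is cheap enough to run on every reading, which makes it a reasonable first filter before a heavier model such as an autoencoder.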
The decision rule
| If your problem has... | Family 2? |
|---|---|
| Real-time latency requirement (<100ms) | Yes |
| Edge / on-device deployment (privacy, offline) | Yes |
| High volume (millions of inferences per day) | Yes |
| Modality-specific task (vision, speech, time, recsys) | Yes |
| Need to handle open-ended natural-language input | No (Family 3) |
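The table can be read as a simple checklist. The sketch below encodes it as a function; the `Problem` fields and the `suggest_family` name are hypothetical, the one-million threshold is one reading of "millions of inferences per day", and the combined "pipeline" answer is an assumption about how the rows interact rather than something the table states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    latency_budget_ms: Optional[float] = None  # None means no hard requirement
    on_device: bool = False
    inferences_per_day: int = 0
    modality_specific: bool = False
    open_ended_language_input: bool = False


def suggest_family(p: Problem) -> str:
    """Map the decision-rule table onto a suggestion string."""
    family2_signals = [
        p.latency_budget_ms is not None and p.latency_budget_ms < 100,
        p.on_device,
        p.inferences_per_day >= 1_000_000,
        p.modality_specific,
    ]
    wants_family2 = any(family2_signals)
    if p.open_ended_language_input:
        # Assumption: mixed signals point to a combined pipeline.
        return "Family 2 + Family 3 pipeline" if wants_family2 else "Family 3"
    return "Family 2" if wants_family2 else "undecided"


print(suggest_family(Problem(latency_budget_ms=50, on_device=True)))
print(suggest_family(Problem(open_ended_language_input=True)))
```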
When NOT to use it
Family 2 specialists are narrow. A YOLO model trained to detect cars won’t find dogs. If your input is messy, multi-modal, or open-ended, you may need Family 3 or a Family 2 + Family 3 pipeline (a voice agent is the classic example: ASR feeds an LLM, which feeds TTS).
Named exemplars
- Tesla Autopilot. Multiple specialist CV models running at ~30Hz on the in-vehicle compute.
- Spotify recommendations. Two-tower retrieval plus ranking models. Family 2 on a catalog of 100M+ tracks.
- Amazon next-day forecasts. Tree of specialist models per warehouse and product category.
- Apple Voice Memos transcription. Whisper-class on-device ASR.
