Production
Why most AI projects fail.
95% of enterprise AI projects never reach production. The reasons are predictable, and they are about engineering, not the model.
The 90% problem
A working demo is roughly 10% of an AI product. The remaining 90% is the gap between “works in this notebook” and “works in production for every user, every input, every day.” Demos show the cherry-picked case. Production sees every weird input. The fix is not a smarter model. It is engineering discipline.
The three structural reasons
- No evals. The team never defined what “good” looks like. Without an eval set, every change is blind. Every regression slips through. Every tuning attempt is superstition.
- Cost surprises. An agent runs in a loop, calls tools 12 times, and burns €40 on one user. Multiplied across users, that’s a margin disaster. Most teams discover this at scale, not before launch.
- No traces. When a user reports a bad output, the team has no way to reconstruct what happened. Without observability, you can’t debug. If you can’t debug, you can’t improve.
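The cost surprise is the easiest of the three to guard against in code. A minimal sketch of a per-run budget on an agent loop, assuming a hypothetical `step` callable that stands in for one model or tool call and returns its cost in euros plus a done flag. The limits are illustrative, not recommendations.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class BudgetExceeded(Exception):
    """Raised when a single agent run exceeds its step or cost budget."""


@dataclass
class RunBudget:
    max_steps: int = 8          # illustrative cap on calls per run
    max_cost_eur: float = 0.50  # illustrative cap on spend per run


def run_agent(step: Callable[[int], Tuple[float, bool]], budget: RunBudget) -> float:
    """Run `step` until it reports done; return the total cost in euros.

    `step(i)` is a hypothetical stand-in for one model or tool call.
    It returns (cost_eur, done).
    """
    total = 0.0
    for i in range(budget.max_steps):
        cost, done = step(i)
        total += cost
        if total > budget.max_cost_eur:
            raise BudgetExceeded(f"spent €{total:.2f} after {i + 1} steps")
        if done:
            return total
    raise BudgetExceeded(f"no result after {budget.max_steps} steps")
```

With a cap like this, a runaway loop fails loudly on one user instead of quietly across all of them.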
The production triad
Three things every serious AI team has, and every failing AI team lacks. If a team you fund cannot show you all three, they are building a demo, not a product.
1. Eval. Define it.
A golden dataset of inputs and expected outputs. Test cases that catch regressions. An LLM-as-judge, with its verdicts audited on hard cases. Without evals, every change is a guess.
The discipline: you cannot ship a feature whose definition of “working” you have not written down. For an LLM-driven feature, that means writing the eval set before the prompt.
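A golden dataset can start very small. The sketch below assumes a hypothetical `feature` function under test and exact-match scoring. The cases are invented for illustration, and many real features need fuzzier scoring or an LLM judge, which is not shown.

```python
from typing import Callable, List, Tuple

# Golden dataset: (input, expected output) pairs, invented for illustration.
GOLDEN: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("10 / 4", "2.5"),
]


def pass_rate(feature: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases where the feature's output matches exactly."""
    passed = sum(1 for prompt, expected in cases if feature(prompt).strip() == expected)
    return passed / len(cases)


def check_release(
    feature: Callable[[str], str],
    cases: List[Tuple[str, str]],
    threshold: float = 0.9,  # illustrative target, set per feature
) -> bool:
    """Gate a change: True only if the pass rate meets the threshold."""
    return pass_rate(feature, cases) >= threshold
```

Writing `GOLDEN` first forces the definition of “working” onto the page before any prompt exists.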
2. Trace. See it.
Every agent run produces a trace: the prompts, tool calls, model responses, latencies, costs, errors. Tools that do this well in 2026: Langfuse, LangSmith, Helicone, Arize Phoenix, Weights & Biases Weave.
With traces, a bug report becomes diagnosable. Without them, you are guessing.
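Each of the tools above has its own SDK, none of which is shown here. As a tool-agnostic sketch of what a trace holds, a run can be recorded as a list of spans. The class and field names below are hypothetical.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Span:
    kind: str                  # e.g. "prompt", "tool_call", "model_response"
    payload: Dict[str, Any]    # whatever is needed to reconstruct the step
    latency_ms: float
    cost_eur: float = 0.0
    error: Optional[str] = None


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    def record(self, kind: str, fn: Callable[[], Any], cost_eur: float = 0.0, **payload: Any) -> Any:
        """Call fn(), timing it and capturing any error as a span."""
        start = time.perf_counter()
        error = None
        try:
            return fn()
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            latency = (time.perf_counter() - start) * 1000
            self.spans.append(Span(kind, payload, latency, cost_eur, error))

    def total_cost(self) -> float:
        return sum(s.cost_eur for s in self.spans)

    def to_json(self) -> str:
        """Serialize the whole run so it can be attached to a bug report."""
        return json.dumps(asdict(self))
```

A call such as `trace.record("tool_call", lambda: search(query), tool="search")` would log one span, where `search` is whatever tool the agent uses.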
3. Loop. Improve it.
Build → eval → fix → ship. Then repeat with a tighter eval. This is the “AI engineering loop.” It looks like normal engineering with one extra step (the eval), and that step is what separates AI products that improve from AI products that drift.
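The eval step in the loop can be made mechanical. A minimal sketch, assuming each eval run is stored as a hypothetical mapping from case ID to pass or fail:

```python
from typing import Dict, List


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Case IDs that passed in the baseline run but fail in the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


def should_ship(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> bool:
    """Ship only if nothing regressed and the pass count did not drop."""
    no_regressions = not regressions(baseline, candidate)
    return no_regressions and sum(candidate.values()) >= sum(baseline.values())
```

Treating any regression as a blocker is a strict policy chosen for this sketch. A team may prefer to allow some, as long as the decision is written down.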
The cost surprise pattern
A common shape: a chat product launches at €0.001 per conversation. Engineers add an agent loop that retrieves documents and reasons. Cost grows to €0.05 per conversation. Then someone suggests reasoning models. Cost grows to €0.30 per conversation. At one million conversations a day, you have built a product that spends €300,000 per day on model cost alone.
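The arithmetic is worth running before launch, not after. A short sketch using the per-conversation figures from the paragraph above. The stage labels are shorthand, and the volume is the one million conversations a day from the text.

```python
CONVERSATIONS_PER_DAY = 1_000_000

# Cost per conversation in euros at each stage, taken from the text above.
STAGES = {
    "plain chat": 0.001,
    "agent loop with retrieval": 0.05,
    "reasoning model": 0.30,
}


def daily_cost(cost_per_conversation: float, volume: int = CONVERSATIONS_PER_DAY) -> float:
    """Model spend per day in euros."""
    return cost_per_conversation * volume


for stage, unit_cost in STAGES.items():
    print(f"{stage}: €{daily_cost(unit_cost):,.0f} per day")
```

The unit cost grows 300-fold between the first stage and the last, and the daily bill grows with it.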
Hallucination-aware UX
For Family 3 features, hallucination is not optional. It is the default. The product’s job is to make it safe.
Patterns that work:
- Citations as primary UX. Show the sources the model used. Let the user verify.
- Confidence-aware output. When the model is uncertain, the UI shows it. “This may be wrong” is a legitimate UX state.
- Human-in-the-loop on irreversible actions. An agent should not delete a customer record without confirmation. Ever.
- Refusal as a feature. When the model can’t answer well, “I don’t know” is the right answer. Reward the team for shipping refusal.
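The four patterns fit in a few lines of product logic. A minimal sketch, with several assumptions: the confidence score is a number in [0, 1] that comes from somewhere else, the thresholds are illustrative, an answer with no citations is treated as a refusal, and the action name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Answer:
    text: str
    confidence: float  # assumed in [0, 1]; how it is estimated is out of scope
    citations: List[str] = field(default_factory=list)


def render(answer: Answer, refuse_below: float = 0.3, warn_below: float = 0.7) -> str:
    """Turn a model answer into what the user sees."""
    # Refusal as a feature: low confidence or no sources means no answer.
    if answer.confidence < refuse_below or not answer.citations:
        return "I don't know."
    lines = [answer.text]
    # Confidence-aware output: flag uncertain answers in the UI.
    if answer.confidence < warn_below:
        lines.append("This may be wrong.")
    # Citations as primary UX: always show the sources.
    lines.append("Sources: " + ", ".join(answer.citations))
    return "\n".join(lines)


IRREVERSIBLE = {"delete_customer_record"}  # hypothetical action name


def execute(action: str, run: Callable[[], str], confirmed_by_user: bool = False) -> str:
    """Run an action, but require explicit confirmation for irreversible ones."""
    if action in IRREVERSIBLE and not confirmed_by_user:
        return "confirmation_required"
    return run()
```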
The seven-point production gate
Before any AI feature ships:
- The eval set exists, has at least 50 cases, and the feature passes a target threshold.
- Cost-per-task at 10× volume is projected and acceptable.
- P95 latency is measured under load and meets the budget.
- Tracing is wired end-to-end. A bug report can be reproduced from its trace.
- A red-team pass surfaced and patched the obvious prompt injection and jailbreak vectors.
- Hallucination-aware UX is implemented. Citations, confidence, refusal, human-in-the-loop on irreversible actions.
- A weekly eval re-run is scheduled. Drift is detected, not discovered.
Miss any one and the project joins the 95% that doesn’t ship.
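The gate is simple enough to encode as a release check. A sketch, assuming a hypothetical `GateReport` that a team fills in from its own measurements. Only the 50-case minimum comes from the list above. Every other threshold is supplied by the team.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GateReport:
    """Measurements for one feature. All field names are hypothetical."""
    eval_cases: int
    eval_pass_rate: float
    eval_target: float
    cost_at_10x_acceptable: bool
    p95_latency_ms: float
    latency_budget_ms: float
    tracing_end_to_end: bool
    red_team_passed: bool
    hallucination_ux_done: bool
    weekly_eval_scheduled: bool


def failed_checks(r: GateReport) -> List[str]:
    """Names of the gate points this feature does not yet meet."""
    checks = {
        "eval": r.eval_cases >= 50 and r.eval_pass_rate >= r.eval_target,
        "cost": r.cost_at_10x_acceptable,
        "latency": r.p95_latency_ms <= r.latency_budget_ms,
        "tracing": r.tracing_end_to_end,
        "red team": r.red_team_passed,
        "hallucination UX": r.hallucination_ux_done,
        "weekly eval": r.weekly_eval_scheduled,
    }
    return [name for name, ok in checks.items() if not ok]


def may_ship(r: GateReport) -> bool:
    """True only when all seven points pass."""
    return not failed_checks(r)
```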
