May 28, 2026 · 6 min read

The AI Agent Reliability Gap: Demos Work, Launches Don't

AI agent reliability, explained: why demos succeed while agents in production fail, why "it worked in the demo" is a sampling statement, and how to close the gap before launch.

Nik Kowalsi

Reliability · Agent testing

AI agents pass demos and fail launches because a demo samples a tiny, friendly slice of inputs — author-phrased prompts, happy paths, single turns — while production samples thousands of distinct people with unfamiliar phrasing, follow-ups, and edge contexts. The agent doesn't change between demo and launch; the input distribution does. Closing the gap means testing on a production-shaped distribution first.

"It worked in the demo" is a sampling statement

When someone says an agent works, they're compressing a statistical claim: over some set of inputs, the agent produced acceptable outputs. The entire meaning hides in "some set."

A demo's input set is small and exquisitely biased. Five to twenty exchanges, phrased by the people who built the system, about scenarios chosen because they show well, almost always one turn deep, judged by an audience that wants to be impressed. That's not a dishonest sample — demos exist to show what's possible — but it's a sample from a distribution that will never occur again after launch day.

So "it worked in the demo" translates to: on a non-random sample of ~20 author-generated inputs, we observed zero failures. As evidence about behavior across thousands of real users, that's close to no evidence at all. It establishes existence ("there are inputs this agent handles well"), not reliability ("most inputs, for most people, get handled well"). Deterministic software lets you blur that line, because correct-once usually means correct-always for that code path. LLM agents revoke the privilege: behavior varies with phrasing, context, history, and sampling noise, so which inputs you tried is the whole ballgame.
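To put a rough number on "close to no evidence," here is a minimal sketch of the exact one-sided binomial bound for a run with zero observed failures. It assumes the trials are independent draws from the distribution you care about. A demo breaks that assumption in the flattering direction, so treat the result as a best case rather than an estimate. The function name is illustrative.

```python
def failure_rate_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the true failure rate after n_trials
    with zero observed failures.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    Assumption: trials are independent and drawn at random from the
    target input distribution (a demo script is neither).
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


for n in (5, 20, 300):
    bound = failure_rate_upper_bound(n)
    print(f"{n:>4} clean trials: failure rate could still be as high as {bound:.1%}")
```

For 20 clean exchanges the bound comes out near 14%: even a perfectly random sample of that size cannot rule out an agent that fails roughly one conversation in seven.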

Demo conditions vs. the production distribution

Line them up and the gap stops being mysterious:

Dimension	The demo	Production
Who's asking	The authors and people like them	Thousands of distinct people
Phrasing	Clear, well-formed, prompt-literate	Terse, ambiguous, non-native, typo-ridden
Scenario	Chosen because it works	Whatever life serves up
Turns	One, maybe two	Follow-ups, corrections, "wait, also…"
Context	Fresh session, clean state	Mid-problem, prior history, missing info
Audience	Wants it to work	Wants their problem solved, now
Failure cost	A chuckle, a re-roll	A churned customer, a screenshot

Two rows deserve special attention.

Phrasing. Production users are adversarially bad phrasers without adversarial intent. Nobody is trying to break your agent — they're typing "it didnt work" with no antecedent, pasting an error message with no question, asking in their second language, or writing one word: "refund?" The demo never contains these inputs because the authors are incapable of producing them: building the system makes you permanently unrepresentative of its users. You know the magic words. Your users don't — that's why they're asking.

Turns. Demos end at the applause line. Conversations don't. The follow-up, the correction, the topic change, the return visit: reliability lives in turns two through ten, where context accumulates and small errors compound. An agent with a modest per-turn error rate looks flawless in a one-turn demo and shaky by the fifth exchange; the math of compounding does the rest. (This is the "conversation failure" class described in "why AI agents fail.")
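The compounding arithmetic is short enough to write out. This sketch assumes a fixed per-turn error rate and independent turns. Both are simplifications, and the 5% figure is invented for illustration. Real conversations carry context forward, so one early mistake can raise the odds of later ones.

```python
def clean_conversation_probability(per_turn_error: float, turns: int) -> float:
    """Probability that every one of `turns` exchanges is acceptable,
    assuming each turn fails independently with the same probability."""
    if not 0.0 <= per_turn_error <= 1.0:
        raise ValueError("per_turn_error must be between 0 and 1")
    return (1.0 - per_turn_error) ** turns


# Illustrative only: a 5% per-turn error rate is an assumed number.
for turns in (1, 5, 10):
    p = clean_conversation_probability(0.05, turns)
    print(f"{turns:>2} turns: {p:.1%} chance of a fully clean conversation")
```

Under these assumptions a 95% one-turn success rate drops to about 77% over five turns and about 60% over ten.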

Why the gap survives good intentions

Teams don't skip testing out of laziness. The gap persists because of three structural traps:

The author trap. Whoever writes the test set imprints their vocabulary, assumptions, and patience on it. A hundred hand-written test cases are, distributionally, one person asking a hundred questions. You've scaled the sample size without touching the sample bias.

The average trap. Even teams that build large eval sets usually read one number. An aggregate score is a claim about the mean of a distribution you didn't design; it happily conceals an agent that's excellent for the median user and broken for entire cohorts — a language, an age band, an income situation. Means don't launch products; distributions do.
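A small sketch shows how one aggregate number can hide a broken cohort. The results below are made up for illustration (900 conversations in one language cohort, 100 in another), and the cohort labels are placeholders.

```python
from collections import defaultdict

# Invented results, one (cohort, passed) pair per simulated conversation.
results = (
    [("en", True)] * 859 + [("en", False)] * 41
    + [("es", True)] * 61 + [("es", False)] * 39
)


def pass_rate_by_cohort(rows):
    """Group pass/fail outcomes by cohort and return each cohort's pass rate."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for cohort, passed in rows:
        totals[cohort] += 1
        passes[cohort] += passed  # True counts as 1, False as 0
    return {cohort: passes[cohort] / totals[cohort] for cohort in totals}


overall = sum(passed for _, passed in results) / len(results)
by_cohort = pass_rate_by_cohort(results)

print(f"overall: {overall:.1%}")
for cohort, rate in sorted(by_cohort.items(), key=lambda item: item[1]):
    print(f"{cohort}: {rate:.1%}")
```

With this data the overall score is 92.0%, while the smaller cohort passes only 61.0% of the time. The mean is accurate and still tells you nothing about who fails.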

The regression trap. Non-determinism makes failures slippery. A bad conversation happens once, can't be reproduced, gets shrugged off as a fluke, and ships. Without reproducibility, there's no ratchet: nothing guarantees that a fixed failure stays fixed. (More on this in "regression testing non-deterministic agents.")
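One way to build that ratchet, sketched under assumptions: route all of the simulated user's randomness through a single seeded generator, and store the seed of any conversation that fails. Everything here (the phrase lists, `simulate_user_turns`, `record_failure`) is hypothetical. It pins down only the user side; the agent's own sampling noise still needs separate handling, such as repeated runs per seed.

```python
import random

# Hypothetical phrase pools standing in for a real user simulator.
OPENERS = ["refund?", "it didnt work", "where is my order", "cancel pls"]
FOLLOW_UPS = ["wait, also...", "no, the other one", "still broken"]


def simulate_user_turns(seed: int) -> list[str]:
    """Generate one simulated user's messages. Every random choice comes
    from a generator seeded with `seed`, so the same seed replays the
    same conversation."""
    rng = random.Random(seed)
    turns = [rng.choice(OPENERS)]
    turns += [rng.choice(FOLLOW_UPS) for _ in range(rng.randint(1, 3))]
    return turns


regression_seeds: list[int] = []


def record_failure(seed: int) -> None:
    """Keep the seed of a failed conversation so it is replayed on every future run."""
    if seed not in regression_seeds:
        regression_seeds.append(seed)


record_failure(1234)  # pretend the conversation for seed 1234 went badly
for seed in regression_seeds:
    # The replayed user messages are identical run after run.
    assert simulate_user_turns(seed) == simulate_user_turns(seed)
    print(seed, simulate_user_turns(seed))
```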

Closing the gap: test on a production-shaped distribution

You can't test on production's exact distribution before production exists. But you can get much closer than "the founders typed some questions." The principle: make your test population resemble your user population in the dimensions that change agent behavior — who people are, how they phrase, what context they carry, and how conversations unfold.

Concretely:

Test with many distinct users, not many prompts. The unit of testing is a person with context — an age, a language, an income, a situation, a communication style — because those attributes drive phrasing and expectations. Diversity of people generates diversity of inputs that no author can hand-write.
Ground the population in real statistics. If you invent your test users, you've reintroduced the author trap one level up. Anchoring the population to real demographic ground truth — the approach Synthetic Signals takes with a Census-grounded synthetic San Francisco of thousands of distinct residents — means the mix of ages, languages, and household situations reflects reality rather than imagination.
Run whole conversations. Follow-ups, corrections, second sessions. If your test users can't remember and return, you're still demoing.
Read results by cohort, never as one number. The launch question isn't "what's the score?" but "who fails?" — a coverage breakdown by age, language, and income turns a flattering 92% into the actionable "61% for one cohort you'd never have met before launch."
Make every failure permanent. Seeded, reproducible runs — same seed, same city, same people — turn any discovered failure into a regression test, building the ratchet that non-determinism otherwise denies you.
Gate the launch on the distribution. Ship when the worst-served cohort clears your bar, not when the average does.
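Two of these steps, sketched in code: draw a seeded test population from demographic weights, then gate on the worst-served cohort rather than the average. The weights, cohort labels, and the 0.85 bar are placeholders. In practice the weights would come from real demographic data, and sampling each attribute independently (as done here) ignores correlations between attributes that a grounded population would preserve.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    age_band: str
    language: str


# Placeholder marginals, invented for illustration only.
AGE_WEIGHTS = {"18-34": 0.35, "35-64": 0.45, "65+": 0.20}
LANGUAGE_WEIGHTS = {"en": 0.60, "es": 0.20, "zh": 0.20}


def sample_population(n: int, seed: int) -> list[Persona]:
    """Draw n personas from the weights above. The same seed yields the same people."""
    rng = random.Random(seed)
    ages = rng.choices(list(AGE_WEIGHTS), weights=list(AGE_WEIGHTS.values()), k=n)
    langs = rng.choices(list(LANGUAGE_WEIGHTS), weights=list(LANGUAGE_WEIGHTS.values()), k=n)
    return [Persona(age, lang) for age, lang in zip(ages, langs)]


def launch_gate(pass_rates: dict[str, float], bar: float) -> tuple[bool, str]:
    """Return (ship?, worst cohort). The decision rests on the lowest
    cohort pass rate, not on the mean."""
    worst = min(pass_rates, key=pass_rates.get)
    return pass_rates[worst] >= bar, worst


population = sample_population(1000, seed=7)
ok, worst = launch_gate({"en": 0.95, "es": 0.61, "zh": 0.90}, bar=0.85)
print(len(population), "personas; ship:", ok, "; worst cohort:", worst)
```

With the example pass rates, the gate refuses to ship and names "es" as the cohort to fix, even though the other two cohorts clear the bar comfortably.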

For the full pre-launch workflow, see "how to test AI agents before production."

What this doesn't fix

Honesty about limits:

Simulation approximates. A synthetic population is production-shaped, not production. Real users will produce phrasings and contexts no simulation sampled. The goal is shrinking the surprise, not abolishing it.
Your scoring lens is still on you. A production-shaped distribution generates realistic conversations; deciding what counts as success is a separate problem (rubrics, LLM judges, custom metrics — see LLM-as-a-judge), and a bad lens misgrades good coverage.
Operational reliability is a different discipline. Latency, rate limits, provider outages, and genuinely adversarial humans live outside behavioral testing.
Distributions drift. Your users in month six aren't your users at launch. Population testing is a practice, not a checkbox.
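Drift can at least be measured. One simple option, offered as a sketch rather than a recommendation, is the total variation distance between the cohort mix you tested against and the mix you observe later. The two mixes below are invented, and what counts as "too much drift" is a judgment call this code does not make.

```python
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two cohort mixes (each a mapping
    of cohort to share, summing to 1). 0 means identical, 1 means disjoint."""
    cohorts = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cohorts)


# Invented example mixes.
launch_mix = {"en": 0.7, "es": 0.2, "zh": 0.1}
month_six_mix = {"en": 0.5, "es": 0.3, "zh": 0.2}

drift = total_variation(launch_mix, month_six_mix)
print(f"cohort mix drift: {drift:.2f}")
```

Here the distance is 0.20, meaning a fifth of the user base has effectively shifted cohorts since the test population was built. That is a reasonable prompt to regenerate the population and rerun.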

The demo isn't the problem

Nothing here says demos are bad. Demos answer a real question — can this work? — and answering it is how projects earn the right to continue. The failure is epistemic: promoting a possibility proof into a reliability claim. The teams that launch well hold both statements separately. The demo says the agent can work. Only a production-shaped test can say it probably will — for the actual, various, impatient, wonderfully non-median people about to show up.

Between those two statements is the reliability gap. Cross it before your users measure it for you.

FAQ

What is the AI agent reliability gap?

It's the gap between how an agent performs under demo conditions — author-phrased prompts, happy paths, single turns, a forgiving audience — and how it performs against the production distribution of real users. The agent doesn't change between demo day and launch day; the input distribution does.

Why do AI agents work in demos but fail in production?

Because a demo samples a tiny, biased slice of inputs: questions phrased by the people who built the agent, about scenarios it handles well, usually one turn deep. Production samples thousands of distinct people with unfamiliar phrasing, missing context, follow-ups, and edge cases the demo never touched.

How do you make an AI agent reliable before production?

Test it against a production-shaped distribution instead of a hand-picked script: many distinct simulated users, multi-turn conversations, results broken down by cohort rather than averaged, and reproducible runs so every failure becomes a regression test.

Is a successful demo evidence that an agent works?

It's weak evidence. A demo proves existence — there are inputs the agent handles well — not reliability, which is a claim about the proportion and distribution of inputs it handles. Those are different statistical statements, and conflating them is why launches surprise teams.