June 29, 2026 · 7 min read

How to Test AI Agents Before Production

How to test AI agents before production: a 7-step method — define success, build a realistic user population, run multi-turn tests, score, and gate.

Nik Kowalsi

Agent testing · AI evals

To test AI agents before production: define what success means for each task, assemble a test population that resembles your real users, run full multi-turn conversations instead of single prompts, score them with an explicit method, break results down by user cohort, freeze failures into a regression suite, and keep evaluating after launch.

That's the whole method in one sentence. The rest of this guide is how to actually do each step — with any tooling, including a spreadsheet and a script.

Why agents can't be tested like normal software

Three properties break conventional testing:

Non-determinism. The same input can produce different outputs run to run, so a single green test proves little. You need rates across samples, not single assertions.
Trajectories, not responses. An agent's output is a multi-step episode: tool calls, state changes, turns of conversation. Each step can look fine while the episode fails.
User-dependence. Agent quality varies with who's talking — their phrasing, patience, language, and context. A test suite that only speaks like your team tests a user base that doesn't exist.

The method below addresses all three. If you want the conceptual grounding first, read what AI evals are; this article is the applied version for agents.
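To make the non-determinism point concrete, here is a minimal sketch of reporting a rate instead of a single assertion. The `flaky_agent` function is a hypothetical stand-in for a real agent call, and the 80% success probability is invented for illustration.

```python
import random
from typing import Callable


def pass_rate(run_case: Callable[[], bool], n_runs: int) -> float:
    """Run the same test case n_runs times and return the fraction that passed."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    passes = sum(1 for _ in range(n_runs) if run_case())
    return passes / n_runs


# Hypothetical stand-in for a non-deterministic agent:
# it succeeds with probability 0.8 on each run.
rng = random.Random(0)


def flaky_agent() -> bool:
    return rng.random() < 0.8


rate = pass_rate(flaky_agent, n_runs=200)
print(f"pass rate over 200 runs: {rate:.2f}")
```

A single run of `flaky_agent` would come back green most of the time, which is exactly why one green run says so little.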

Step 1: Define what "works" means

You cannot test toward an undefined target. Before any tooling:

List the tasks. The 5–15 things users will actually try to do, phrased as user goals ("get a refund for a duplicate charge"), not features.
Write success criteria per task. One or two sentences each: what must be true at the end of the conversation for it to count as success? Include the outcome ("refund issued to the correct card"), not just the vibe ("agent was helpful").
Define failure severities. A wrong answer, a wrong action, and a harmful action are different problems. Decide now which failures are annoying and which are launch-blocking, so results can be triaged instead of debated.

If you can't write a success criterion for a task, you've found a product ambiguity — resolve it before testing, because the agent certainly hasn't resolved it either.
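One way to keep these definitions honest is to write them down as data, so a missing criterion is caught by a script rather than in a meeting. The sketch below is an assumption about how you might structure it: the `TaskSpec` fields, the severity names, and the second task are all illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(Enum):
    # Illustrative mapping of the three failure kinds to severities.
    ANNOYING = 1         # wrong answer
    SERIOUS = 2          # wrong action
    LAUNCH_BLOCKING = 3  # harmful action


@dataclass(frozen=True)
class TaskSpec:
    user_goal: str          # phrased as a user goal, not a feature
    success_criterion: str  # what must be true at the end of the conversation
    worst_failure: Severity


TASKS = [
    TaskSpec(
        user_goal="get a refund for a duplicate charge",
        success_criterion="refund issued to the correct card",
        worst_failure=Severity.LAUNCH_BLOCKING,
    ),
    # Hypothetical task with no criterion yet: a product ambiguity to resolve.
    TaskSpec(
        user_goal="change a delivery address",
        success_criterion="",
        worst_failure=Severity.SERIOUS,
    ),
]


def undefined_tasks(tasks: List[TaskSpec]) -> List[str]:
    """Return the user goals that still lack a written success criterion."""
    return [t.user_goal for t in tasks if not t.success_criterion.strip()]


print(undefined_tasks(TASKS))
```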

Step 2: Assemble a user population that looks like production

This is where most agent testing quietly fails. The typical pre-launch test population is the team, a few colleagues, and a handful of hand-written personas — all of whom share the builders' vocabulary, patience, and mental model of the product. Production users share none of it. They're terse, non-linear, multilingual, distracted, and arrive with account states and life situations the team never role-plays.

So build the test population deliberately:

Start from who your users actually are — ages, languages, income and household situations, tech comfort, typical contexts — not from who's available to test.
Cover behavior, not just demographics. Vague first messages, mid-conversation topic changes, typos, frustration, users who answer questions with questions.
Get to realistic scale. Ten personas can't represent thousands of users. Coverage failures live in the long tail, and the tail only appears at population scale.

Doing this by hand is the bottleneck, which is why synthetic users exist as a category: simulated people with realistic demographics and behavior who can each hold a conversation with your agent. Our complete guide to synthetic users covers the approach in depth. Synthetic Signals' version of this is a Census-grounded population: thousands of distinct synthetic residents of a real city, each with real demographics, a personality, and a daily context, so the test distribution is anchored to actual population statistics rather than to whatever a team imagined. However you build it, the check is the same: if your test users would all get hired at your company, they are not your users.
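If you are building the population with a script, a minimal sketch looks like the following. The attribute names and every percentage in `MARGINALS` are placeholders, not real statistics, and sampling each attribute independently is a simplification: real populations have correlations (age and language, for example) that a serious generator would need to model.

```python
import random
from collections import Counter
from typing import Dict, List

# Placeholder distributions. Replace with figures for your actual user base.
MARGINALS: Dict[str, Dict[str, float]] = {
    "age_band": {"18-29": 0.25, "30-49": 0.40, "50-69": 0.25, "70+": 0.10},
    "language": {"en": 0.70, "es": 0.20, "other": 0.10},
    "style": {"terse": 0.40, "verbose": 0.20, "typo-prone": 0.20, "frustrated": 0.20},
}


def sample_population(n: int, seed: int) -> List[Dict[str, str]]:
    """Draw n personas, sampling each attribute from its own distribution."""
    rng = random.Random(seed)  # seeded so the same population can be regenerated
    population = []
    for _ in range(n):
        persona = {
            attr: rng.choices(list(dist), weights=list(dist.values()))[0]
            for attr, dist in MARGINALS.items()
        }
        population.append(persona)
    return population


population = sample_population(2000, seed=42)
print(Counter(p["language"] for p in population))
```

Note that the behavioral axis (`style`) sits next to the demographic ones, so both can be sliced on later.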

Step 3: Run real multi-turn conversations

Single-prompt tests measure a product you're not shipping. Real usage is conversational: users under-specify, correct themselves, ask follow-ups, and return later expecting to be remembered. Your tests must be conversational too.

Full conversations to task completion or abandonment — not one exchange. Many failures only appear at turn 4+: lost context, forgotten constraints, contradictions with earlier turns.
Follow-ups and second sessions. If the product implies continuity ("check my order again"), test continuity. This requires test users with memory across sessions — a capability worth checking your tooling for explicitly.
Failure-path conversations. Tools erroring, users refusing to give information, requests outside scope. Graceful degradation is a feature; test it like one.

See multi-turn evaluation for the mechanics of constructing and scoring these conversations.
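The core of a multi-turn harness is a loop that alternates between the agent and a simulated user until the user stops. Here is a minimal sketch; `stub_agent` and `scripted_user` are hypothetical stand-ins for a real agent and a real user simulator, and the `max_turns` cap is an assumed safety limit.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def run_conversation(
    agent_reply: Callable[[List[Turn]], str],
    user_reply: Callable[[List[Turn]], Optional[str]],
    first_message: str,
    max_turns: int = 10,
) -> List[Turn]:
    """Alternate agent and user turns until the user stops or max_turns is hit."""
    transcript: List[Turn] = [("user", first_message)]
    for _ in range(max_turns):
        transcript.append(("agent", agent_reply(transcript)))
        next_message = user_reply(transcript)
        if next_message is None:  # user is done, or has abandoned the task
            break
        transcript.append(("user", next_message))
    return transcript


def scripted_user(script: List[str]) -> Callable[[List[Turn]], Optional[str]]:
    """Hypothetical simulated user that plays back a fixed list of messages."""
    remaining = iter(script)
    return lambda transcript: next(remaining, None)


def stub_agent(transcript: List[Turn]) -> str:
    """Hypothetical agent that only reports which user turn it is answering."""
    user_turns = sum(1 for speaker, _ in transcript if speaker == "user")
    return f"(reply to user turn {user_turns})"


transcript = run_conversation(
    stub_agent,
    scripted_user(["Sorry, I meant the second order.", "It was charged twice."]),
    first_message="my order is wrong",
)
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

The agent receives the whole transcript on every turn, which is what lets later turns expose lost context or forgotten constraints.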

Step 4: Score with an explicit lens

Every conversation now needs a verdict, and "I read a few and they looked good" doesn't survive contact with hundreds of transcripts. Make the scoring method explicit:

Assertions for anything code can check: the right tool called with the right arguments, the end state correct, format contracts honored.
An LLM judge for open-ended quality — with a concrete rubric built from your Step 1 success criteria, and calibrated against a human-labeled sample. The full craft (judge designs, bias mitigations, calibration) is in our LLM-as-a-judge guide.
Human review for a sample of transcripts, especially failures — both to calibrate the judge and because reading real failure transcripts is the fastest way to understand your agent.

Score at two levels: trajectory-level (did the task get done?) as the headline metric, turn-level as the diagnostic. And keep the lens yours — the criteria that define "works" for your product came from Step 1, not from a tool's default metric. This is why Synthetic Signals treats scoring as bring-your-own: the platform generates the conversations; the lens (judge, rubric, or custom metric) is defined by you.
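For the assertion layer, a trajectory-level check can be plain code over the recorded tool calls and end state. The sketch below assumes a hypothetical episode format; the tool name `issue_refund`, the `card_id` argument, and the `refund_status` field are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


@dataclass
class Episode:
    tool_calls: List[ToolCall]
    final_state: Dict[str, Any]


def check_refund_episode(episode: Episode, expected_card: str) -> List[str]:
    """Return the list of failed checks; an empty list means the episode passed."""
    failures = []
    refunds = [c for c in episode.tool_calls if c.name == "issue_refund"]
    if len(refunds) != 1:
        failures.append(f"expected exactly one issue_refund call, got {len(refunds)}")
    elif refunds[0].args.get("card_id") != expected_card:
        failures.append("refund sent to the wrong card")
    if episode.final_state.get("refund_status") != "issued":
        failures.append("end state is not 'issued'")
    return failures


good = Episode(
    tool_calls=[
        ToolCall("lookup_order", {"order_id": "A1"}),
        ToolCall("issue_refund", {"card_id": "card_1"}),
    ],
    final_state={"refund_status": "issued"},
)
bad = Episode(
    tool_calls=[ToolCall("issue_refund", {"card_id": "card_2"})],
    final_state={"refund_status": "issued"},
)
print(check_refund_episode(good, expected_card="card_1"))
print(check_refund_episode(bad, expected_card="card_1"))
```

Returning the failed checks rather than a bare boolean gives you the turn-level diagnostic alongside the trajectory-level verdict.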

Step 5: Break results down by cohort

An aggregate pass rate is where failures hide. An agent can post a strong overall score while failing specific groups — non-native speakers, older users, low-income account states, one language, one task — because those groups are minorities of the test set and the average absorbs them.

So never stop at the average:

Slice results by every axis you built into the population: age, language, income, tech comfort, task, conversation length.
Look for the worst cell, not the mean. The question isn't "what's the score?" but "who does it fail?" — your launch risk is concentrated in the worst-served cohort, who are also the users least likely to file a polite bug report.
Check per-cohort sample sizes. Six conversations are too few to estimate a pass rate with any useful precision; go back to Step 2 and generate more for that cohort.

This is the entire argument of cohort coverage: coverage, not a single score, is the deliverable of pre-production testing.
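A cohort breakdown can be a short script. The sketch below groups results by one axis, sorts worst-first, and attaches a Wilson score interval so small cohorts are visibly uncertain. The data is invented, and the `min_n=30` threshold is an arbitrary assumption, not a recommendation from this guide.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def cohort_report(results: List[Tuple[Dict[str, str], bool]], axis: str, min_n: int = 30):
    """Group (attributes, passed) results by one axis, worst pass rate first."""
    counts = defaultdict(lambda: [0, 0])  # cohort -> [passes, total]
    for attrs, passed in results:
        cell = counts[attrs[axis]]
        cell[0] += int(passed)
        cell[1] += 1
    report = [
        {
            "cohort": cohort,
            "n": n,
            "pass_rate": passes / n,
            "ci": wilson_interval(passes, n),
            "enough_data": n >= min_n,
        }
        for cohort, (passes, n) in counts.items()
    ]
    return sorted(report, key=lambda row: row["pass_rate"])


# Invented data: 100 English conversations (95 pass), 6 Spanish (4 pass).
results = [({"language": "en"}, i < 95) for i in range(100)]
results += [({"language": "es"}, i < 4) for i in range(6)]

overall = sum(passed for _, passed in results) / len(results)
report = cohort_report(results, axis="language")
print(f"overall pass rate: {overall:.2f}")
for row in report:
    print(row)
```

With this data the overall rate is about 0.93, while the Spanish cohort sits at 4 of 6 with an interval of roughly 0.30 to 0.90: a weak result and too little data to know how weak.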

Step 6: Freeze failures into a regression suite and gate releases

Every failure you find is an asset — if you can replay it.

Reproduce it. For non-deterministic systems this means pinning whatever can be pinned: model versions, prompts, tool versions, and the test user and conversation that triggered it. (In Synthetic Signals, the same seed regenerates the same city and the same people, so a failing conversation replays exactly.)
Add it to the suite. The failing conversation, its cohort, and its success criterion become a permanent test case. The suite becomes a ratchet of everything that has ever gone wrong.
Gate releases on it. Run the suite on every prompt change, model swap, and tool update. Because runs are stochastic, gate on rates with a noise margin, not single-run pass/fail — a case that passes 9 of 10 runs isn't fixed. The details are in regression testing non-deterministic agents.

Teams that skip this step re-discover the same failures release after release. It's the difference between testing as an event and testing as infrastructure.
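Gating on rates rather than single runs can be sketched as follows. Each frozen case is run several times, and a case blocks the release when its pass rate falls more than a noise margin below the rate recorded for the last accepted release. The `CaseResult` structure, the case IDs, and the 0.05 margin are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passes: int
    runs: int
    baseline_rate: float  # pass rate recorded on the last accepted release

    @property
    def rate(self) -> float:
        if self.runs <= 0:
            raise ValueError(f"case {self.case_id} has no runs")
        return self.passes / self.runs


def blocking_cases(results: List[CaseResult], noise_margin: float) -> List[str]:
    """Return IDs of cases whose pass rate dropped below baseline minus the margin."""
    return [r.case_id for r in results if r.rate < r.baseline_rate - noise_margin]


suite = [
    # Previously fixed case that now passes only 9 of 10 runs: a regression.
    CaseResult("refund-duplicate-charge", passes=9, runs=10, baseline_rate=1.0),
    # Case holding steady at its baseline rate.
    CaseResult("address-change", passes=18, runs=20, baseline_rate=0.9),
]

blocked = blocking_cases(suite, noise_margin=0.05)
print("release blocked by:", blocked)
```

In practice the margin should reflect how many runs you can afford per case; with only 10 runs, one flipped result moves the rate by 0.1.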

Step 7: Keep testing after launch

Pre-production testing narrows what launch discovers; it doesn't eliminate it. After launch:

Sample and score production conversations with the same lens from Step 4, so offline and online numbers are comparable.
Feed every production failure back into the population (a user type you didn't model) and the regression suite (a case you'll never re-ship).
Re-run the full suite on every upstream change — a model provider update is a release, whether you asked for it or not.

The loop is circular by design: production is where you learn what your population was missing, and the suite is how that learning compounds.
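Sampling production traffic for scoring can start very small. This sketch draws a reproducible random sample and scores it with the same function used offline. The transcripts and the `same_lens` scoring function are toy placeholders for your real data and your Step 4 lens.

```python
import random
from typing import Callable, Dict, List


def sample_conversations(conversation_ids: List[str], k: int, seed: int) -> List[str]:
    """Draw a reproducible random sample of production conversation IDs."""
    rng = random.Random(seed)
    return rng.sample(conversation_ids, min(k, len(conversation_ids)))


def online_pass_rate(
    sampled_ids: List[str],
    transcripts: Dict[str, str],
    score: Callable[[str], bool],
) -> float:
    """Score sampled transcripts with the same lens that is used offline."""
    if not sampled_ids:
        raise ValueError("no conversations sampled")
    verdicts = [score(transcripts[cid]) for cid in sampled_ids]
    return sum(verdicts) / len(verdicts)


# Toy production log: even-numbered conversations ended with a refund.
transcripts = {
    f"conv-{i}": "refund issued" if i % 2 == 0 else "user gave up"
    for i in range(100)
}


def same_lens(transcript: str) -> bool:
    """Placeholder for the Step 4 scoring function."""
    return "refund issued" in transcript


sampled = sample_conversations(list(transcripts), k=20, seed=7)
print(f"online pass rate: {online_pass_rate(sampled, transcripts, same_lens):.2f}")
```

Any sampled conversation that fails here is a candidate for both the population and the regression suite.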

The pre-launch checklist

Before your agent meets real users, you should be able to check every box:

Every major task has a written success criterion and failure severity
Test population reflects real user demographics and behavior, not the team
Population is large enough for per-cohort sample sizes, not just a total
Tests are full multi-turn conversations, including follow-ups and second sessions
Failure paths (tool errors, refusals, out-of-scope asks) are tested deliberately
Scoring method is explicit; the LLM judge (if used) is calibrated against human labels
A human has read a sample of failing transcripts
Results are broken down by cohort, and the worst cohort is known and acceptable
Every found failure is reproducible and frozen in a regression suite
Releases (including model updates) are gated on the suite
Post-launch sampling and feedback into the suite are set up before launch, not after

None of this requires any particular product — a disciplined team can do it with scripts. What a platform changes is the cost of Steps 2, 3, and 5: a realistic population, thousands of multi-turn conversations, and per-cohort breakdowns are exactly the parts that don't scale by hand. That's the part Synthetic Signals automates; the definitions, the lens, and the launch bar stay yours.

FAQ

How do you test an AI agent before production?

Define what success means per task, assemble a test user population that resembles production users, run full multi-turn conversations rather than single prompts, score with an explicit method, break results down by user cohort, freeze every failure into a regression suite, and keep evaluating after launch.

Why is testing AI agents harder than testing normal software?

Agents are non-deterministic (the same input can produce different outputs), their behavior depends on multi-step trajectories of tool calls and conversation state, and their quality varies by who is talking to them — so single-input, exact-output tests do not work.

How many test conversations do you need before launching an agent?

Enough to cover your main tasks across your main user types with a meaningful sample of each — which is usually hundreds to thousands of conversations, not dozens. The number that matters is per-cohort sample size, not the overall total.

Can you test an AI agent without real users?

Yes, up to a point. Synthetic users can hold realistic multi-turn conversations at a scale no test team can, which surfaces most task, tool, and cohort failures before launch. Real-user validation afterward is still necessary — synthetic testing narrows what launch has to discover, it does not replace it.