May 25, 2026 · 6 min read

Regression Testing Non-Deterministic AI Agents

LLM regression testing when the same input no longer gives the same output: pin seeds and populations, sample runs, score semantically, gate releases.

Bob Miagi

Agent testing · Reliability

Regression testing non-deterministic agents requires abandoning the core assumption of classic regression testing — same input, same output. The workable replacement: pin every source of variation you can (seeds, model versions, test populations), sample the ones you can't (repeated runs with statistical pass criteria), and score meaning instead of exact strings.

Why classic regression testing breaks

A regression test is a contract: this input produced this output when the feature worked, so if the output ever changes, something broke. The entire method leans on determinism — one run per test, byte-exact comparison, binary result.

LLM-based agents void the contract. Run the same conversation twice and the agent phrases things differently, orders tool calls differently, and occasionally takes a different path entirely. Sometimes it fails only intermittently, which is the worst case for a one-shot test. A suite built on exact matching either fails constantly on harmless variation (so people stop trusting it) or gets loosened until it catches nothing. Both endings are the same: no regression protection for the system that needs it most.

The fix isn't to force determinism everywhere — you can't. It's to be surgical about where variation comes from, remove what's removable, and treat the rest statistically.

Know your sources of non-determinism

Different sources need different countermeasures, so name them first:

Sampling. Temperature-based decoding is intentionally random. Even temperature 0 isn't a guarantee of bit-identical output on most hosted APIs (batching and hardware effects leak in), but it removes most variance.
Model updates. Hosted models change under stable-sounding aliases. The regression that hits you in March may be the provider's, not yours — invisible unless you pin versions and re-baseline deliberately.
Prompt and context drift. Retrieved documents change, timestamps advance, conversation history accumulates differently. The "same input" quietly wasn't.
Tool responses. Live APIs return different inventory, prices, weather. Your test failed because a flight sold out, not because the agent broke.
The test user. In simulated-user testing, the user is a model too. If your test population is regenerated differently each run, every conversation is unrepeatable — you've added a second stochastic system on top of the first.

Strategy 1: Pin what you can

Every source of variation you eliminate makes the remaining statistics cheaper. In rough order of leverage:

Pin model versions. Use dated/versioned model identifiers, never floating aliases. Upgrade deliberately, as its own tested change with its own baseline run — never as a side effect.
Pin prompts and configuration. Version-control every prompt, temperature, and tool schema. A regression suite can't attribute a change it can't diff.
Pin tool behavior. Record real tool responses and replay them in tests, or mock tools with fixed fixtures. Live dependencies belong in a separate smoke tier, not in the regression suite.
Pin seeds where the stack supports them. Seeded sampling, seeded data generation, seeded anything.
Pin the test population. If you test with simulated users, they must be exactly reconstructible. This is a design principle Synthetic Signals commits to end-to-end: the same seed regenerates the same city and the same people — same demographics, personalities, and schedules — so the conversation that failed on Tuesday can be replayed, exactly, until the fix holds.
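To make two of these pins concrete, here is a minimal sketch of a seeded test population and a replayed tool. Everything in it is hypothetical and for illustration only: `SimulatedUser`, `build_population`, `ReplayTool`, and the `search_flights` tool are invented names, not the API of Synthetic Signals or of any agent framework.

```python
import hashlib
import json
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatedUser:
    # Hypothetical fields; a real population would carry far more detail.
    user_id: int
    age: int
    patience: float  # 0.0 (gives up quickly) to 1.0 (very patient)


def build_population(seed: int, size: int) -> list[SimulatedUser]:
    """Regenerate the same users from the same seed."""
    # A private RNG, so unrelated code that touches the global RNG
    # cannot shift the sequence.
    rng = random.Random(seed)
    return [
        SimulatedUser(user_id=i, age=rng.randint(18, 90), patience=round(rng.random(), 3))
        for i in range(size)
    ]


class ReplayTool:
    """Serves recorded tool responses instead of calling a live API."""

    def __init__(self, recordings: dict[str, dict]):
        self._recordings = recordings

    @staticmethod
    def key(tool_name: str, arguments: dict) -> str:
        # Canonical JSON, so argument order does not change the lookup key.
        payload = json.dumps({"tool": tool_name, "args": arguments}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, tool_name: str, arguments: dict) -> dict:
        k = self.key(tool_name, arguments)
        if k not in self._recordings:
            # Fail loudly: an unrecorded call means the agent took a new path.
            raise KeyError(f"no recording for {tool_name} with {arguments}")
        return self._recordings[k]


# Same seed, same people.
population_a = build_population(seed=42, size=3)
population_b = build_population(seed=42, size=3)

# Same call, same answer, regardless of live inventory.
recorded = {ReplayTool.key("search_flights", {"to": "LIS", "from": "BER"}): {"seats": 2}}
tool = ReplayTool(recorded)
result = tool.call("search_flights", {"from": "BER", "to": "LIS"})
```

Raising on an unrecorded call is a deliberate choice in this sketch: a silent fallback to the live API would reintroduce the variation the fixture exists to remove.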

Pinning has a cost worth stating: a fully pinned world can drift away from production reality (frozen tool fixtures go stale, one frozen population becomes something you overfit to — see eval-driven development on held-out sets). Pin for comparability; refresh deliberately, on a schedule, as a versioned change.

Strategy 2: Sample what you can't pin

Whatever variance remains after pinning, measure it instead of pretending it away.

Run each case N times. A case passes if it clears a threshold, such as "at least 18 of 20 runs succeed", not if a single run does. Choose N per case by criticality: 5 runs catch gross breakage; 20 or more start to separate 95% from 85% reliability, though at 20 runs the two still overlap noticeably.
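A minimal sketch of the N-run threshold, plus the arithmetic for how often a threshold is cleared at a given true pass rate. The names `run_case` and `chance_of_clearing` are hypothetical, and `run_once` stands in for whatever actually drives your agent; runs are assumed to be independent.

```python
from math import comb
from typing import Callable


def run_case(run_once: Callable[[int], bool], n_runs: int, min_passes: int) -> dict:
    """Run one test case n_runs times and apply a pass-count threshold."""
    outcomes = [run_once(i) for i in range(n_runs)]
    passes = sum(outcomes)
    return {"passes": passes, "runs": n_runs, "passed": passes >= min_passes}


def chance_of_clearing(true_rate: float, n_runs: int, min_passes: int) -> float:
    """Probability that a case with this true pass rate clears the threshold.

    Assumes independent runs, so the pass count is binomially distributed.
    """
    return sum(
        comb(n_runs, k) * true_rate**k * (1 - true_rate) ** (n_runs - k)
        for k in range(min_passes, n_runs + 1)
    )


# Stand-in for a real agent run: fails on run 3 only, so 19 of 20 pass.
result = run_case(lambda i: i != 3, n_runs=20, min_passes=18)

# How sharp is "18 of 20" as a filter?
healthy = chance_of_clearing(0.95, 20, 18)   # about 0.92
degraded = chance_of_clearing(0.85, 20, 18)  # about 0.40
```

Under these assumptions a genuinely 95%-reliable case clears "18 of 20" about 92% of the time, while an 85%-reliable case still clears it about 40% of the time. That gap is the argument for larger N on the flows you care most about.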

Use pass@k and pass^k deliberately — they answer opposite questions. Pass@k (succeeds in at least one of k tries) measures capability, and is appropriate when a user can retry cheaply. Pass^k (succeeds in all k tries) measures consistency, and is the honest metric for anything autonomous or high-stakes — τ-bench introduced it precisely because agents that looked fine on single runs collapsed when asked to succeed repeatedly (our benchmarks explainer covers this). Gate critical flows on pass^k; you'll be startled by what fails.
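Both metrics can be estimated from the same batch of runs. The sketch below uses the standard combinatorial estimators as we understand them: given c successes in n runs, pass@k is the chance that a random subset of k runs contains at least one success, and pass^k is the chance that it contains only successes. The function names are ours, and this is not the reference implementation from any benchmark.

```python
from math import comb


def _validate(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k sampled runs succeeded (capability)."""
    _validate(n, c, k)
    # comb(n - c, k) is 0 when there are fewer than k failures to draw from.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k sampled runs succeeded (consistency)."""
    _validate(n, c, k)
    return comb(c, k) / comb(n, k)


# An agent that succeeded in 16 of 20 runs (80% per run).
capability = pass_at_k(n=20, c=16, k=4)    # about 0.9998
consistency = pass_hat_k(n=20, c=16, k=4)  # about 0.376
```

The same 80% agent looks almost perfect under pass@4 and succeeds four times in a row only about 38% of the time, which is why the choice of metric should follow from whether the user can retry.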

Set statistical pass criteria, not vibes. With N runs per case you can define a regression as a statistically meaningful drop against baseline rather than any drop. A case that went 19/20 → 18/20 is probably noise; 19/20 → 12/20 is a finding. Even a crude rule ("flag if the pass count drops by more than expected binomial noise") beats eyeballing dashboards.
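One way to write the crude rule down: treat the baseline pass rate as the true rate and ask how likely a pass count this low would be from noise alone. This is a sketch under that simplifying assumption. The `alpha` cutoff of 0.05 and the function names are our choices, and a more careful version would account for uncertainty in the baseline too (for example with a two-sample test such as Fisher's exact test).

```python
from math import comb


def binomial_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x + 1))


def is_regression(
    baseline_passes: int,
    baseline_runs: int,
    new_passes: int,
    new_runs: int,
    alpha: float = 0.05,  # assumed cutoff; tune to your tolerance for false alarms
) -> tuple[bool, float]:
    """Flag a drop that binomial noise alone would rarely produce.

    Simplification: the baseline rate is treated as exactly known.
    """
    baseline_rate = baseline_passes / baseline_runs
    p_value = binomial_cdf(new_passes, new_runs, baseline_rate)
    return p_value < alpha, p_value


noise_flag, noise_p = is_regression(19, 20, 18, 20)      # p is about 0.26: not flagged
finding_flag, finding_p = is_regression(19, 20, 12, 20)  # p is tiny: flagged
```

Note one sharp edge of the simplification: a baseline of 20/20 implies a rate of exactly 1.0, so any single failure gets flagged. Whether that is desirable depends on the flow.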

Budget accordingly. N-times execution multiplies cost, which is why the tiering matters: small-N on everything per merge, large-N on critical flows pre-release, and cheap deterministic assertions (which need only one run) doing as much of the work as possible.

Strategy 3: Score semantically, not literally

Exact-string comparison is the wrong oracle even for a pinned system. Compare meaning and effect instead, in order of preference:

End-state assertions. The refund exists, sized correctly; the calendar event moved. Immune to phrasing entirely — the strongest oracle an agent test can have.
Structural/behavioral checks. The right tools were called with the right arguments; forbidden actions weren't taken; required disclosures appeared.
Semantic scoring. For qualities only language carries — was the explanation accurate, did the agent handle the correction gracefully — use a rubric-driven LLM judge. One regression-specific caveat: the judge is a model too, so pin its version and prompt as strictly as the agent's, or judge drift will masquerade as agent regressions. Re-calibrate the judge against human labels whenever you do change it.
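The first two oracles need no model at all. Here is a minimal sketch for a refund flow, assuming your harness can hand you the tool-call trace, a snapshot of the mocked backend, and the transcript. The `RunRecord` shape, the tool names, and the disclosure phrase are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class RunRecord:
    tool_calls: list[ToolCall]
    final_state: dict  # snapshot of the mocked backend after the conversation
    transcript: str    # everything the agent said, concatenated


def check_refund_run(run: RunRecord, order_id: str, expected_cents: int) -> list[str]:
    """Return a list of failures; an empty list means the run passed."""
    failures: list[str] = []

    # End-state assertion: the refund exists and is sized correctly.
    refund = run.final_state.get("refunds", {}).get(order_id)
    if refund is None:
        failures.append("no refund recorded")
    elif refund != expected_cents:
        failures.append(f"refund was {refund}, expected {expected_cents}")

    # Structural check: the right tool, exactly once, with the right argument.
    issued = [c for c in run.tool_calls if c.name == "issue_refund"]
    if len(issued) != 1:
        failures.append(f"issue_refund called {len(issued)} times")
    elif issued[0].arguments.get("order_id") != order_id:
        failures.append("refund issued against the wrong order")

    # Forbidden action.
    if any(c.name == "delete_account" for c in run.tool_calls):
        failures.append("forbidden tool delete_account was called")

    # Required disclosure: a fixed phrase, so a substring match is enough here.
    if "5 business days" not in run.transcript.lower():
        failures.append("refund timeline disclosure missing")

    return failures


run = RunRecord(
    tool_calls=[
        ToolCall("lookup_order", {"order_id": "A17"}),
        ToolCall("issue_refund", {"order_id": "A17", "amount_cents": 2599}),
    ],
    final_state={"refunds": {"A17": 2599}},
    transcript="Done! Your refund of $25.99 will arrive within 5 business days.",
)
failures = check_refund_run(run, order_id="A17", expected_cents=2599)
```

None of these checks cares how the agent phrased anything except the one mandated phrase, so they stay stable across rewordings and need only the LLM judge for what is left over.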

For conversational agents, remember that the unit under regression is often the whole conversation, not a response — a repair-after-misunderstanding flow that used to work is a classic silent regression. Multi-turn evaluation covers scoring at that granularity.

Every production failure becomes a permanent test

This is the habit that makes the whole apparatus compound. When production surfaces a failure:

Capture the scenario — the user's goal, context, and conversation shape, not just the final prompt.
Reproduce it in the test environment. With a seeded population this can be exact: recreate the same synthetic user and replay the interaction. Synthetic Signals's version of this is turning any failed conversation into a rerunnable case, same person included.
Fix, then verify statistically — the case passes at your threshold over N runs, not once.
Keep it forever (until the feature it tests is removed). The suite becomes an accumulating immune system: everything that ever bit you, still watching.
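What "capture the scenario" might look like as a file you can commit next to the code: a minimal sketch with a hypothetical `RegressionCase` record. The field names are illustrative, and the seed and user ID assume a population generator that is reproducible from those two values.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    user_goal: str            # what the user was trying to achieve
    context: dict             # account state, locale, anything the agent saw
    opening_turns: list[str]  # the shape of the conversation, not just the last prompt
    population_seed: int      # regenerates the same synthetic population
    user_id: int              # picks the same person out of it
    n_runs: int               # how many times to run the case
    min_passes: int           # pass threshold over those runs


def to_json(case: RegressionCase) -> str:
    return json.dumps(asdict(case), indent=2, sort_keys=True)


def from_json(text: str) -> RegressionCase:
    return RegressionCase(**json.loads(text))


case = RegressionCase(
    case_id="refund-after-correction",
    user_goal="Get a refund for the second item only",
    context={"order_id": "A17", "items": 2},
    opening_turns=["I want a refund.", "No, not the whole order, just the headphones."],
    population_seed=42,
    user_id=7,
    n_runs=20,
    min_passes=18,
)
restored = from_json(to_json(case))
```

Storing the threshold with the case keeps step three honest: the definition of "fixed" travels with the scenario instead of living in someone's memory.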

Do this for six months and your regression suite is shaped like your actual risk, not like your team's guesses.

Release-gating in CI

Wire it into delivery or it decays into a dashboard nobody reads:

Per-merge: deterministic assertions plus small-N (3–5) runs on a smoke set. Minutes, not hours.
Pre-release: full suite at real N, pass^k on critical flows, compared against the last release's baseline with statistical thresholds. Block on hard-gate failures; report soft-gate drift for a human call.
On any pinned-dependency change (model upgrade, judge upgrade, population refresh): a dedicated re-baselining run whose diff is reviewed like a code change. Include the per-cohort breakdown in that diff so you catch a regression that hits only one user group; the goal is coverage, not averages.
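The pre-release gate can be sketched as a small function a CI job calls on the suite's results. This is an illustration under stated assumptions, not a prescribed design: the `CaseResult` shape and cohort names are hypothetical, the hard gate is "every run of a critical case passes", and the soft gate uses a fixed 10-point drop per cohort as a stand-in for a proper statistical threshold.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    cohort: str          # user group the case belongs to
    critical: bool       # critical flows are hard-gated
    passes: int
    runs: int
    baseline_passes: int
    baseline_runs: int


def gate(results: list[CaseResult], soft_drop: float = 0.10) -> dict:
    """Decide whether to block a release and what to surface for review."""
    # Hard gate: critical cases must pass on every run (pass^k style).
    hard_failures = [r.case_id for r in results if r.critical and r.passes < r.runs]

    # Soft gate: aggregate per cohort so one group's regression is not averaged away.
    totals: dict[str, list[int]] = {}
    for r in results:
        t = totals.setdefault(r.cohort, [0, 0, 0, 0])
        t[0] += r.passes
        t[1] += r.runs
        t[2] += r.baseline_passes
        t[3] += r.baseline_runs

    soft_warnings = sorted(
        cohort
        for cohort, (p, n, bp, bn) in totals.items()
        if bp / bn - p / n > soft_drop  # assumed fixed threshold, for illustration
    )
    return {
        "block": bool(hard_failures),
        "hard_failures": hard_failures,
        "soft_warnings": soft_warnings,
    }


results = [
    CaseResult("refund-basic", "new_users", True, 20, 20, 20, 20),
    CaseResult("reschedule", "new_users", False, 18, 20, 19, 20),
    CaseResult("reschedule-es", "non_native_speakers", False, 12, 20, 19, 20),
]
decision = gate(results)
# decision["block"] is False; decision["soft_warnings"] is ["non_native_speakers"]
```

In this example the overall pass rate barely moves, yet one cohort drops from 95% to 60%. The gate does not block, but it puts that cohort in front of a human.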

None of this restores the old one-run certainty; nothing will. What it restores is the thing regression testing was always for: the ability to change your system on Friday and know, with stated confidence, that you didn't break what worked on Thursday.

FAQ

What is LLM regression testing?

LLM regression testing verifies that changes to an AI system — new prompts, new models, new tools — did not break behavior that previously worked. Because LLM outputs vary between runs, it replaces exact-match assertions with pinned inputs, repeated sampling, statistical pass criteria, and semantic scoring.

Why does normal regression testing fail for AI agents?

Classic regression testing assumes the same input produces the same output, so one run per test is enough. Agents violate that assumption: sampling, model updates, shifting context, and live tool responses all change outputs between runs, so a single pass or fail is noise rather than signal.

How many times should I run each regression test?

Enough to distinguish a regression from run-to-run noise at the reliability you need: often 5 to 20 runs per case, with a threshold like at least 18 of 20 passing. Critical flows deserve stricter, pass-every-time criteria in the spirit of τ-bench's pass^k metric.

How do you reproduce a one-off agent failure?

Pin everything the failure depended on: model version, prompts, tool mocks, and — if the failure involved a specific user or conversation — the exact test user. Seeded synthetic populations make this practical: the same seed regenerates the same person, so the failing scenario can be replayed until it is fixed, then kept as a permanent case.