June 1, 20266 min read

User Simulation for AI Agents: How It Works

A technical guide to user simulation for AI agents: simulator anatomy, the conversation simulation loop, what to log and score, and the classic pitfalls.

Alex Gvozden

Synthetic users Agent testing

User simulation puts a model-driven simulated user — with a persona, a goal, context, and stop conditions — on the other side of a conversation with your AI agent. The simulator generates realistic multi-turn dialogue at scale; you log every exchange and score the transcripts. It's how agent testing escapes the demo script.

Why simulate users at all

An AI agent's input space isn't a set of prompts; it's people — each arriving with their own phrasing, patience, background knowledge, and willingness to cooperate. Scripted test cases cover the conversations you thought of. Real users immediately have the ones you didn't: the vague opener, the mid-task topic change, the answer to a question you didn't ask.

User simulation (sometimes called conversation simulation) is the middle path between scripts and production traffic: conversations varied enough to surprise you, cheap enough to run thousands of times, and controlled enough to rerun when something breaks.

Anatomy of a user simulator

A user simulator that produces useful test signal has four load-bearing parts. In practice, most homegrown simulators are missing at least two.

Persona conditioning

The simulator plays a specific person, and everything about that person should shape the dialogue: age, language and fluency, income, household, occupation, personality, tech comfort. Persona conditioning isn't flavor text — it's what determines whether a user writes "I would like to dispute a charge on my March statement" or "hey why did u take $40". The richer and more internally consistent the persona, the more the conversation diverges from LLM-default behavior. This is why serious simulation efforts treat the persona as a whole modeled person rather than a two-line character prompt.

A goal

The simulated user must want something concrete: cancel a subscription, find out if they qualify for a program, get a refund for a specific order. Goals give conversations direction and give you a ground-truth success criterion ("did the user get what they came for?"). Goals should also vary in difficulty and well-formedness — real users often start with a symptom ("my bill looks wrong") rather than a request ("apply the promo credit from March 12").

Context

What does this user know, and what's true in their world? Context includes their situation (a schedule, a location, a household), their history with your product, and the facts of their case. Context is what lets a simulated user answer the agent's clarifying questions consistently instead of hallucinating new details every turn — and it's what makes follow-up sessions possible. A user who remembers previous conversations can come back tomorrow and say "it still doesn't work," which is a scenario one-shot simulators literally cannot produce.

Stop conditions

When does the conversation end? Real users succeed, give up, get angry, or leave to try again later. A simulator needs explicit termination logic: goal achieved, patience exhausted (turn or frustration budget), agent loop detected, or user escalation. Without stop conditions, simulated conversations either run forever or end wherever the LLM feels narrative closure — neither of which resembles a real session boundary. Abandonment, in particular, is a first-class outcome you want to measure, not a failure of the harness.

The simulation loop

Against an agent under test, the loop per conversation looks like this:

sample user (persona, goal, context)  →  user sends opening message
  ↺  agent responds  →  simulator updates its state
     (goal progress? new info? patience spent?)
     → simulator replies in persona … until a stop condition fires
→ transcript + outcome + metadata written to the log

Run that across a population of users rather than a handful, and across multiple sessions per user where memory matters. Two properties of the harness matter as much as the simulator itself:

Determinism where you can get it. Same seed → same population, same people, same scenarios. Reproducibility is what turns "we saw a weird failure once" into a permanent regression test you rerun on every agent change — the subject of regression testing non-deterministic agents.
Separation of concerns. The simulator produces behavior; a separate scoring pass judges it. Keep them apart (more on why below).

What to log and score

Log more than the transcript. The minimum useful record per conversation:

What	Why it matters
Full transcript, per-turn timestamps	The raw evidence for every downstream judgment
User profile (demographics, persona, goal)	Enables cohort breakdowns later
Outcome (goal met / abandoned / escalated) + which stop condition fired	Your top-line signal, and abandonment analysis
Turn count and user effort	"Succeeded in 3 turns" vs "succeeded in 19" are different products
Agent-side traces (tool calls, retrievals, errors)	Ties conversational failures to their mechanical causes — OpenTelemetry-style tracing makes this cheap
Seed / run metadata	Reproducing any conversation exactly

Scoring is a deliberately separate stage, and there is no single right lens: task completion, instruction adherence, tone, factual accuracy, and efficiency all matter differently per product. Define your own — an LLM judge, a rubric, a programmatic check — and apply it to logged transcripts after the fact. Then read results by cohort, not just in aggregate: an average hides the fact that your agent fails limited-English users or first-time customers specifically. That breakdown — coverage across age, language, income, situation — is usually where simulation pays for itself.

Common pitfalls

User simulation has well-worn failure modes. Every team building a simulator hits at least the first two.

The overly agreeable simulator

LLMs are trained to be helpful, and it leaks into their roleplay. Left alone, simulated users accept mediocre answers, thank the agent for failing, and never rage-quit. Your pass rates inflate accordingly. Mitigations: condition patience and disposition explicitly in the persona, require goal verification ("the user does not consider this resolved until X is actually true"), and audit transcripts for the tell — conversations that end politely with the goal unmet but the outcome marked successful.

Mode collapse into similar users

Ask one model to improvise a thousand users and you get a thousand paraphrases of the same person: articulate, cooperative, demographically default. Prompting for "diversity" produces surface variety, not distributional realism. The structural fix is grounding: sample the population from real statistics so the simulator is told who to be rather than asked to invent it. This is the core argument of the synthetic users guide, and it's why Synthetic Signals builds its simulator population as a Census-grounded city of thousands of distinct residents instead of a prompt that says "be diverse."

Leaking the rubric into the simulator

If the simulated user knows what the judge rewards, conversations bend toward scoreable moments — users conveniently ask exactly the questions the rubric checks, and metrics detach from reality. It's Goodhart's law inside your own harness. Keep the simulator blind to evaluation criteria; drive it only by persona, goal, and context.

Simulator–agent co-drift

Subtler and nastier: you tune the agent against the simulator, then tune the simulator (to be "more realistic") in ways informed by the agent, and the pair co-evolves into a private dialect. Metrics keep improving; real-user performance doesn't. Defenses: version the simulator population and change it deliberately (not reactively), keep a frozen holdout population you never tune against, and periodically compare simulated transcripts to real production conversations to check the simulator still resembles your actual users.

Honest limits

Even a well-built simulator is an approximation. LLM-played users under-produce some real behaviors — typos, half-read replies, multitasking silence, genuine emotional volatility — and grounding data covers demographics far better than attitudes. Treat simulation as the widest early net, not a certification: findings are evidence to confirm, and real traffic remains the ground truth that calibrates the whole setup.

Getting the diversity problem right

It's worth ending where most simulators quietly fail. The value of user simulation scales with how different the simulated users genuinely are — different goals, phrasings, fluency, patience, context. That difference can't be prompted into existence; it has to be inherited from data. Ground the population, condition the behavior on each grounded person, keep scoring separate, and rerun on every change. That's the difference between a simulator that flatters your agent and one that finds the users it fails before they find you.

FAQ

What is user simulation for AI agents?

User simulation uses a model-driven simulated user — with a persona, a goal, and context — to hold realistic multi-turn conversations with an AI agent under test, so teams can evaluate the agent's behavior at scale before real users interact with it.

What are the main components of a user simulator?

Four things: persona conditioning (who the user is), a goal (what they want), context (what they know and their situation), and stop conditions (when they give up, succeed, or escalate). Miss any one and the conversations stop resembling real usage.

Why do simulated users all end up sounding the same?

Because they're improvised from the same LLM with the same priors, generated personas collapse toward a default: polite, articulate, cooperative. Grounding the simulated population in real statistics, such as Census data, is the most effective fix.

Should the user simulator see the evaluation rubric?

No. If the simulator knows what the judge rewards, it steers conversations toward scoreable moments and your metrics inflate. Keep simulation and scoring strictly separated, with the judge applied to transcripts after the fact.