July 2, 2026 · 7 min read

Synthetic Users: The Complete Guide (2026)

What synthetic users are, how they're generated, and how teams use them to test AI agents at population scale — beyond hand-written personas.

Alex Gvozden

Synthetic users · Agent testing

Synthetic users are simulated people — generated with realistic demographics, context, and goals — that stand in for real participants when you test software. Unlike static personas, they behave: they ask questions, follow up, get confused, change their minds. For AI agents, they make it possible to test against thousands of distinct users before a single real one shows up.

What synthetic users are (and aren't)

The term gets used loosely, so it's worth pinning down. A synthetic user has three properties that together separate it from its neighbors:

It's a person model, not a description. It has attributes — age, language, income, household, occupation — plus a personality and a situation it's currently in.
It generates behavior. It can hold a conversation, pursue a goal, abandon a task, come back later. It produces transcripts, not bullet points.
It's used as a test participant. Its purpose is to interact with your product — usually an AI agent, chatbot, or interface — so you can observe what happens.

Three adjacent concepts get conflated with synthetic users constantly. The differences matter in practice:

Concept	What it is	What it produces	Where it's useful
Persona	A human-written description of an archetypal user	A document	Design alignment, empathy, stakeholder communication
Synthetic data	Generated records that mimic a dataset's statistics	Rows, text, images	Training models, augmenting scarce data, privacy-safe sharing
Synthetic respondents	Simulated survey-takers used in market research	Survey answers, opinions	Concept tests, pricing research, early signal on messaging
Synthetic users	Simulated people that interact with software	Behavior: conversations, sessions, task attempts	Testing AI agents and products before (and alongside) real users

The persona distinction is the one most teams trip on. A persona is static — it can't answer a follow-up question, because there's nobody there. A synthetic user is the persona plus an engine: something that can actually sit on the other side of a conversation. We go deeper on this in Synthetic Personas vs. Hand-Written Personas.

The synthetic-respondents distinction matters too. Market-research tools simulate opinions — what would this segment say about this ad? Synthetic users simulate behavior — what happens when this person tries to use your product? Both are legitimate; they answer different questions. (We cover the research side separately in Synthetic Users for UX Research.)

How synthetic users are generated

There are two broad approaches, and the difference between them is the difference between a useful test population and an expensive mirage.

Approach 1: ask an LLM to invent people

The naive method: prompt a language model with "generate 500 diverse users for a banking app." It works, in the sense that you get 500 profiles. But LLMs invent people the way they complete any text — by sampling from what's most probable. The result clusters hard around defaults:

Ages bunch in the 25–45 band; incomes bunch around round, comfortable numbers.
Nearly everyone is a fluent, articulate English speaker who communicates in complete sentences.
Occupations skew toward the professional class the training data over-represents.
"Diverse" profiles read like a checklist — surface-level variety draped over the same underlying person.

Even explicit instructions to diversify tend to produce tokenistic variety rather than distributional realism. You get one 78-year-old, one non-English speaker, one low-income user: a museum of edge cases, not a population. And because every profile came from the same generator with the same priors, the behavior downstream converges too: everyone is patient, everyone phrases requests clearly, everyone cooperates.

Approach 2: ground the population in real statistics

The alternative is to sample people from actual population data and let the statistics — not the model's imagination — decide who exists. This is the approach Synthetic Signals takes: its synthetic population is a whole city of thousands of distinct residents, synthesized from real US Census data and placed on a real San Francisco map. The joint distributions do the work. Age, language, income, and household structure co-occur the way they actually co-occur, because they were sampled from data about real people rather than invented one profile at a time.

Grounding gets you three things prompting can't:

Distributional realism. The share of your test population that speaks limited English, lives alone, or is over 70 matches reality — not the LLM's defaults.
Internal coherence. Each generated person hangs together: the demographics, the household, the job, the daily schedule are mutually consistent, so the behavior they condition is consistent too.
No silent gaps. When people are sampled rather than imagined, the cohorts you'd never think to write down — the ones that break your agent — show up anyway.
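To make the contrast with prompting concrete, here is a minimal Python sketch of sampling a population from a joint table instead of inventing profiles one at a time. It is an illustration under stated assumptions, not how any particular product does it: `JOINT_CELLS` and `sample_population` are hypothetical names, and the weights are made-up placeholders rather than Census figures.

```python
import random

# Hypothetical joint cells: (age_band, language, household) -> weight.
# Weights are illustrative placeholders, NOT real Census figures.
JOINT_CELLS = {
    ("18-34", "english", "roommates"): 0.20,
    ("18-34", "spanish", "family"): 0.05,
    ("35-64", "english", "family"): 0.35,
    ("35-64", "chinese", "family"): 0.10,
    ("65+", "english", "alone"): 0.15,
    ("65+", "chinese", "alone"): 0.10,
    ("65+", "spanish", "family"): 0.05,
}


def sample_population(n, seed):
    """Draw n people from the joint table so attributes co-occur as weighted."""
    rng = random.Random(seed)
    cells = list(JOINT_CELLS)
    weights = [JOINT_CELLS[cell] for cell in cells]
    draws = rng.choices(cells, weights=weights, k=n)
    return [
        {"id": i, "age_band": age, "language": lang, "household": house}
        for i, (age, lang, house) in enumerate(draws)
    ]


population = sample_population(2000, seed=7)
share_65_plus = sum(p["age_band"] == "65+" for p in population) / len(population)
print(f"share aged 65+: {share_65_plus:.2f}")
```

Because whole cells are drawn rather than independent attributes, the share of any cohort is set by the table, and combinations the table does not contain never appear.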

The generation itself typically layers three components: a statistical skeleton (demographics sampled from Census or similar data), a personality and context layer (traits, a household, a schedule, current circumstances), and a behavioral engine (an LLM conditioned on all of the above, playing that person in conversation). The skeleton keeps the engine honest: the model plays a person the statistics chose, rather than one it would have invented.
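One way to picture the three layers is as two small records feeding a prompt. The sketch below is an assumption-heavy illustration: `Skeleton`, `Context`, and `build_roleplay_prompt` are hypothetical names, the example person is invented, and the behavioral engine itself (the LLM call) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skeleton:
    """Layer 1: demographics, assumed to come from a statistical sampler."""
    age: int
    language: str
    household: str
    occupation: str


@dataclass(frozen=True)
class Context:
    """Layer 2: personality and current circumstances."""
    traits: tuple
    schedule: str
    situation: str


def build_roleplay_prompt(skeleton, context):
    """Turn the two lower layers into instructions for the behavioral engine."""
    return "\n".join([
        "You are playing one specific person. Stay in character.",
        f"Age: {skeleton.age}. Primary language: {skeleton.language}.",
        f"Household: {skeleton.household}. Occupation: {skeleton.occupation}.",
        f"Traits: {', '.join(context.traits)}.",
        f"Typical day: {context.schedule}.",
        f"Right now: {context.situation}.",
    ])


person = Skeleton(age=71, language="Cantonese", household="lives alone", occupation="retired")
ctx = Context(
    traits=("impatient", "distrusts chatbots"),
    schedule="mornings at the clinic",
    situation="a bill looks wrong",
)
prompt = build_roleplay_prompt(person, ctx)
print(prompt)
```

The point of the split is that the engine never chooses who it is; it only receives a person that the lower layers already fixed.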

Using synthetic users to test an AI agent

Synthetic users earn their keep as a testing loop, not a one-off demo. The loop has five stages:

1. Population

Define who your agent will meet. Ideally that's a grounded population you can filter and shape toward your market — not twenty archetypes, but a distribution.

2. Conversations

Run the population against the agent. Each synthetic user brings its own goal, phrasing, patience, and context, and the conversations are multi-turn — because real usage is. Users that remember earlier sessions let you test follow-ups and second contacts, not just cold opens. (The mechanics of this — goals, stop conditions, simulator pitfalls — are the subject of User Simulation for AI Agents.)
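A multi-turn run can be sketched as a small loop with an explicit stop condition. This is a minimal illustration, not a real harness: `run_conversation`, `scripted_user`, and `echo_agent` are hypothetical stand-ins, and a real setup would call a language model on each side.

```python
def run_conversation(user_turn, agent_turn, max_turns=6):
    """Alternate user and agent turns until the user stops or turns run out."""
    transcript = []
    for _ in range(max_turns):
        user_msg = user_turn(transcript)
        if user_msg is None:  # the simulated user is done or has walked away
            return transcript, "user_stopped"
        transcript.append(("user", user_msg))
        transcript.append(("agent", agent_turn(transcript)))
    return transcript, "max_turns"


# Stand-ins so the sketch runs without any model behind it.
def scripted_user(transcript):
    script = ["my card got declined", "ok but that didnt work"]
    turns_taken = sum(1 for role, _ in transcript if role == "user")
    return script[turns_taken] if turns_taken < len(script) else None


def echo_agent(transcript):
    return f"Sorry to hear that: {transcript[-1][1]}"


transcript, stop_reason = run_conversation(scripted_user, echo_agent)
print(stop_reason, len(transcript))
```

Recording why a conversation ended, and not only what was said, makes it possible to tell a satisfied user from one who gave up.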

3. Scoring lens

Decide what "good" means, explicitly. There is no universal score for agent quality; a support bot, a booking agent, and a benefits-navigation assistant fail differently. Bring your own lens — an LLM judge, a rubric, a task-completion check, a custom metric — and apply it uniformly across every transcript. Tools should let you define this rather than imposing one number; Synthetic Signals deliberately treats scoring as bring-your-own for exactly this reason.
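In code, a bring-your-own lens can be as simple as a function from transcript to verdict, applied the same way to every conversation. The two lenses below are hypothetical examples on invented transcripts; an LLM judge or rubric would slot into the same shape.

```python
def task_completed(transcript):
    """One possible lens: did the agent ever confirm the request?"""
    return any(
        role == "agent" and "confirmed" in text.lower()
        for role, text in transcript
    )


def no_dead_end(transcript):
    """Another lens: the conversation must not end on an unanswered user turn."""
    return bool(transcript) and transcript[-1][0] == "agent"


def score_all(transcripts, lens):
    """Apply one lens uniformly to every transcript."""
    return {user_id: lens(t) for user_id, t in transcripts.items()}


# Invented transcripts for illustration only.
transcripts = {
    "u1": [("user", "book a table for two"), ("agent", "Confirmed for 7pm.")],
    "u2": [("user", "book a table"), ("agent", "For when?"), ("user", "never mind")],
    "u3": [("user", "cancel my booking"), ("agent", "I can't do that.")],
}
completion = score_all(transcripts, task_completed)
dead_ends = score_all(transcripts, no_dead_end)
print(completion)
print(dead_ends)
```

The same transcript can pass one lens and fail another, which is why the choice of lens has to be made explicitly rather than inherited from a tool.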

4. Cohort coverage

Break results down by who the user was. An aggregate score is where failures hide: an agent can average 90% while reliably failing non-native speakers or users over 65. Coverage views — pass rates by age, language, income, life situation — turn "the agent is mostly fine" into "the agent fails this specific group, here's the transcript."
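The arithmetic behind a hidden cohort failure is easy to reproduce. The sketch below uses made-up results (90 fluent speakers who all pass, 10 limited-English speakers who all fail) and a hypothetical `pass_rates` helper to show an aggregate and a breakdown side by side.

```python
from collections import defaultdict


def pass_rates(results, key):
    """Group pass/fail results by one user attribute; return pass rate per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for r in results:
        bucket = totals[r[key]]
        bucket[0] += r["passed"]
        bucket[1] += 1
    return {group: passed / total for group, (passed, total) in totals.items()}


# Made-up results for illustration only.
results = (
    [{"english": "fluent", "passed": True}] * 90
    + [{"english": "limited", "passed": False}] * 10
)
overall = sum(r["passed"] for r in results) / len(results)
by_language = pass_rates(results, "english")
print(overall, by_language)
```

Here the aggregate comes out at 90% while the limited-English group passes 0% of the time, which is exactly the pattern a single headline number conceals.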

5. Regression

When you find a failure, keep it. If your runs are reproducible — same seed, same population, same people — a failing conversation becomes a permanent regression test: fix the agent, re-run the exact scenario, prove the fix holds, and keep re-running it on every change. This is the step that turns testing from an event into a practice; we dig into it in Regression Testing Non-Deterministic Agents.
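A rough sketch of what keeping a failure can look like: store the seed and user id, rebuild the same person on demand, and re-run the case. Everything here is hypothetical (`make_user`, `REGRESSION_SUITE`, `rerun_suite`, the stored case), and the seed only pins down who the user is; it does not by itself make a language model's replies deterministic.

```python
import random


def make_user(seed, user_id):
    """Deterministically rebuild one synthetic user from (seed, user_id)."""
    rng = random.Random(f"{seed}:{user_id}")
    return {
        "id": user_id,
        "age": rng.randint(18, 90),
        "patience_turns": rng.randint(1, 6),
        "opening": rng.choice(
            ["my card got declined", "cancel my order", "where is my refund"]
        ),
    }


# A failure from an earlier run, stored as data (invented for illustration).
REGRESSION_SUITE = [
    {"seed": 42, "user_id": 317, "note": "agent looped on refund request"},
]


def rerun_suite(run_case):
    """Re-run every stored failure; return the cases that still fail."""
    return [
        case for case in REGRESSION_SUITE
        if not run_case(make_user(case["seed"], case["user_id"]))
    ]


# Stand-in for "run the agent against this user and score the transcript".
still_failing = rerun_suite(lambda user: True)
print(still_failing)
```

An empty list means every stored failure now passes; anything left in it is a regression to fix before the change ships.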

Then you loop. Every agent change re-runs the population; every new failure joins the suite.

Limits and honest caveats

Synthetic users are a tool with real failure modes. Anyone selling them without caveats is selling too hard.

LLM roleplay artifacts. The behavioral engine is still a language model playing a part. Unmitigated, simulated users tend to be more polite, more coherent, and more forgiving than real people. They rarely rage-quit, half-read a message, or type "ok but that didnt work" and vanish. Good simulation design fights this — grounded context, explicit patience and communication traits, honest stop conditions — but the gravity is always toward agreeableness.

Distribution fidelity has edges. Census-style grounding nails demographics and household structure. It does not directly give you attitudes, product familiarity, or trust in AI — those are modeled, and models can be wrong. Treat cohort coverage claims as strong and cohort psychology claims as approximate.

Simulated failure ≠ real failure (and vice versa). A synthetic user finding a bug is evidence, not proof; confirm serious findings against real transcripts when you have them. And a clean synthetic run doesn't certify the agent — it means the failures you knew how to simulate didn't occur.

You still need real users. Synthetic populations are for before and between: before launch, and between real-user touchpoints, when you need scale and repeatability. Real usage remains the ground truth that calibrates the whole apparatus. The strongest setups feed real-world failures back into the synthetic suite as permanent tests.

Where this is going

The category's credibility jumped when Stanford researchers built generative agents from two-hour interviews with roughly a thousand real people and showed the agents could reproduce those individuals' survey responses about 85% as accurately as the participants reproduced their own answers two weeks later. The result, covered in our take on the paper, put an academic floor under the idea that grounded simulations of people can be measurably faithful, not just plausible.

Expect three shifts from here. First, grounding becomes table stakes: "we prompted GPT to act like users" will stop passing review, and populations will be judged on their distributions. Second, synthetic users move into CI: reproducible populations make agent testing look like software testing, with regression suites and gates rather than vibe checks. Third, validation matures: the open research problem is measuring how closely synthetic behavior tracks real behavior per domain — and teams will increasingly demand that evidence before trusting a simulated crowd.

Synthetic users won't replace real ones. But for AI agents — products whose entire surface area is conversation with unpredictable people — they're becoming the difference between shipping on hope and shipping on evidence.

FAQ

What is a synthetic user?

A synthetic user is a simulated person — with demographics, context, and behavior — used in place of a real participant to test software. For AI agents, synthetic users hold realistic conversations with the agent so teams can evaluate it before real users ever touch it.

Are synthetic users the same as personas?

No. A persona is a one-page description a human writes; a synthetic user is a generative model of a person that can actually behave — ask questions, follow up, get confused, and change its mind — across thousands of test conversations.

How are synthetic users different from synthetic data?

Synthetic data is generated records — rows, images, text — used to train or augment models. A synthetic user is an active simulation of a person that produces behavior over time. One is a dataset; the other is a participant.

Can synthetic users replace real user testing?

No — they complement it. Synthetic users give you scale and coverage before launch; real users give you ground truth after. The teams that do this well use synthetic populations to find failures early, then confirm the fixes with real traffic.

Why does statistical grounding matter for synthetic users?

Left to improvise, language models invent similar people — articulate, cooperative, demographically clustered around defaults. Sampling users from real population statistics, like Census data, forces the simulated crowd to match how varied real users actually are.