Stanford's Generative Agent Simulations of 1,000 People
What Stanford's generative agent simulations of 1,000 people actually showed — and what it means for testing AI agents against synthetic populations.
In late 2024, researchers at Stanford and Google DeepMind published generative agent simulations of 1,000 people: LLM agents built from two-hour interviews with 1,052 real individuals. On the General Social Survey, the agents reproduced participants' answers 85% as accurately as the participants reproduced their own answers two weeks later. That result raised the bar for what "synthetic users" can credibly claim, and showed what it takes to clear it.
What the paper actually did
The paper — Generative Agent Simulations of 1,000 People (Park et al., 2024) — came from Joon Sung Park's group at Stanford, the same lab behind the 2023 "Smallville" generative-agents paper, with collaborators at Google DeepMind and several universities.
The method, at a high level:
- Recruit a representative sample. 1,052 U.S. participants, recruited to reflect the population across demographics rather than a convenience panel.
- Interview each person for about two hours. An AI interviewer conducted a long qualitative interview covering each participant's life story, values, work, relationships, and views — thousands of words of first-person material per person.
- Build one agent per person. Each agent is a large language model conditioned on that individual's full interview transcript. When queried, the model answers as that person, grounded in what the person actually said.
- Test the agents against ground truth. Both the humans and their agents completed the General Social Survey (a canonical U.S. attitudes survey), Big Five personality inventories, and a set of behavioral economic games. Two weeks later, the humans took the surveys again.
That two-week retest is the clever part. People are not perfectly consistent — ask someone the same attitude questions twice and the answers drift. So instead of grading agents against an impossible standard of frozen human opinions, the paper normalizes agent accuracy against each participant's own self-consistency.
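The normalization can be illustrated with a minimal sketch. It assumes answers are compared item by item for exact matches and that the first survey wave is the reference for both the agent and the retest; the function names are hypothetical, and the paper's exact scoring may differ in detail.

```python
def agreement(answers_a, answers_b):
    """Fraction of survey items on which two answer lists match exactly."""
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("answer lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return matches / len(answers_a)


def normalized_accuracy(agent, human_wave1, human_wave2):
    """Agent accuracy divided by the person's own two-week self-consistency."""
    raw = agreement(agent, human_wave1)            # agent vs. the person
    ceiling = agreement(human_wave2, human_wave1)  # person vs. themselves
    if ceiling == 0:
        raise ValueError("self-consistency is zero; the ratio is undefined")
    return raw / ceiling


# Hypothetical answers to ten items (not data from the study).
wave1 = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
wave2 = [1, 2, 3, 4, 5, 1, 2, 3, 1, 1]  # person repeats 8 of 10 answers
agent = [1, 2, 3, 4, 5, 1, 9, 9, 9, 9]  # agent matches 6 of 10 answers

score = normalized_accuracy(agent, wave1, wave2)
print(round(score, 2))  # 0.75
```

In this toy case the agent is right on 60% of items, but the person only agrees with themselves on 80%, so the agent reaches 75% of the achievable ceiling rather than 60% of a perfect score.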
What the results showed
Three findings matter most:
- On the General Social Survey, agents reached 85% of the human self-consistency ceiling. That is, they predicted participants' answers 85% as accurately as participants' own answers predicted their responses two weeks later. Personality-trait predictions were comparably strong; predictions of behavior in economic games were positive but noticeably weaker.
- Interviews beat demographics. Agents built from the full qualitative interview outperformed agents built only from a demographic profile or a short persona paragraph. Two hours of a person's own words carries information that "34, female, suburban, college-educated" does not.
- Interview grounding reduced group bias. Demographic-only agents were less accurate for some racial and ideological groups than others. Interview-based agents shrank those accuracy gaps — grounding in individual data made the simulation fairer, not just better.
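One simple way to express the group-bias finding is the spread between the best-served and worst-served group. The sketch below is an illustration under stated assumptions, not the paper's metric: the function name, the group labels, and every number are hypothetical.

```python
from collections import defaultdict


def accuracy_gap(records):
    """Return (mean accuracy per group, gap between best and worst group).

    records is an iterable of (group_label, accuracy) pairs, one per agent.
    """
    by_group = defaultdict(list)
    for group, accuracy in records:
        by_group[group].append(accuracy)
    if not by_group:
        raise ValueError("no records supplied")
    means = {g: sum(vals) / len(vals) for g, vals in by_group.items()}
    gap = max(means.values()) - min(means.values())
    return means, gap


# Hypothetical per-agent accuracies for two groups (not data from the study).
demographic_only = [("A", 0.80), ("A", 0.70), ("B", 0.60), ("B", 0.50)]
interview_based = [("A", 0.80), ("A", 0.80), ("B", 0.75), ("B", 0.75)]

_, demo_gap = accuracy_gap(demographic_only)
_, interview_gap = accuracy_gap(interview_based)
print(interview_gap < demo_gap)  # True
```

A smaller gap means the simulation's errors are spread more evenly across groups, which is the sense in which grounding made the agents fairer and not only more accurate.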
If you are checking specifics, go to the paper itself; the authors also released a restricted-access agent bank for researchers. But the headline is fair to state plainly: LLM agents grounded in rich individual data reproduced real people's measured attitudes at close to the reliability of the people themselves.
What this validates
It's worth being precise, because this paper gets both overclaimed and underclaimed.
It validates that simulated people are not inherently caricatures. The standard objection to synthetic users — "an LLM just plays a stereotype" — turns out to be an objection to ungrounded synthetic users. The paper shows the failure mode is in the grounding, not the concept: demographic-prompt agents were more biased and less accurate; interview-grounded agents were better on both counts.
It validates data-grounded construction over invention. In the study, fidelity rose with the richness of the real-world data behind each agent: full interviews beat demographic profiles and short persona paragraphs. That's the difference between asking a model to "imagine a retired nurse in Ohio" and conditioning it on evidence about actual people.
It validates measured-attitude replication specifically. Survey answers, personality inventories, structured choices — the things the study tested — are exactly the kind of bounded, expressible behavior LLMs are good at reproducing.
What it does not validate
Honest limits, because they matter:
- Behavioral prediction was the weak spot. The agents were noticeably better at reproducing what people say (attitudes, traits) than what they do under incentives (economic games). Anything downstream of "will this person actually take the action" inherits that gap.
- It says nothing about novel situations. The study tested instruments adjacent to what the interview covered. Whether an agent predicts its person's reaction to a genuinely new product, crisis, or interface is untested.
- It doesn't license skipping real people. The whole method starts with two hours of a real person's time. It's human data, compressed and made queryable — not a substitute for collecting human data. We go deeper on this boundary in Synthetic Users for UX Research: Promise and Limits.
- One country, one moment. A U.S. sample, interviewed once. Attitudes drift; populations differ.
What it means for testing AI agents
Here's the part most coverage missed: the paper's hardest problem — replicating a specific, named individual — is not the problem you face when testing an AI agent before launch.
If you're shipping a support agent, an onboarding assistant, or a scheduling bot, you don't need a simulation of Maria Delgado of Oakland, accurate to her GSS answers. You need a population whose distribution matches your users: the right mix of ages, languages, incomes, household situations, patience levels, and phrasings — so that when your agent fails for a kind of person, someone of that kind exists in your test set to fail against.
That's an easier bar than replicating an individual, and the Stanford result is strong evidence it's already crossed:
| Requirement | Stanford paper | Agent testing |
|---|---|---|
| Fidelity target | One specific individual | A realistic distribution of people |
| Grounding data | 2-hour interview per person | Population statistics (e.g., Census) |
| Success criterion | Match that person's answers | Surface failures a real cohort would hit |
| Cost of an error | Misrepresents a real person | One test conversation is slightly off |
Individual replication demands per-person interviews. Distributional realism can be built from statistical ground truth. This is the thesis behind Synthetic Signals: synthesize a whole city — thousands of distinct residents grounded in U.S. Census data, each with real demographics, a personality, a household, and a daily schedule — and run your agent against all of them. You're not betting that citizen #4,517 perfectly mirrors one real San Franciscan; you're betting the population, in aggregate, is shaped like your users. The paper says the underlying machinery — LLMs conditioned on real individual-level structure — produces believable, non-stereotyped people. The bias finding matters here too: grounded agents were more even across groups, which is precisely what you want when the point of testing is finding the cohorts your agent fails.
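The distributional idea can be sketched in a few lines. This is a toy under loud assumptions: the attribute names and weights are made up (they are not Census figures), attributes are sampled independently where a real system would need joint distributions, and the "test results" are a stand-in for actually running an agent. None of it describes how Synthetic Signals is implemented.

```python
import random
from collections import Counter

# Illustrative marginals only; these weights are invented, not Census data.
MARGINALS = {
    "age_band": {"18-34": 0.30, "35-64": 0.50, "65+": 0.20},
    "language": {"English": 0.75, "Spanish": 0.15, "Other": 0.10},
}


def sample_population(n, marginals, seed=0):
    """Draw n synthetic people, sampling each attribute from its marginal."""
    rng = random.Random(seed)  # seeded so a test run is reproducible
    people = []
    for _ in range(n):
        person = {}
        for attr, dist in marginals.items():
            values, weights = list(dist), list(dist.values())
            person[attr] = rng.choices(values, weights=weights)[0]
        people.append(person)
    return people


def failure_rate_by_cohort(people, failed, attr):
    """Share of failed test conversations within each value of attr."""
    totals, failures = Counter(), Counter()
    for person, did_fail in zip(people, failed):
        totals[person[attr]] += 1
        failures[person[attr]] += int(did_fail)
    return {cohort: failures[cohort] / totals[cohort] for cohort in totals}


population = sample_population(1000, MARGINALS)
# Stand-in outcome: pretend the agent under test fails every non-English speaker.
failed = [p["language"] != "English" for p in population]
rates = failure_rate_by_cohort(population, failed, "language")
```

The point of the per-cohort breakdown is the one made above: an aggregate pass rate can look healthy while one kind of person fails every time, and that only shows up if people of that kind exist in the test population.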
Where this is heading
The paper's authors frame their agent bank as infrastructure for social science — piloting surveys, testing interventions, studying group dynamics without burning participant goodwill. The engineering translation is already underway: simulated populations as a standard pre-production stage for anything conversational, the way load testing became standard for anything networked.
Two things to watch. First, replications and extensions — one strong paper is one strong paper, and follow-up work on non-U.S. samples and longitudinal drift will tell us how far the result generalizes. Second, the attitude-vs-behavior gap: as models improve at predicting actions rather than statements, simulation moves from "credible test user" toward "credible pilot study." It isn't there yet, and anyone claiming otherwise is ahead of the evidence.
For a broader map of the category the paper legitimized, see the complete guide to synthetic users.
The bottom line
The Stanford paper is the best evidence to date that simulated people built on real data are faithful enough to be useful — 85% of human self-consistency on measured attitudes, with less group bias than demographic prompting. It does not make synthetic people oracles of behavior. It makes them credible test users. For teams shipping AI agents, that's the claim that matters, and it now has data behind it.
FAQ
What is the Stanford 1,000 people simulation paper?
It's a late-2024 study, Generative Agent Simulations of 1,000 People, by researchers at Stanford and Google DeepMind. They built LLM-based agents from two-hour qualitative interviews with 1,052 real people, then tested whether each agent could reproduce its person's survey answers, personality measures, and behavior in economic games.
How accurate were the generative agents?
On the General Social Survey, the agents matched participants' answers 85% as accurately as the participants matched their own answers when re-asked two weeks later. In other words, the agents approached the ceiling set by human self-consistency on attitude surveys, though behavioral predictions were weaker.
Does this mean synthetic users can replace real users?
No. The paper validates that grounded simulations can reproduce measured attitudes of specific people, not that they can predict novel behavior, real-world purchasing, or emotional responses. It makes simulated people credible as test users for software, not as replacements for human research participants.
What does the paper mean for AI agent testing?
It's evidence that populations of simulated people grounded in real data behave with enough fidelity to be useful test users. The paper achieved the harder task, replicating specific individuals; testing an agent against a statistically grounded synthetic population is a weaker requirement.