June 8, 20266 min read

Multi-Turn Evaluation: Testing the Whole Conversation

Why single-turn evals miss real failures, and how multi-turn evaluation works: scripted flows, simulated users, and conversation-level scoring.

Nik Kowalsi

AI evals Agent testing

Multi-turn evaluation tests an AI agent across whole conversations rather than isolated prompt-response pairs. It matters because most conversational failures — lost context, misresolved references, mishandled goal changes, contradictions over time — are invisible at the single-turn level: every turn can look fine while the conversation as a whole fails the user.

The failure modes single-turn evals can't see

If you evaluate your agent on a set of (prompt, expected response) pairs, you are testing a different product than the one you ship. Conversation adds failure modes that only exist between turns.

Context carryover. The user states their account number, their dietary restriction, their deadline — and four turns later the agent asks for it again, or worse, silently proceeds without it. Each individual response is polite and well-formed. The conversation is broken.

Reference resolution. "Do the second one instead." "Same as last time." "Not that one, the other one." Humans compress relentlessly, and the meaning of almost every real utterance leans on what came before. A single-turn eval never contains "the second one," so it never tests whether your agent can resolve it.

Goal shifts. Real users change their minds mid-conversation: the flight change becomes a cancellation; the refund request becomes a product question. Agents anchored on the original goal will keep helpfully solving a problem the user no longer has.

Repair after misunderstanding. The agent gets something wrong; the user corrects it. What happens next is one of the strongest quality signals a conversation produces — graceful repair versus doubling down — and it's structurally impossible to test without at least three turns.

Long-horizon consistency. Turn 2 says the fee is waived; turn 14 quotes the fee. Neither turn is wrong in isolation. Contradiction is a property of the pair, so it's invisible to any evaluator that only ever sees one turn.

The follow-up a week later. The hardest case isn't even in the same session. A user returns days later: "Hey, did that thing ship?" Whether your agent (and its memory architecture) handles that determines whether it feels like a service or a goldfish. Single-turn suites don't just miss this — most multi-turn suites do too, because the test harness resets between runs.

The pattern in all six: correctness lives in the relationships between turns. Score turns independently and you literally cannot represent the failure.

How to evaluate multi-turn: the two input strategies

To evaluate a conversation you first need to generate one. There are two ways, and mature teams use both.

Scripted flows

You write the user side of the conversation in advance: turn 1 says this, turn 3 corrects the agent, turn 5 changes the goal. Deterministic, cheap, and precise — perfect for regression-testing a specific known failure ("agent loses the order ID after a topic change").

The limit is adaptivity. A script can't react to what the agent actually says. If the agent asks an unexpected clarifying question, the script barrels ahead with its pre-written next line, and the conversation goes off the rails in a way no real user would. Scripts test your imagined conversation, one branch of it, exactly.

Simulated users

A model plays the user: it has a goal, a personality, and context, and it responds to whatever the agent actually says — pushing back, getting confused, changing its mind. This is how τ-bench generates its conversations, and it's the only way to explore the branches you didn't script. We cover the mechanics in user simulation for AI agents.

The honest caveats: simulated users are models too. They can be unnaturally cooperative, oddly verbose, or too quick to reveal information a real user would withhold. Grounding them in realistic demographics and context reduces the artifacts; validating a sample of simulated conversations against real transcripts keeps you honest.

A useful division of labor: scripts for regressions you must never re-ship; simulated users for discovering the failures you haven't met yet.

How to score it: turn-level vs. conversation-level

Once you have conversations, you have a granularity decision.

Granularity	What it answers	Strengths	Blind spots
Turn-level	Was each response good given the context so far?	Localizes failure to a turn; fine-grained debugging	Can score a failed conversation as all-fine turns
Conversation-level	Did the user achieve their goal? Was the experience coherent?	Matches what users experience; catches relational failures	Tells you that it failed, not where
Trajectory-level	Were the agent's actions (tool calls, state changes) right across the whole run?	Catches destructive or wasteful paths behind a good transcript	Needs access to more than the transcript

In practice you want conversation-level as the headline metric, turn-level as the diagnostic drill-down, and trajectory checks wherever the agent touches real systems.

For the scoring mechanism itself: programmatic assertions handle end-state checks ("the booking exists, dated correctly"); an LLM judge handles the qualitative conversation-level questions ("did the agent notice the goal change?") — with the usual calibration duties covered in the LLM-as-a-judge guide. Two judge-specific warnings for multi-turn work: long transcripts push judges toward recency bias (over-weighting the final turns), and a rubric written for single responses will silently grade only the last turn. Write conversation-specific rubric items — context retention, repair quality, consistency — and spot-check judge verdicts against human reads of full transcripts.

Also resist averaging turn scores into a conversation score. A conversation with nine good turns and one catastrophic one (the agent confirms the wrong account) is not 90% good.

Testing across sessions, not just turns

Everything above still assumes one continuous conversation. The next tier is multi-session: the same user returns tomorrow, references last week, expects to be remembered. If your agent has memory, this is where it earns its complexity — and where it breaks (see agent memory architectures and how to test them).

Testing it requires test users that persist. This is a place where Synthetic Signals's design is directly relevant: its synthetic citizens remember across sessions — a conversation folds into the citizen's memory, so you can test the follow-up call, the "did that ship?" message, and the long game with the same person, not a fresh instantiation who has an injected summary pasted into their prompt. Because each citizen is a whole person with stable context — demographics, household, schedule — their week-later behavior stays coherent with their first visit. And because runs are seed-reproducible, a multi-session failure can be replayed exactly while you fix it.

A practical starting setup

Write 10–20 scripted flows covering your known conversational risks: a correction, a goal shift, a re-ask for provided info, a contradiction check spanning 10+ turns.
Add simulated-user runs with varied goals and personalities to explore unscripted branches; skim transcripts weekly for new failure categories.
Score at two levels: conversation-level goal success (judge + end-state assertions) as the gate; turn-level judgments as diagnostics.
Add one multi-session case per core workflow: same user, next day, referencing the first conversation.
Promote every real conversational failure from production into a scripted regression flow.

Multi-turn evaluation is more work than a prompt-response test set — more generation, more scoring, more reading of transcripts. But conversations are the product. Evaluating your agent one turn at a time is like reviewing a film one frame at a time: every frame can be in focus while the story makes no sense. For where this fits in a complete evaluation practice, see the AI agent evaluation guide.

FAQ

What is multi-turn evaluation?

Multi-turn evaluation tests an AI system across a whole conversation — or several conversations — instead of scoring isolated prompt-response pairs. It checks whether the system carries context, resolves references like 'the second one', handles goal changes, and stays consistent over time.

Why do single-turn evals miss failures?

Because most conversational failures are relational: the agent forgets what was said three turns ago, misresolves a pronoun, or contradicts its earlier answer. Every individual turn can look fine in isolation while the conversation as a whole fails the user.

Should I score every turn or the whole conversation?

Both, for different reasons. Turn-level scoring localizes where a conversation went wrong, which helps debugging. Conversation-level scoring measures what actually matters to the user — whether the goal was achieved. Relying on turn-level scores alone can rate a failed conversation as a series of fine turns.

How do you test conversations that span multiple sessions?

You need test users with persistent memory: the same simulated person returns days later and references the earlier interaction. Fixed test scripts can fake this with injected context; simulated users that genuinely remember across sessions test it for real.