June 11, 2026 · 7 min read

What Are AI Evals? A Plain-English Guide

AI evals explained: what an eval is (task + data + scoring), how LLM evals differ from agent evals, and how to write your first 20 eval cases.

Alex Gvozden


An AI eval is a repeatable test for an AI system: a defined task, a dataset of input cases, and a scoring method that measures how well the outputs meet your criteria. Evals exist because you can't unit-test probabilistic outputs the classic way — instead of asserting one exact answer, you measure quality across many cases.

The three parts of an eval

Every eval, from a five-line script to a full evaluation platform, reduces to the same three components:

A task. What you're asking the system to do: answer a support question, summarize a document, book an appointment, refactor a function.
Data. A set of input cases to run the task against — real user queries, hand-written examples, synthetic conversations, or a mix. Sometimes each case also carries a reference answer or expected outcome.
A scoring method. A rule for turning each output into a judgment: pass/fail, a 1–5 score, a comparison against a reference, or a human label.

Run the task over the data, apply the scoring method, and you get a result you can track over time: "the new prompt passes 84 of 100 cases; the old one passed 78." That's an eval. Everything else — dashboards, judges, CI gates — is tooling around this loop.
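As a minimal sketch of that loop, assuming a toy stand-in for the system under test and a simple "must contain this phrase" scoring rule (`toy_system`, `CASES`, and `must_contain` are illustrative names, not part of any particular framework):

```python
from typing import Callable

# Hypothetical stand-in for the system under test.
# A real eval would call a model or agent here.
def toy_system(question: str) -> str:
    canned = {
        "What is your refund window?": "Refunds are accepted within 30 days of purchase.",
        "Do you ship to Canada?": "Yes, we ship to Canada.",
    }
    return canned.get(question, "I'm not sure.")

# Data: input cases, each carrying a simple expected phrase.
CASES = [
    {"input": "What is your refund window?", "must_contain": "30 days"},
    {"input": "Do you ship to Canada?", "must_contain": "Canada"},
    {"input": "Can I pay by invoice?", "must_contain": "invoice"},
]

# Scoring method: pass if the required phrase appears (case-insensitive).
def score(output: str, case: dict) -> bool:
    return case["must_contain"].lower() in output.lower()

# The loop: run the task over the data, apply the scoring method.
def run_eval(system: Callable[[str], str], cases: list[dict]) -> dict:
    results = [score(system(case["input"]), case) for case in cases]
    return {"passed": sum(results), "total": len(results)}

summary = run_eval(toy_system, CASES)
print(f"{summary['passed']} of {summary['total']} cases passed")
```

Swapping `toy_system` for a real model call and `score` for a richer check changes the parts, not the shape of the loop.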

Why evals exist at all

Classic software is deterministic: add(2, 2) returns 4 every time, so a unit test can assert exact equality. LLMs and agents are probabilistic. The same prompt can produce different phrasings on different runs, and many of those phrasings are equally correct. Exact-match assertions either fail constantly (too strict) or get so loose they stop meaning anything.

Evals are the adaptation. Instead of one input and one asserted output, you use many inputs and a scoring method that tolerates acceptable variation. The unit of confidence shifts from "this test passed" to "this behavior holds at this rate across this distribution of cases."

That shift has a second consequence people miss: because results are rates, sample size and case selection matter. Ten cases can only estimate a pass rate very loosely, which tells you little about a system that will meet thousands of different users. Which cases you include quietly decides which users you're testing for, a point we'll come back to under "Common mistakes."
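One way to see how much sample size matters is to put an interval around the pass rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions. It assumes cases are independent and roughly representative, which real eval sets often are not, so treat the numbers as a rough guide rather than a guarantee.

```python
import math

def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval (z = 1.96) for a pass rate, via the Wilson score method."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Same 80% observed pass rate, very different certainty.
for passed, total in [(8, 10), (80, 100)]:
    low, high = wilson_interval(passed, total)
    print(f"{passed}/{total}: {low:.0%} to {high:.0%}")
```

With 8 of 10 passing, the interval spans roughly 49% to 94%; with 80 of 100 it narrows to roughly 71% to 87%. The same headline rate supports very different conclusions depending on how many cases sit behind it.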

The three main scoring approaches

How you score outputs is the biggest design decision in an eval. There are three families, and mature teams use all three in different places.

Approach	How it works	Cost / speed	Best for	Main weakness
Exact & assertion checks	Code checks the output: exact match, regex, JSON schema, "contains X", "tool Y was called"	Very cheap, instant, deterministic	Structured outputs, tool calls, format contracts, factual lookups with one right answer	Can't judge open-ended quality; brittle to harmless rephrasing
LLM-as-a-judge	Another model scores the output against a rubric or reference	Cheap-ish, fast, scales to thousands of cases	Open-ended quality: helpfulness, tone, faithfulness, instruction-following	Has known biases; needs calibration against human labels
Human review	People label or grade outputs	Expensive, slow	Ground truth, calibrating judges, high-stakes or ambiguous cases	Doesn't scale; reviewers disagree with each other too

A useful rule of thumb: assert what you can, judge what you can't, and spot-check the judge with humans. If a property can be checked in code — the response is valid JSON, the refund tool was called with the right amount — check it in code. Reserve LLM judges for the genuinely fuzzy qualities, and use a small set of human labels to verify the judge agrees with people. Our LLM-as-a-judge guide covers that calibration loop in detail.
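The "assert what you can" half might look like the sketch below. The tool-call record shape (`name`, `arguments`) and the tool name `issue_refund` are assumptions made for illustration; adapt them to whatever your agent framework actually logs.

```python
import json

def is_valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """Pass if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)

def refund_tool_called(tool_calls: list[dict], expected_amount: float) -> bool:
    """Pass if some call to the (hypothetical) issue_refund tool used the expected amount."""
    return any(
        call.get("name") == "issue_refund"
        and call.get("arguments", {}).get("amount") == expected_amount
        for call in tool_calls
    )

# Example: a structured reply plus the tool calls recorded during the run.
reply = '{"status": "refunded", "amount": 20.0}'
calls = [
    {"name": "lookup_order", "arguments": {"order_id": "A-1"}},
    {"name": "issue_refund", "arguments": {"order_id": "A-1", "amount": 20.0}},
]
print(is_valid_json_with_keys(reply, {"status", "amount"}))
print(refund_tool_called(calls, expected_amount=20.0))
```

Checks like these are deterministic and cost nothing to rerun, which is why they are worth exhausting before reaching for a judge.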

LLM evals vs. agent evals

Most eval writing online is about LLM evals: one prompt in, one response out, score the response. That model breaks down for agents.

An agent doesn't produce a response; it produces a trajectory — a multi-step sequence of decisions. It reads a message, decides whether to call a tool, interprets the result, maybe calls another tool, asks a clarifying question, and carries state across a conversation that can span many turns. Evaluating agents means evaluating that whole trajectory:

Tool use. Did it call the right tools, with the right arguments, in a sensible order? Did it recover when a tool errored?
State. Did the world end up in the right condition — the ticket updated, the meeting booked, the record unchanged when it should be unchanged?
Conversation. Did it hold up across turns — remember what the user said earlier, handle a correction, survive an ambiguous follow-up?

The practical consequence: a single-response eval can look great while the agent fails. Each individual reply reads fine; the ten-turn conversation still ends with the user's problem unsolved. Agent evals therefore need multi-turn test conversations and trajectory-level scoring, not just per-response grades. We go deeper on this in multi-turn evaluation and the broader agent evaluation guide.
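A trajectory-level score can be sketched as three separate checks, one per dimension above. The `Trajectory` and `Expectation` shapes are hypothetical, and the conversation check here is deliberately crude (a turn budget); a real suite would likely use a judge or richer assertions for that dimension.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    tool_calls: list[str]   # tool names, in the order they were called
    final_state: dict       # state of the world after the conversation
    turns: int              # number of conversation turns used

@dataclass
class Expectation:
    required_tools_in_order: list[str]
    expected_state: dict
    max_turns: int

def is_subsequence(needle: list[str], haystack: list[str]) -> bool:
    """True if needle's items appear in haystack in order (other calls may be interleaved)."""
    remaining = iter(haystack)
    return all(item in remaining for item in needle)

def score_trajectory(t: Trajectory, e: Expectation) -> dict:
    checks = {
        "tool_use": is_subsequence(e.required_tools_in_order, t.tool_calls),
        "state": all(t.final_state.get(k) == v for k, v in e.expected_state.items()),
        "conversation": t.turns <= e.max_turns,  # crude proxy, see the note above
    }
    checks["passed"] = all(checks.values())
    return checks

# Every tool call looks reasonable, but the ticket was never resolved.
run = Trajectory(
    tool_calls=["lookup_order", "check_policy", "issue_refund"],
    final_state={"ticket_status": "open"},
    turns=6,
)
want = Expectation(
    required_tools_in_order=["lookup_order", "issue_refund"],
    expected_state={"ticket_status": "resolved"},
    max_turns=10,
)
result = score_trajectory(run, want)
print(result)
```

Reporting the three checks separately, rather than only the combined verdict, shows which dimension failed. In this example the tool use and turn count pass while the state check fails.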

Offline vs. online evals

Evals run in two places, and they answer different questions.

Offline evals run before release, against a fixed dataset, in CI or on a laptop. Because inputs are fixed, results are comparable across versions — this is where you catch regressions and compare prompts, models, or architectures. Their weakness: they only cover cases you thought to include.

Online evals run against live traffic — scoring sampled production responses, tracking user feedback, watching task-completion signals. They cover the real distribution, including inputs you never imagined. Their weakness: by the time an online eval catches a failure, a real user has already experienced it.

You need both, in that order. Offline evals are the gate; online evals are the smoke detector. The most useful habit is connecting them: every failure your online monitoring surfaces should become a new offline case, so the same failure can never silently ship again.
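Connecting the two can be as simple as the sketch below, which copies failed production samples into the offline set. The record fields (`input`, `passed`, `source`, `pass_criteria`) are assumptions for illustration. The pass criteria are left empty on purpose, since a person still has to decide what a good answer looks like.

```python
def promote_failures(production_samples: list[dict], offline_cases: list[dict]) -> list[dict]:
    """Return the offline set extended with each new failing production input."""
    known_inputs = {case["input"] for case in offline_cases}
    updated = list(offline_cases)
    for sample in production_samples:
        if sample["passed"] or sample["input"] in known_inputs:
            continue  # skip successes and inputs already covered
        updated.append({
            "input": sample["input"],
            "source": "production",
            "pass_criteria": None,  # to be written by a human reviewer
        })
        known_inputs.add(sample["input"])
    return updated

offline = [{"input": "What is your refund window?", "source": "hand-written",
            "pass_criteria": "States the 30-day window."}]
sampled = [
    {"input": "What is your refund window?", "passed": False},  # already covered
    {"input": "refund pls order late", "passed": False},        # new failure
    {"input": "Do you ship to Canada?", "passed": True},        # not a failure
]
offline = promote_failures(sampled, offline)
print(len(offline))
```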

Starting out: your first 20 eval cases

Teams stall on evals because they imagine a thousand-case benchmark. Don't. Twenty good cases beat zero perfect ones, and the set compounds from there.

Write down the top tasks. The 5–10 things users actually come to your system for. Not features — tasks, phrased the way a user would.
Collect 2–4 cases per task. Real production queries if you have them; realistic hand-written ones if you don't. Vary phrasing, tone, and detail.
Include known-hard cases. Anything that has already broken in a demo, an internal test, or a support escalation goes in the set.
Define pass criteria per case. One or two sentences: what must be true for this output to count as good? If you can't write it, you've found an ambiguity in the product, which is worth knowing too.
Score the cheapest valid way. Assertions where possible, a simple LLM judge with your pass criteria as the rubric where not.
Run it on every change. Prompt edits, model swaps, and tool changes all trigger a run, and every new breakage adds a case.

That last step is the whole game. An eval suite isn't a document you write once; it's a ratchet that accumulates every failure you've ever seen.
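The ratchet can be enforced mechanically. A minimal sketch, assuming each run produces a mapping from case ID to pass/fail (the case IDs and the baseline format here are hypothetical):

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case IDs that passed in the baseline but fail, or are missing, in the current run."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )

baseline_run = {"refund-window": True, "ship-canada": True, "invoice-payment": False}
current_run = {"refund-window": True, "ship-canada": False, "invoice-payment": True}

regressions = find_regressions(baseline_run, current_run)
if regressions:
    print("Regressed cases:", ", ".join(regressions))
```

In CI, a non-empty list would fail the build. Treating a missing case as a regression is a deliberate choice in this sketch: it stops a case from leaving the suite unnoticed.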

Common mistakes

Overfitting to the eval set. If you iterate on prompts while staring at the same 20 cases, you will eventually tune for those 20 cases, the way models overfit training data. Keep a held-out set you look at rarely, and keep adding fresh cases from real usage.
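One way to keep a held-out set stable as the suite grows is to assign each case by hashing its ID, so a case never moves between sets when others are added. This is a sketch under that assumption; the 20% fraction and the case IDs are illustrative.

```python
import hashlib

def is_held_out(case_id: str, held_out_fraction: float = 0.2) -> bool:
    """Deterministically assign a case to the held-out set based on a hash of its ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform-ish value in [0, 1)
    return bucket < held_out_fraction

case_ids = [f"case-{i:03d}" for i in range(50)]
dev_set = [c for c in case_ids if not is_held_out(c)]
held_out_set = [c for c in case_ids if is_held_out(c)]
print(len(dev_set), len(held_out_set))
```

Iterate against `dev_set` day to day and run `held_out_set` only at milestones. A large gap between the two pass rates is a sign of tuning to the cases rather than to the task.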

One aggregate score hiding cohort failures. "87% pass" sounds like one fact but averages many populations. An agent can score 95% with fluent, patient, well-specified English queries and far worse with terse messages, non-native phrasing, or edge-case account states — and the aggregate won't show it. Break results down by cohort — user type, language, task, difficulty — before trusting any single number. This is the entire premise behind cohort coverage: the question isn't "what's the score," it's "who is it failing."
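Breaking a score down by cohort takes only a few lines once each result carries a cohort label. The data below is made up purely to illustrate the effect, and the `cohort` field name is an assumption.

```python
from collections import defaultdict

def pass_rate_by_cohort(results: list[dict], key: str = "cohort") -> dict[str, float]:
    """Group results by a cohort field and compute the pass rate within each group."""
    counts = defaultdict(lambda: [0, 0])  # cohort -> [passed, total]
    for r in results:
        counts[r[key]][0] += int(r["passed"])
        counts[r[key]][1] += 1
    return {cohort: passed / total for cohort, (passed, total) in counts.items()}

# Illustrative data: 20 fluent queries (19 pass) and 10 terse ones (5 pass).
results = (
    [{"cohort": "fluent", "passed": i < 19} for i in range(20)]
    + [{"cohort": "terse", "passed": i < 5} for i in range(10)]
)

overall = sum(r["passed"] for r in results) / len(results)
by_cohort = pass_rate_by_cohort(results)
print(f"overall: {overall:.0%}")
for cohort, rate in by_cohort.items():
    print(f"{cohort}: {rate:.0%}")
```

Here the overall rate is 80%, which hides a 95% rate for fluent queries and a 50% rate for terse ones.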

Testing single prompts when the product is a conversation. If users reach your system through multi-turn dialogue, one-shot evals measure something you don't ship.

Trusting an uncalibrated judge. An LLM judge is a model with its own failure modes. Until you've checked it against human labels on a sample, its scores are a hypothesis, not a measurement.
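A basic calibration check compares judge verdicts with human labels on the same sample. The sketch below reports raw agreement and Cohen's kappa, which corrects for the agreement two raters would reach by chance. The labels are invented for illustration, and what counts as good enough agreement is a judgment call for your team.

```python
import math

def agreement_and_kappa(judge: list[bool], human: list[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two sets of pass/fail labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge, p_human = sum(judge) / n, sum(human) / n
    # Agreement expected by chance if the two raters labelled independently.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return observed, math.nan  # kappa is undefined when both raters use one label only
    return observed, (observed - expected) / (1 - expected)

judge_labels = [True, True, True, True, False, False, True, True, False, True]
human_labels = [True, True, False, True, False, False, True, False, False, True]

agreement, kappa = agreement_and_kappa(judge_labels, human_labels)
print(f"agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

In this made-up sample the judge agrees with humans on 8 of 10 cases, giving a kappa of 0.6. Both disagreements are cases the judge passed and humans failed, which is the kind of pattern worth reading case by case.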

Non-reproducible runs. If two runs of the same eval give different results and you can't tell why, you can't attribute changes to your code. Pin versions, control temperature where you can, fix seeds where the tooling allows, and rerun enough times to know your noise floor. Reproducibility is what turns a one-off failure into a permanent regression test.
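Measuring the noise floor means repeating the same eval and looking at the spread. In the sketch below, `noisy_eval` is a simulated stand-in that draws pass/fail outcomes from a seeded random generator. A real run would call your system instead, and whether seeds are honored depends on your model provider and tooling.

```python
import random
import statistics
from typing import Callable, Iterable

def noisy_eval(seed: int, n_cases: int = 100, true_pass_rate: float = 0.8) -> float:
    """Simulated eval run: each case passes with a fixed probability (illustration only)."""
    rng = random.Random(seed)  # seeded, so the same seed replays the same run
    return sum(rng.random() < true_pass_rate for _ in range(n_cases)) / n_cases

def noise_floor(run_eval: Callable[[int], float], seeds: Iterable[int]) -> tuple[float, float]:
    """Mean and sample standard deviation of the pass rate across repeated runs."""
    rates = [run_eval(seed) for seed in seeds]
    return statistics.mean(rates), statistics.stdev(rates)

mean_rate, spread = noise_floor(noisy_eval, range(10))
print(f"pass rate {mean_rate:.1%} with run-to-run spread of about {spread:.1%}")
```

A difference between two versions that is smaller than this spread is not yet evidence of a real change.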

How this connects to agent testing

Everything above applies whether your "data" is a CSV of prompts or something richer. For conversational agents, the data component of the eval is the hard part: you need test users, not test strings. That's the gap Synthetic Signals works on — it generates the population side of the eval (thousands of distinct, Census-grounded synthetic users who hold real multi-turn conversations, with reproducible seeds so failures replay exactly) and lets you bring your own scoring method, whether that's assertions, an LLM judge, or your own metric. The eval loop is the same one described here; the input distribution just looks a lot more like production. For the full methodology, see how to test AI agents before production.

The takeaway

An eval is task + data + scoring, run repeatedly. Start with 20 cases, score them the cheapest valid way, run them on every change, and grow the set with every failure you find. Watch for the two failure modes that quietly invalidate results — overfitting to a static set, and averages that hide who you're failing — and evals stop being a research topic and start being what they actually are: the test suite for probabilistic software.

FAQ

What is an AI eval?

An AI eval is a repeatable test for an AI system, made of three parts: a task, a dataset of input cases, and a scoring method. You run the model or agent against the cases and measure how well the outputs meet your criteria — the AI equivalent of a test suite.

How are evals different from unit tests?

Unit tests assert one exact output for one input, which works for deterministic code. AI models are probabilistic — many different outputs can be correct — so evals score quality across many cases and report rates and distributions instead of a single pass/fail.

What is the difference between LLM evals and agent evals?

LLM evals score a single response to a single prompt. Agent evals score a multi-step trajectory: which tools were called, how state changed, and how the conversation went across turns. An agent can produce fluent individual replies and still fail the overall task.

How many eval cases do I need to start?

Around 20 hand-picked cases is a genuinely useful starting point. Pull them from real or expected user tasks, include a few known-hard cases, and grow the set every time you find a new failure.