May 21, 2026 · 6 min read

Eval-Driven Development for AI Agents

Eval-driven development means writing evals before you build, iterating against them, and gating releases on results. How the loop works — and its limits.

Alex Gvozden

AI evals · Agent engineering

Eval-driven development (EDD) means writing evaluations before — or while — you build an AI feature, iterating against them during development, and gating merges and releases on the results. It's test-driven development adapted for systems that are probabilistic instead of deterministic: the assertions become statistical, and the test set has to keep growing.

The TDD analogy, done honestly

The pitch writes itself: unit tests made software engineering rigorous, so evals will do the same for AI. Directionally right — and worth being precise about, because the differences change how you work.

What transfers from TDD:

Write the check before the code. Defining "what does success look like?" before building forces the specification conversation that prompt-tweaking postpones forever.
Fast feedback beats careful guessing. With a good eval suite, "did this prompt change help?" takes minutes, not a week of vibes.
Red → green → refactor. Reproduce a failure as a case, make it pass, confirm nothing else broke.
Gates keep quality monotonic. CI that blocks a merge when evals regress is the single highest-leverage practice in this whole article.

What doesn't transfer:

Determinism. A unit test that passes, passes. An eval passes at some rate. One run of one case is an anecdote; you need repeated runs and thresholds ("≥90% over 20 runs"), not checkmarks. (This is its own discipline — see regression testing non-deterministic agents.)
Binary oracles. assertEquals becomes an end-state assertion where possible, an LLM judge where it isn't — and judges must themselves be calibrated against human labels, a maintenance cost unit tests never had.
100% green. Agent eval suites hover at 70–95% by design; the frontier cases should be failing. The signal isn't "all pass" — it's the trend, and which cohorts move.
A finished test suite. Unit tests go stale slowly. Eval sets go stale fast, because the model, the users, and the product all drift. EDD is as much about curating the set as running it.
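To make the "repeated runs and thresholds" point concrete, here is a minimal sketch of a statistical assertion. The names `pass_rate`, `meets_threshold`, and `flaky_case` are illustrative assumptions, not part of any particular eval framework, and the stand-in case simply simulates an agent that passes most of the time.

```python
import random
from typing import Callable


def pass_rate(run_case: Callable[[], bool], runs: int = 20) -> float:
    """Run one eval case `runs` times and return the fraction of passing runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if run_case())
    return passes / runs


def meets_threshold(run_case: Callable[[], bool], runs: int = 20,
                    threshold: float = 0.9) -> bool:
    """Statistical assertion: pass if the observed rate is at least `threshold`."""
    return pass_rate(run_case, runs) >= threshold


# Hypothetical stand-in for a real agent call: passes roughly 95% of the time.
_rng = random.Random(0)


def flaky_case() -> bool:
    return _rng.random() < 0.95


if __name__ == "__main__":
    print("observed pass rate:", pass_rate(flaky_case, runs=20))
```

With only 20 runs the observed rate is itself noisy, so a threshold like this is best read as a coarse gate rather than a precise measurement.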

If evals are new territory entirely, start with what are AI evals; this article assumes the basics.

The loop

The day-to-day workflow looks like this:

Specify. Before building the feature (or the fix), write eval cases that define success: inputs, expected end states, rubric criteria. Ten thoughtful cases beat a hundred generated ones.
Baseline. Run the suite against the current system. Failing cases for a new capability are correct and expected — that's your red.
Build against the suite. Every prompt edit, tool change, or model swap gets an eval run. This is where reproducibility pays rent: if inputs, seeds, and model versions aren't pinned, you can't attribute a score change to your change. Synthetic Signals leans on this hard — same seed, same city, same people — because comparing run 47 to run 46 is only meaningful if the test population held still.
Gate. Merges require no regression on the core suite; releases require thresholds on the full suite. Wire it into CI like any other check.
Feed back. Every failure that reaches production becomes a permanent case. Close the loop or the suite decays into a museum.
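The Specify and Gate steps above can be sketched roughly as follows. This assumes each case has already been reduced to a per-case pass rate. `EvalCase`, `find_regressions`, `gate`, and the 0.05 tolerance are hypothetical names and values chosen for illustration, not an established API.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One eval case: an input, the end state we expect, and rubric criteria."""
    case_id: str
    user_input: str
    expected_end_state: dict
    rubric: tuple[str, ...] = ()


def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    """Return ids of cases whose pass rate dropped by more than `tolerance`.

    A case present in the baseline but missing from the candidate run
    is treated as a regression, so cases cannot silently disappear.
    """
    regressed = []
    for case_id, base_rate in baseline.items():
        new_rate = candidate.get(case_id)
        if new_rate is None or new_rate < base_rate - tolerance:
            regressed.append(case_id)
    return sorted(regressed)


def gate(baseline: dict[str, float], candidate: dict[str, float],
         tolerance: float = 0.05) -> int:
    """Return a CI-style exit code: 0 if nothing regressed, 1 otherwise."""
    regressed = find_regressions(baseline, candidate, tolerance)
    for case_id in regressed:
        print(f"REGRESSION {case_id}: {baseline[case_id]:.2f} -> {candidate.get(case_id)}")
    return 1 if regressed else 0
```

In a CI job, the script would typically end with `sys.exit(gate(baseline, candidate))`, so that a regression fails the check. Cases that were red at baseline (rate 0.0) cannot regress, which matches the idea that failing cases for a new capability are expected.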

The eval-set lifecycle

Treat the eval set as a living asset with an explicit lifecycle — seeded, grown, audited, and pruned.

Seed it from three sources. Real production failures are the highest-value cases you will ever have: each one is a documented, user-verified gap. Real production successes protect against regressions on the happy path. And synthetic cases cover what logs can't: the users you don't have yet, the rare-but-costly situations, the demographic cohorts your beta testers didn't include. This is where a synthetic population is useful even pre-launch — you can generate realistic users and conversations before you have any traffic to mine.

Grow it continuously. New feature → new cases, before the feature merges. New production incident → new case, in the same PR as the fix. New user segment → new synthetic users representing it. A healthy eval set's git history looks like the product's git history.

Retire deliberately. Cases go stale: the feature was removed, the behavior specification changed, the case tested a model quirk that no longer exists. Stale cases are worse than dead weight — they train the team to ignore red. Audit quarterly; delete or update without sentimentality. Version the set so scores stay comparable across changes ("suite v12: 91%" means something; "the suite: 91%" doesn't, if the suite silently changed).
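One way to keep scores tied to a specific version of the set is to report a content fingerprint alongside the human-chosen version label. The sketch below assumes cases are stored as JSON-serializable dicts with an `id` key; `suite_fingerprint` and `label_score` are hypothetical helpers.

```python
import hashlib
import json


def suite_fingerprint(cases: list[dict]) -> str:
    """Short content hash of the eval set; it changes whenever any case changes."""
    # Sort by id and serialize with sorted keys so that ordering
    # differences alone do not change the fingerprint.
    canonical = json.dumps(sorted(cases, key=lambda c: c["id"]),
                           sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def label_score(version: str, cases: list[dict], score: float) -> str:
    """Format a score together with the suite version it was measured on."""
    return f"suite {version} ({suite_fingerprint(cases)}): {score:.0%}"
```

If two reports carry the same version label but different fingerprints, the suite changed without a version bump, which is exactly the silent change the version label is meant to prevent.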

Guarding against overfitting to your own evals

Here's the failure mode nobody warns you about: a team six months into EDD, eval scores climbing every sprint, production quality flat. They didn't improve the agent — they taught it their test.

It happens innocently. Every prompt tweak is selected because it helps the eval set; over hundreds of iterations that's an optimization process targeting those exact cases, as surely as gradient descent. The defenses are the same ones ML research uses:

Hold out a set you never iterate against. Run it rarely — before releases, not during development. A widening gap between dev-set and held-out scores is your overfitting alarm.
Refresh inputs while keeping intent. Same scenario, new phrasing, new surface details. If scores drop sharply on paraphrases, the agent memorized wording, not behavior.
Test against fresh populations. This is a structural advantage of population-based testing: generate a new cohort of synthetic users — new people, same demographic distribution — that neither the team nor the agent has ever seen, and evaluate against them. With a seeded population, that's a one-line change: a new seed produces a statistically similar but entirely unseen city. Development iterates against one population; releases get judged by strangers.
Watch cohort spread, not just the average. Overfitting often shows up as the average rising while coverage across cohorts narrows — the agent gets better at the users who dominate the eval set and quietly worse at everyone else. Break scores down by user type before celebrating.
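Three of the defenses above reduce to small calculations. The sketch below is illustrative only: `COHORT_WEIGHTS` is a made-up distribution, and `sample_population` is a toy stand-in for a real synthetic-population generator, not how any particular product implements seeding.

```python
import random
from collections import defaultdict

# Hypothetical cohort mix; a real population would carry far richer attributes.
COHORT_WEIGHTS = {"new_user": 0.5, "power_user": 0.3, "admin": 0.2}


def sample_population(seed: int, size: int) -> list[str]:
    """Same seed gives the same population; a new seed gives new draws
    from the same underlying distribution."""
    rng = random.Random(seed)
    return rng.choices(list(COHORT_WEIGHTS),
                       weights=list(COHORT_WEIGHTS.values()), k=size)


def scores_by_cohort(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Turn (cohort, passed) pairs into a pass rate per cohort."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for cohort, passed in results:
        buckets[cohort].append(passed)
    return {cohort: sum(outcomes) / len(outcomes)
            for cohort, outcomes in buckets.items()}


def cohort_spread(by_cohort: dict[str, float]) -> float:
    """Gap between the best-served and worst-served cohort."""
    return max(by_cohort.values()) - min(by_cohort.values())


def overfitting_gap(dev_score: float, held_out_score: float) -> float:
    """Positive and growing over time suggests tuning to the dev set."""
    return dev_score - held_out_score
```

Tracking `overfitting_gap` and `cohort_spread` release over release is more informative than any single value, since the warning sign described above is a trend rather than a level.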

Making evals cheap enough to run constantly

EDD lives or dies on iteration speed. If a full run costs $40 and takes forty minutes, it runs nightly at best, and the "driven" part of eval-driven quietly dies. Tactics:

Tier the suite. A smoke tier (10–30 cases, minutes, every commit), a core tier (hundreds of cases, pre-merge), and a full tier (everything, including large simulated-user runs, pre-release and nightly).
Cheap scorers first. Run assertions and string/state checks before invoking judge models; skip the judge when the assertion already failed. Use a small model as judge where it agrees with your calibrated large judge.
Cache aggressively. Unchanged component + unchanged input = reusable result. Content-hash your prompts and fixtures.
Parallelize. Eval cases are embarrassingly parallel; wall-clock time is a solved problem if you let it be.
Sample smartly. Not every commit needs every case. Rotate a random slice of the core tier through the smoke tier so drift anywhere gets noticed within days.
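Two of these tactics, cheap scorers first and content-hashed caching, can be sketched in a few lines. Everything here is an assumption for illustration: `run_agent`, `cheap_check`, and `judge` are caller-supplied callables, and the in-memory dict stands in for whatever persistent cache a real pipeline would use.

```python
import hashlib
from typing import Callable

# In-memory cache for the sketch; a real setup would likely persist this.
_cache: dict[str, bool] = {}


def content_key(prompt: str, case_input: str) -> str:
    """Content hash of everything that determines the result."""
    h = hashlib.sha256()
    for part in (prompt, case_input):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    return h.hexdigest()


def score(output: str, cheap_check: Callable[[str], bool],
          judge: Callable[[str], bool]) -> bool:
    """Run the cheap assertion first; only call the judge if it passes."""
    if not cheap_check(output):
        return False  # the expensive judge is never invoked
    return judge(output)


def cached_eval(prompt: str, case_input: str,
                run_agent: Callable[[str, str], str],
                cheap_check: Callable[[str], bool],
                judge: Callable[[str], bool]) -> bool:
    """Reuse the stored result when prompt and input are unchanged."""
    key = content_key(prompt, case_input)
    if key not in _cache:
        _cache[key] = score(run_agent(prompt, case_input), cheap_check, judge)
    return _cache[key]
```

Note that caching freezes one sampled outcome of a non-deterministic agent. That is a reasonable trade for smoke runs, but the key should also cover the model version and any tool definitions, and repeated-run pass rates still need fresh samples.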

Honest limits

EDD is a discipline, not a guarantee. Evals only measure what you thought to encode, so they systematically miss unknown-unknowns — that's what simulation-based exploration and production monitoring are for. Judge-based scores carry judge error, so "the suite improved 2%" within judge noise is not a result. And a team can pass every gate and still ship the wrong product; evals verify behavior, not product-market fit. Keep humans reading transcripts weekly — the suite tells you whether things regressed, humans notice what kind of wrong is new. For the step-by-step of standing all this up before a launch, see how to test AI agents before production.
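As a rough illustration of why a 2% move may not be a result, the sketch below compares two pass rates using the standard error of a proportion. It covers sampling noise only, treats runs as independent, and ignores systematic judge bias, so real uncertainty is at least this large. The function names and the two-standard-error cutoff are assumptions for the example.

```python
import math


def pass_rate_std_error(rate: float, n: int) -> float:
    """Standard error of a pass rate estimated from n independent runs."""
    return math.sqrt(rate * (1 - rate) / n)


def is_clear_improvement(old_rate: float, new_rate: float,
                         n_old: int, n_new: int, z: float = 2.0) -> bool:
    """True only if the gain exceeds roughly z standard errors of the difference."""
    se_diff = math.sqrt(pass_rate_std_error(old_rate, n_old) ** 2
                        + pass_rate_std_error(new_rate, n_new) ** 2)
    return (new_rate - old_rate) > z * se_diff
```

For example, moving from 89% to 91% on 200 cases per run gives a standard error of about 3 points on the difference, so `is_clear_improvement(0.89, 0.91, 200, 200)` returns `False`: the change is not distinguishable from noise at that sample size.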

The teams that get value from EDD share one habit: they treat their eval set with the same care as their source code — reviewed, versioned, refactored, and grown with every incident. Do that, and the loop compounds. Skip it, and you have a dashboard, not a development method.

FAQ

What is eval-driven development?

Eval-driven development (EDD) is a workflow for building AI systems where evaluations are written before or alongside the feature, every change is iterated against them, and merges and releases are gated on eval results — the AI-era analog of test-driven development.

How is eval-driven development different from TDD?

TDD tests deterministic code with binary assertions; EDD tests probabilistic systems with statistical criteria. Evals rarely all pass, results need repeated runs to be meaningful, and the eval set must keep growing from real failures — a fixed suite goes stale in a way unit tests don't.

Where do eval cases come from?

Three sources: real production failures (the highest-value cases), representative production successes (regression protection), and synthetic cases that cover situations you haven't seen yet — rare user types, edge-case goals, and conversations you can't harvest from logs.

How do you stop an agent from overfitting to its eval set?

Keep a held-out set you never iterate against, refresh test inputs periodically, and evaluate against fresh synthetic users or populations the developers have never seen. If held-out scores fall further and further behind your development-set scores, you are tuning to the test, not improving the agent.