June 4, 2026 · 6 min read

AI Agent Benchmarks Explained: τ-bench, GAIA, SWE-bench

What AI agent benchmarks actually measure — τ-bench, GAIA, SWE-bench — what scores tell you about your own agent, and how to build an internal benchmark.

Bob Miagi

AI evals

AI agent benchmarks are standardized task suites for comparing agents: τ-bench tests tool-using customer-service agents against simulated users and policies, GAIA tests general assistants on human-easy, model-hard questions, and SWE-bench tests resolving real GitHub issues. They're good for comparing models — and poor evidence that your agent works for your users.

Why agent benchmarks exist

LLM benchmarks (MMLU-style question sets) stopped being informative for agents almost immediately: answering multiple-choice questions says nothing about whether a system can plan, call tools, hold a conversation, and follow rules. So a second generation of benchmarks emerged that scores behavior over a trajectory rather than a single response. Three of them come up in nearly every agent conversation, and they measure genuinely different things.

τ-bench: can your agent survive a real conversation?

What it is. τ-bench (tau-bench), from Sierra, benchmarks Tool-Agent-User interaction in two customer-service-like domains: retail and airline. The agent gets a set of domain API tools and a policy document (what it may and may not do — e.g., rules about cancellations or exchanges). The twist: the "user" is simulated by an LLM, so the agent faces a dynamic multi-turn conversation with evolving requests, not a fixed prompt.

What it scores. Success is judged by the final database state and by whether the required information reached the user: did the right orders get modified, per policy? Its signature metric is pass^k: the probability that the agent succeeds on the same task in every one of k independent runs. That's a reliability metric, not an ability metric, and it's brutal. In the original results, agents that looked respectable at pass^1 collapsed as k grew (GPT-4o fell below 25% at pass^8 on the retail domain). An agent that succeeds "usually" fails pass^k, exactly the way it would fail a real deployment.
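To make pass^k concrete, here is a minimal sketch of one way to estimate it when each task has been run more than k times. It uses a combinatorial estimate: out of all the ways to pick k of the recorded runs, what fraction consists only of successes? The function names and the sample data are hypothetical, and this is an illustration of the idea rather than τ-bench's own evaluation code.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k runs of one task all succeed.

    With `successes` out of `trials` recorded runs, this is the fraction of
    k-run subsets made up entirely of successes: C(successes, k) / C(trials, k).
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # math.comb returns 0 when k > successes, so the estimate is 0 in that case.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task_successes: list[int], trials: int, k: int) -> float:
    """Average the per-task estimate over every task in the suite."""
    estimates = [pass_hat_k(c, trials, k) for c in per_task_successes]
    return sum(estimates) / len(estimates)


# Hypothetical data: 4 tasks, 8 runs each, count of successful runs per task.
successes_per_task = [8, 7, 6, 8]
for k in (1, 2, 4, 8):
    print(f"pass^{k} = {benchmark_pass_hat_k(successes_per_task, trials=8, k=k):.3f}")
```

In this made-up example the agent succeeds on about 91% of individual runs, yet only half of the tasks succeed in all eight runs. The same shrinkage appears whenever a task succeeds most of the time but not every time.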

Why it matters. τ-bench validated two ideas the rest of the field now takes seriously: simulated users are a legitimate way to test agents (more on that in user simulation for AI agents), and one-shot success rates overstate agent quality because they ignore consistency.

GAIA: human-easy, model-hard

What it is. GAIA (from Meta AI, Hugging Face, and collaborators) is a benchmark for general AI assistants: 466 real-world questions requiring reasoning, web browsing, multi-modal handling, and tool use. Questions are split into three levels of increasing difficulty: Level 1 needs few steps and little or no tooling, Level 2 needs more steps and a combination of tools, and Level 3 needs long multi-step sequences.

What it scores. Exact-match answers to questions that are conceptually simple for people but require chaining lookups and reasoning. The paper's headline framing: human respondents scored about 92%, GPT-4 with plugins around 15% at release. That inversion — trivial for humans, brutal for models — is the point; most benchmarks before it were the other way around.
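Exact-match scoring is simple to sketch, and the sketch shows why it is strict. The version below normalizes case, whitespace, thousands separators, and punctuation before comparing. These normalization rules are assumptions chosen for illustration; GAIA's official scorer defines its own rules, and this is not a copy of it.

```python
import re
import string

# Punctuation to strip, keeping "." so decimals such as "3.5" survive.
_PUNCTUATION = set(string.punctuation) - {"."}


def normalize(answer: str) -> str:
    """Reduce an answer to a canonical form for comparison."""
    text = answer.strip().lower()
    # Drop thousands separators inside numbers: "1,200" -> "1200".
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    # Remove the remaining punctuation, then any trailing full stop.
    text = "".join(ch for ch in text if ch not in _PUNCTUATION).rstrip(".")
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def exact_match(predicted: str, expected: str) -> bool:
    """True only when the normalized strings are identical."""
    return normalize(predicted) == normalize(expected)


print(exact_match(" Paris. ", "paris"))       # formatting differences are forgiven
print(exact_match("1,200", "1200"))           # so are thousands separators
print(exact_match("Paris, France", "Paris"))  # extra content is not
```

There is no partial credit: an answer that is nearly right, or right with extra words around it, scores the same as a wrong one.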

Why it matters. GAIA became the default yardstick for "general agent" and deep-research-style systems. Its limitation is equally clear: it tests solitary question-answering with tools, not conversation, not policy compliance, not side effects on real systems.

SWE-bench: can it fix real code?

What it is. SWE-bench (Princeton) asks whether a system can resolve real GitHub issues: 2,294 problems drawn from issues and their corresponding pull requests across 12 popular open-source Python repositories. The agent gets the issue text and the repository, and must produce a patch.

What it scores. The patch is applied and the repo's real unit tests are run, using the post-PR behavior as the reference. No judge, no rubric: the test suite decides. Because raw SWE-bench contains some ambiguous or unfairly specified issues, SWE-bench Verified, a 500-issue subset human-validated for solvability, became the standard reporting set.
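The pass/fail decision can be sketched as a set comparison. The sketch below loosely follows the split SWE-bench uses between tests that the fix must turn from failing to passing and tests that must keep passing. The class, the function, and the test names are hypothetical, and the real harness also handles environment setup and test execution, which are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Tests that define success for one issue (hypothetical structure)."""
    fail_to_pass: frozenset[str]  # failed before the reference fix, must pass now
    pass_to_pass: frozenset[str]  # passed before the fix, must still pass


def is_resolved(spec: TaskSpec, passed_after_patch: set[str]) -> bool:
    """A patch resolves the issue only if it fixes the bug and breaks nothing."""
    fixed = spec.fail_to_pass <= passed_after_patch
    no_regressions = spec.pass_to_pass <= passed_after_patch
    return fixed and no_regressions


spec = TaskSpec(
    fail_to_pass=frozenset({"test_parse_empty_header"}),
    pass_to_pass=frozenset({"test_parse_basic", "test_parse_unicode"}),
)

# The patch fixes the target test but breaks an existing one: not resolved.
print(is_resolved(spec, {"test_parse_empty_header", "test_parse_basic"}))
# The patch fixes the target test and keeps the rest green: resolved.
print(is_resolved(spec, {"test_parse_empty_header", "test_parse_basic", "test_parse_unicode"}))
```

Scoring on test outcomes rather than on the text of the diff means two very different patches can both count as correct, which is the property worth borrowing for an internal benchmark.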

Why it matters. It's the most execution-grounded of the three and the de facto scoreboard for coding agents. It's also the narrowest and the most exposed to gaming: it covers one language, a fixed set of repos, and issues that existed publicly on GitHub, which leads directly to the next section.

What benchmark scores don't tell you

A benchmark score is a measurement of one distribution of tasks. Your agent runs on a different one. The gaps are systematic, not random:

Task mismatch. Your support agent doesn't answer GAIA trivia; your internal tooling agent doesn't file airline exchanges. A model's τ-bench score tells you about the model's conversational tool use, not about your prompts, your tools, your policies, or your retrieval setup — which is most of what determines whether your agent works.
Contamination and overfitting. Benchmark tasks are public. They leak into training data, and labs tune against them — sometimes explicitly, sometimes just by the gravity of the leaderboard. Scores drift up over time faster than real-world capability does. SWE-bench, built from public GitHub history, is especially exposed.
The average hides the variance. Most leaderboards report pass@1-style averages. τ-bench's pass^k showed how much that flatters agents; the same flattery applies to your internal one-run eval. (This is the core of regression testing non-deterministic agents.)
No population dimension. Benchmarks vary the task; they rarely vary the person. τ-bench's simulated users are a start, but no public benchmark asks whether your agent works as well for a 70-year-old non-native speaker as for the median tester. Production asks exactly that.
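The point about averages hiding variance can be shown with a few lines of arithmetic. The sketch below compares two hypothetical agents with the same one-run average. It assumes each task has a fixed success probability and that runs are independent, which is a simplification; the agents and their numbers are invented for illustration.

```python
def one_run_average(task_success_rates: list[float]) -> float:
    """Mean single-run success rate across tasks (a pass@1-style number)."""
    return sum(task_success_rates) / len(task_success_rates)


def all_k_success(task_success_rates: list[float], k: int) -> float:
    """Mean probability that k independent runs of a task all succeed."""
    return sum(p ** k for p in task_success_rates) / len(task_success_rates)


# Hypothetical per-task success probabilities for two agents.
steady = [0.75, 0.75, 0.75, 0.75]  # somewhat flaky on every task
lumpy = [1.0, 1.0, 1.0, 0.0]       # perfect on three tasks, hopeless on one

for name, rates in (("steady", steady), ("lumpy", lumpy)):
    print(name, one_run_average(rates), round(all_k_success(rates, k=4), 3))
```

Both agents average 0.75 on a single run. Over four runs, the steady agent succeeds every time on only about a third of tasks, while the lumpy agent stays at 0.75 but never completes one task at all. A one-run average cannot tell these two failure patterns apart, and they call for different fixes.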

None of this makes benchmarks useless. They're the right tool for one decision — which base model or architecture should I build on? — and the wrong tool for the decision that follows: does my agent work?

Building an internal benchmark that reflects your users

The honest response to benchmark limits isn't cynicism; it's building the benchmark that public suites can't build for you — one shaped like your production traffic. The recipe:

Take the task distribution from reality. Enumerate what users actually ask your agent, weighted roughly by frequency, plus the rare-but-costly cases. Seed from logs where you have them; synthesize where you don't.
Take the user distribution from reality too. Vary who's asking — age, language, patience, tech fluency, context — not just what's asked. This is where population-based testing comes in: instead of twenty hand-written test users, run the agent against thousands of distinct simulated people with real demographic grounding. Synthetic Signals's approach is to synthesize a whole city from Census data so the test population's shape resembles a real one instead of the test author's imagination.
Adopt the benchmark ideas worth stealing. Score end state, not phrasing (SWE-bench). Use simulated users for multi-turn realism (τ-bench). Report pass^k-style consistency, not one-run success (τ-bench again).
Report coverage, not a single number. Break results down by task type and by user cohort, so the output is "here's who it fails for," not "here's the score." See cohort coverage for what that view looks like.
Freeze it, then grow it. Pin seeds, model versions, and tasks so runs are comparable over time, and add every new production failure as a case. Your internal benchmark should be a living regression suite, not an annual report.
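As a rough illustration of the reporting step in the recipe above, the sketch below groups repeated-run results by task type and by user cohort, and counts a case as passing only if every run succeeded. The record structure, cohort labels, and results are all hypothetical; a real suite would load them from pinned run logs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One benchmark case: a task run several times for one simulated user."""
    task_type: str
    cohort: str
    runs: tuple[bool, ...]  # outcome of each repeated run


def coverage(results: list[CaseResult], key: str) -> dict[str, float]:
    """Share of cases per group (by `key`) where every run succeeded."""
    totals: dict[str, int] = defaultdict(int)
    consistent: dict[str, int] = defaultdict(int)
    for result in results:
        group = getattr(result, key)
        totals[group] += 1
        consistent[group] += all(result.runs)  # True counts as 1
    return {group: consistent[group] / totals[group] for group in totals}


# Hypothetical results: 2 task types x 2 cohorts, 3 runs per case.
results = [
    CaseResult("refund", "18-34, native speaker", (True, True, True)),
    CaseResult("refund", "65+, non-native speaker", (True, False, True)),
    CaseResult("exchange", "18-34, native speaker", (True, True, True)),
    CaseResult("exchange", "65+, non-native speaker", (False, False, True)),
]

print(coverage(results, "task_type"))
print(coverage(results, "cohort"))
```

In this invented data the task-type view shows 0.5 for both refunds and exchanges, which looks like an even, middling agent. The cohort view shows 1.0 for one group and 0.0 for the other, which is the "here's who it fails for" answer that a single score hides.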

A useful mental model: public benchmarks are standardized exams; your internal benchmark is the job interview. Passing the exam gets a model considered. Only the interview — tasks like yours, users like yours, scored the way you define success — tells you whether to ship. For how internal benchmarks fit into a full evaluation practice, see the AI agent evaluation guide.

Quick reference

Benchmark	Tests	Interaction	Scored by	Signature idea
τ-bench	Customer-service tool agents	Multi-turn, simulated user	Final DB state + policy	pass^k reliability
GAIA	General assistants	Single task, tools/web	Exact-match answer	Human-easy, model-hard
SWE-bench	Coding agents	Repo + issue, no user	Real unit tests	Execution-grounded, 2,294 real issues

Use them to pick your foundation. Then build the benchmark that actually looks like your users — because that's the one your launch depends on.

FAQ

What is τ-bench?

τ-bench (tau-bench) is a benchmark from Sierra that tests tool-agent-user interaction in customer-service-style domains (retail and airline). An LLM-simulated user converses with the agent, which must complete tasks via API tools while following a domain policy document. Its pass^k metric measures whether the agent succeeds consistently across repeated runs, not just once.

What does the GAIA benchmark measure?

GAIA measures general AI assistant ability on 466 real-world questions requiring reasoning, web browsing, multi-modality, and tool use, split across three difficulty levels. Its defining property is that the questions are conceptually easy for humans and hard for models — in the original paper, humans scored about 92% while GPT-4 with plugins scored around 15%.

What is SWE-bench?

SWE-bench evaluates whether an agent can resolve real GitHub issues. It contains 2,294 problems drawn from issues and pull requests across 12 popular Python repositories; a proposed fix passes if it makes the repository's real unit tests pass. SWE-bench Verified is a 500-issue, human-validated subset.

Do benchmark scores predict how well an agent will work for my users?

Only loosely. Benchmarks measure performance on their own task distribution, which is almost never your task distribution, your policies, or your users. They are useful for comparing models; they are not evidence that your specific agent works. For that you need an internal benchmark built from your own tasks and user population.