Synthetic Signals

LLM-as-a-Judge: The Definitive Guide

How LLM-as-a-judge works: judge designs, writing rubric prompts, the known biases (position, verbosity, self-preference) and how to mitigate each.

LLM-as-a-judge is an evaluation method where one language model grades the outputs of another — against a rubric, a reference answer, or a competing response. It works because judging text is easier than writing it, and it won because it's the only scoring method that scales like code while reading nuance like a human.

What it is, and why it won

Before LLM judges, teams evaluating open-ended output had two options: string metrics (BLEU, ROUGE, exact match) that miss meaning, or human review that costs real money and days of turnaround. LLM-as-a-judge splits the difference. You write a prompt that describes what "good" means, hand the judge an output (and optionally a reference or a rival output), and get back a structured verdict — in seconds, for thousands of cases, on every commit.

The core insight is asymmetry: verifying quality is an easier task than producing it. A model that can't reliably write a perfect support answer can still reliably notice that an answer ignored the customer's actual question. The approach was popularized by research like MT-Bench (Zheng et al., 2023), which found that strong judge models can agree with human preferences at levels comparable to human-human agreement — while also documenting the biases covered below. Both findings matter. A judge is a genuinely useful instrument, and it is an instrument with known systematic error.

If you're new to evals generally, start with "what AI evals are". A judge is one of three scoring methods, and it should only get the cases that code-based assertions can't handle.

The four judge designs

1. Single-output scoring with a rubric

The judge sees one output plus a rubric and returns a score or pass/fail. This is the workhorse design: simple to run, produces per-case scores you can track over time, and maps directly onto CI-style gating ("fail the build if pass rate drops below X").

Its weakness is calibration drift: absolute scores are anchored only by the rubric's wording, so "4/5" can mean different things across judge models, model versions, or even rubric phrasings. Mitigate with anchored scales (described below) and by comparing scores only within a fixed judge setup.
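
The CI-style gate described above can be sketched in a few lines. This is a minimal illustration rather than a prescribed implementation: the function names `pass_rate` and `gate`, the sample verdicts, and the 0.8 threshold are all assumptions made for the example.

```python
from typing import Sequence


def pass_rate(verdicts: Sequence[bool]) -> float:
    """Fraction of cases the judge marked as passing."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(verdicts) / len(verdicts)


def gate(verdicts: Sequence[bool], min_pass_rate: float) -> bool:
    """Return True when the pass rate meets the threshold, False otherwise."""
    return pass_rate(verdicts) >= min_pass_rate


# Hypothetical per-case pass/fail verdicts from one judge run.
verdicts = [True, True, False, True]
print(pass_rate(verdicts))                 # 0.75
print(gate(verdicts, min_pass_rate=0.8))   # False
```

Because absolute scores drift between judge setups, a threshold like this is only meaningful while the judge model and rubric stay fixed.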

2. Pairwise comparison

The judge sees two outputs for the same input and picks the better one (or declares a tie). Pairwise judgments are easier and more reliable than absolute scores — "which is better?" is a more natural question than "how good is this on a 1–10 scale?" — which is why arena-style leaderboards use them.

The cost: results are relative. Pairwise tells you prompt B beats prompt A; it doesn't tell you whether either is good enough to ship. Use it for A/B decisions (model swaps, prompt changes), not for absolute quality gates.
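
Pairwise judges are sensitive to answer order (see the position bias entry in the biases section), so a common pattern is to run each comparison in both orders and accept a winner only when the two runs agree. The sketch below assumes a hypothetical judge callable that returns "first", "second", or "tie"; the stub judges stand in for a real model call.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def compare_both_orders(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie"; a winner counts only if both orders agree."""
    forward = judge(prompt, a, b)    # A is shown first
    backward = judge(prompt, b, a)   # B is shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: it always picks the first slot."""
    return "first"


def prefers_x(prompt: str, first: str, second: str) -> str:
    """Stub judge that prefers the answer "x" regardless of its slot."""
    return "first" if first == "x" else "second"


print(compare_both_orders(always_first, "q", "x", "y"))  # tie
print(compare_both_orders(prefers_x, "q", "x", "y"))     # A
```

The position-biased stub collapses to a tie, which is the point: a preference that flips with the order carries no signal about the answers.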

3. Reference-guided grading

The judge compares the output against a known-good reference answer and grades correctness or equivalence. This is the strongest design when references exist, because the judge's job shrinks from "evaluate quality" to "check semantic equivalence" — a much easier, less bias-prone task.

The limit is obvious: you need references, which means someone wrote them. Practical for factual QA, extraction, and tasks with canonical answers; impractical for open-ended generation.
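
As a rough illustration of how a reference narrows the judge's job, the sketch below builds a prompt that asks only about semantic equivalence and parses a one-word reply. The prompt wording, the "equivalent"/"different" labels, and both function names are assumptions for the example, not a standard format.

```python
def build_reference_prompt(question: str, reference: str, candidate: str) -> str:
    """Ask only whether the candidate matches the reference in meaning."""
    return (
        "You are checking an answer against a known-good reference.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate state the same facts as the reference? "
        "Ignore differences in wording and length.\n"
        'Reply with exactly one word: "equivalent" or "different".'
    )


def parse_equivalence(reply: str) -> bool:
    """Map the one-word reply to a boolean; reject anything unexpected."""
    word = reply.strip().strip('".').lower()
    if word not in {"equivalent", "different"}:
        raise ValueError(f"unexpected judge reply: {reply!r}")
    return word == "equivalent"


prompt = build_reference_prompt("Capital of France?", "Paris", "It is Paris.")
print(parse_equivalence("Equivalent."))  # True
```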

4. Chain-of-thought / G-Eval-style scoring

The judge is instructed to reason before scoring: analyze the output against each criterion step by step, then emit the verdict. G-Eval (Liu et al., 2023) formalized this — generate evaluation steps from the criteria, walk through them, score with a structured form. Reasoning-then-scoring generally improves agreement with humans and, as a bonus, gives you an audit trail: when a case fails, the judge's reasoning tells you why, which turns the eval into a debugging tool.

The trade-offs are cost (more tokens per judgment) and rationalization risk — the reasoning the judge writes is not guaranteed to be the actual cause of its score. Treat the explanation as a lead, not as ground truth.

| Design | Output | Best for | Watch out for |
| --- | --- | --- | --- |
| Single-output + rubric | Absolute score / pass-fail | Tracking quality over time, CI gates | Score calibration drift |
| Pairwise | A vs. B preference | Comparing prompts, models, versions | Position bias; no absolute bar |
| Reference-guided | Correctness vs. reference | Factual QA, extraction | Needs references |
| CoT / G-Eval | Reasoned score | Complex criteria, debugging failures | Cost; rationalized explanations |

Writing a good judge prompt

Most judge failures are prompt failures. The pattern that works:

Make the rubric concrete. "Rate helpfulness 1–5" produces noise. Define each level in behavioral terms: "5 = resolves the user's stated problem completely and anticipates the obvious follow-up; 3 = addresses the problem but requires the user to ask again for specifics; 1 = ignores or misreads the problem." If two humans reading your rubric would disagree on a case, the judge will be inconsistent on it too.

Anchor with few-shot examples. Include 2–4 graded examples — ideally real outputs with the scores a human gave them, including at least one borderline case. Examples calibrate the scale far better than adjectives do.

Force structured output. Have the judge emit JSON, for example {"reasoning": "...", "failed_criteria": [...], "score": 4}, or at minimum a fixed final line you can parse. Free-text verdicts rot your pipeline.

One judgment per call. A judge asked to score helpfulness, accuracy, and tone in one pass blurs them together. Separate criteria into separate calls, or at least separate fields with separate rubric sections.

Reason first, score last. Ask for the analysis before the verdict, so the score is conditioned on the reasoning rather than the reasoning rationalizing a snap score.
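
The pattern above can be put together in one small sketch: anchored levels, graded examples, a single criterion, and a JSON reply with the reasoning ahead of the score. The level text comes from the rubric example above; the function names, the prompt wording, and the shape of the example dictionaries are assumptions made for illustration.

```python
import json
from typing import Dict, List

# Level definitions from the rubric example above; levels 2 and 4 are left undefined here.
LEVELS: Dict[int, str] = {
    5: "resolves the user's stated problem completely and anticipates the obvious follow-up",
    3: "addresses the problem but requires the user to ask again for specifics",
    1: "ignores or misreads the problem",
}


def build_judge_prompt(question: str, answer: str, examples: List[dict]) -> str:
    """Assemble a one-criterion prompt: anchored levels, graded examples, JSON reply."""
    levels = "\n".join(f"{score} = {LEVELS[score]}" for score in sorted(LEVELS, reverse=True))
    shots = []
    for ex in examples:
        # Reasoning is serialized before the score, matching the requested reply shape.
        verdict = json.dumps({"reasoning": ex["reasoning"], "score": ex["score"]})
        shots.append(f"Answer: {ex['answer']}\nVerdict: {verdict}")
    parts = [
        "Score the answer for helpfulness only.",
        "Levels:\n" + levels,
        "Graded examples:\n" + "\n\n".join(shots),
        f"Question: {question}\nAnswer: {answer}",
        'Reply with JSON only, reasoning before score: {"reasoning": "...", "score": <1-5>}',
    ]
    return "\n\n".join(parts)


def parse_verdict(raw: str) -> dict:
    """Parse the judge's reply; raise ValueError on anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    score = data.get("score") if isinstance(data, dict) else None
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score missing or out of range: {score!r}")
    return {"reasoning": str(data.get("reasoning", "")), "score": score}


print(parse_verdict('{"reasoning": "Answers the question directly.", "score": 4}'))
```

Failing loudly on a malformed reply is a deliberate choice here: a silently defaulted score would be counted as a real judgment downstream.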

The known biases — and their mitigations

These are documented, systematic, and mostly fixable. Ignore them and your eval numbers will be precise, confident, and wrong.

| Bias | What happens | Mitigation |
| --- | --- | --- |
| Position bias | In pairwise comparisons, the judge favors whichever answer appears first (or last, depending on the model) | Run every comparison twice with order swapped; count it a tie unless both orders agree |
| Verbosity bias | Longer answers score higher, independent of quality | State explicitly that length is not a virtue; penalize padding in the rubric; check score-vs-length correlation in your results |
| Self-preference / same-model bias | A judge rates outputs from its own model family higher | Judge with a different model family than the one being evaluated; or use a panel of judges from different families |
| Score clustering | On numeric scales, the judge defaults to a narrow band (everything is a 7) | Prefer pass/fail or small anchored scales (1–3, 1–5 with defined levels) over 1–10; use few-shot anchors spanning the full range |
| Sycophancy toward stated intent | An answer that claims to follow instructions gets credit for following them | Reference-guided grading where possible; rubric lines that check outcomes, not assertions |
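
The score-vs-length check for verbosity bias can be as simple as a correlation over your results. The sketch below uses a hand-written Pearson correlation, which is one reasonable choice among several; the sample lengths and scores are made up for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical results: answer length in characters and the judge's score.
lengths = [120, 340, 560, 800, 950]
scores = [2, 3, 3, 5, 4]
print(round(pearson(lengths, scores), 2))
```

A strong positive value does not prove bias on its own, since longer answers may simply be better; it is a prompt to read a sample of long, high-scoring outputs by hand.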

And the one universal mitigation: calibrate against humans. Take 50–100 cases, label them yourself (or with domain experts), run the judge on the same cases, and measure agreement. If judge and humans disagree substantially, fix the rubric and re-measure before trusting anything the judge produces at scale. Re-calibrate whenever you change the judge model or the rubric. A judge without a calibration set isn't a measurement — it's an opinion with a dashboard.
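
Measuring agreement on the calibration set can be sketched as follows. Raw agreement is the simplest number; Cohen's kappa is added here as one common way to correct for chance agreement, which is a choice made for this example and not something the method requires. The labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Fraction of cases where judge and human gave the same label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
print(agreement_rate(human, judge))  # 0.75
print(cohens_kappa(human, judge))    # 0.5
```

The gap between the two numbers is why raw agreement alone can flatter a judge: when most cases pass, a judge that passes everything still agrees with humans most of the time.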

When not to use a judge

  • When code can check it. Format compliance, schema validity, exact values, whether a tool was called — assertions are cheaper, deterministic, and bias-free. Judges are for what assertions can't reach.
  • When the domain outruns the judge. Specialized medical, legal, or deep-technical correctness can exceed what a general judge model reliably knows. A confident judge in a domain it half-knows is worse than no judge.
  • When stakes require accountability. For decisions with real consequences, a judge can triage and pre-screen, but a human signs off.
  • When you can't calibrate. No human labels means no way to know whether the judge measures anything. Get labels first, even a small set.
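
For the first case in the list above, deterministic checks might look like the sketch below. The required keys, the tool names, and the idea that tool calls arrive as a list of names are all hypothetical; real agent traces will have their own format.

```python
import json
from typing import List, Set


def check_output(raw: str, required_keys: Set[str]) -> List[str]:
    """Return a list of failures; an empty list means every check passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    missing = sorted(required_keys - set(data))
    return [f"missing key: {key}" for key in missing]


def tool_was_called(called_tools: List[str], name: str) -> bool:
    """True if the named tool appears anywhere in the recorded tool calls."""
    return name in called_tools


print(check_output('{"intent": "refund"}', {"intent", "order_id"}))  # ['missing key: order_id']
print(tool_was_called(["lookup_order", "issue_refund"], "issue_refund"))  # True
```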

Judging agents: turn-level vs. trajectory-level

Everything above was framed around single outputs. Agents complicate it, because an agent produces a trajectory — tool calls, state changes, and a conversation that unfolds over turns.

Turn-level judging scores each agent response in context: given the conversation so far, was this reply appropriate? It localizes failures ("turn 4 is where it lost the thread") and is the right level for debugging.

Trajectory-level judging scores the whole episode: did the user's task actually get done, in a reasonable number of steps, without side effects? This is the level that matters for shipping decisions, because agents can produce individually reasonable turns that add up to a failed task — polite, fluent, and never actually booking the appointment.

Use both: trajectory-level as the headline metric, turn-level as the diagnostic when a trajectory fails. Multi-turn test conversations are a prerequisite for either. See "multi-turn evaluation" for how to construct them, and "how to test AI agents before production" for where judging fits in the full testing loop.
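
One way to hold both levels side by side is sketched below. The `Episode` structure and its field names are assumptions for the example; the verdicts would come from whatever turn-level and trajectory-level judges you run.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    task_completed: bool     # trajectory-level verdict for the whole episode
    turn_passed: List[bool]  # turn-level verdicts, in conversation order


def task_success_rate(episodes: List[Episode]) -> float:
    """Headline metric: fraction of episodes where the task got done."""
    if not episodes:
        raise ValueError("no episodes to aggregate")
    return sum(ep.task_completed for ep in episodes) / len(episodes)


def first_failing_turn(episode: Episode) -> Optional[int]:
    """1-based index of the first failed turn, or None if every turn passed."""
    for index, passed in enumerate(episode.turn_passed, start=1):
        if not passed:
            return index
    return None


episodes = [
    Episode(task_completed=True, turn_passed=[True, True]),
    Episode(task_completed=False, turn_passed=[True, True, True]),   # fine turns, failed task
    Episode(task_completed=False, turn_passed=[True, False, True]),
]
for ep in episodes:
    if not ep.task_completed:
        print(first_failing_turn(ep))  # None, then 2
```

The second episode is the case the trajectory-level judge exists for: every turn passes, and the task still fails.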

One note on how we think about this at Synthetic Signals: scoring is deliberately bring-your-own. Synthetic Signals generates the test conversations — a synthetic population talking to your agent — and you define the lens: an LLM judge with your rubric, assertions, or a custom metric. There's no imposed "one true score," because the judge design that fits a support agent doesn't fit a scheduling agent. What the platform adds is the breakdown: whatever lens you bring gets reported by cohort, so a judge score isn't one average but a map of who passes and who doesn't.
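
The idea of reporting a score by cohort, independent of any particular platform, can be sketched generically. This is not the Synthetic Signals API; the function name, the cohort labels, and the (cohort, passed) input shape are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_cohort(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (cohort, passed) pairs and return the pass rate for each cohort."""
    totals = defaultdict(lambda: [0, 0])  # cohort -> [passed count, total count]
    for cohort, passed in results:
        totals[cohort][0] += int(passed)
        totals[cohort][1] += 1
    return {cohort: passed / total for cohort, (passed, total) in totals.items()}


# Hypothetical judge verdicts tagged with the cohort of the simulated user.
results = [
    ("new_user", True),
    ("new_user", False),
    ("returning", True),
    ("returning", True),
]
print(pass_rate_by_cohort(results))  # {'new_user': 0.5, 'returning': 1.0}
```

The overall pass rate here is 0.75, which hides that one cohort fails half the time.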

The takeaway

LLM-as-a-judge earned its place: it's the only scoring method that combines near-human reading of nuance with per-commit economics. Treat it like any instrument: pick the right design for the question, write rubrics a human grader could follow, correct for the known biases, and calibrate against human labels before believing it. Do that, and a judge turns open-ended quality into something you can actually regression-test.

FAQ

What is LLM-as-a-judge?

LLM-as-a-judge is an evaluation technique where one language model scores the outputs of another against a rubric, a reference answer, or a competing output. It automates the kind of quality judgment that previously required human reviewers, at a fraction of the cost and time.

How accurate is LLM-as-a-judge?

It depends on the task, the rubric, and the judge model. Research on strong judge models has found agreement with human preferences comparable to agreement between humans on some tasks — but judges carry systematic biases, so you should always calibrate against a small human-labeled sample before trusting the scores.

What are the main biases of LLM judges?

The best-documented ones are position bias (favoring an answer because of where it appears in a pairwise comparison, usually the first slot), verbosity bias (favoring longer answers), self-preference bias (favoring outputs from the same model family), and score clustering (defaulting to a narrow band like 7/10). Each has a known mitigation.

When should you not use an LLM judge?

When the property can be checked in code (format, exact values, tool calls), when the domain requires expert knowledge the judge lacks, when stakes are high enough to require human accountability, and when you have no human labels to calibrate against.
