Topic

AI evals

Evaluation, from foundations to practice: what evals are, how LLM judges work (and where they're biased), which metrics matter for agents, and what public benchmarks do and don't tell you.

June 29, 2026 · 7 min read

How to Test AI Agents Before Production

How to test AI agents before production: a 7-step method — define success, build a realistic user population, run multi-turn tests, score, and gate.

Agent testing · AI evals

June 22, 2026 · 8 min read

LLM-as-a-Judge: The Definitive Guide

How LLM-as-a-judge works: judge designs, writing rubric prompts, the known biases (position, verbosity, self-preference), and how to mitigate each.

AI evals

June 18, 2026 · 7 min read

AI Agent Evaluation: Metrics, Methods, and Framework

A practical guide to AI agent evaluation: outcome vs. trajectory metrics, four evaluation methods, and a step-by-step framework for running agent evals.

AI evals · Agent testing

June 11, 2026 · 7 min read

What Are AI Evals? A Plain-English Guide

AI evals explained: what an eval is (task + data + scoring), how LLM evals differ from agent evals, and how to write your first 20 eval cases.

AI evals

June 8, 2026 · 6 min read

Multi-Turn Evaluation: Testing the Whole Conversation

Why single-turn evals miss real failures, and how multi-turn evaluation works: scripted flows, simulated users, and conversation-level scoring.

AI evals · Agent testing

June 4, 2026 · 6 min read

AI Agent Benchmarks Explained: τ-bench, GAIA, SWE-bench

What AI agent benchmarks actually measure — τ-bench, GAIA, SWE-bench — what scores tell you about your own agent, and how to build an internal benchmark.

AI evals

May 21, 2026 · 6 min read

Eval-Driven Development for AI Agents

Eval-driven development means writing evals before you build, iterating against them, and gating releases on results. How the loop works — and its limits.

AI evals · Agent engineering