An aggregate score hides the people you let down. Break results down by cohort — age, language, income, life situation — and see exactly who your agent works for, and who it quietly leaves behind.
A 92% overall looks fine until you split it by who you're serving. Break results down by age, language preference, income and more — and the cohorts you're failing stop hiding behind the average.
Because the population is the whole city, you cover the people you'd never think to write a test for: the ESL caller, the 65-year-old with low digital literacy, the one who distrusts AI, the night-shift single parent.
ESL · benefits, 65+ · low digital literacy, distrusts AI, code-switches, single parent · night shift, screen reader, rural · low bandwidth, skips follow-ups, disputes politely, panics under stress
Use our model-graded rubric for task success, consistency, satisfaction and safety, or skip it entirely and export every transcript into the evals you already run. Every run also streams as OpenTelemetry (OTLP) traces, so results land straight in the observability and eval stack you already have. Scoring is one lens, never a black box.
Any segment of the population — age band, language preference, income level, education, household type, or a behavior. Because every citizen has structured attributes, results can be sliced any way you need.
By default we use a model-graded rubric covering dimensions like task success, consistency, satisfaction and safety. You can configure the rubric, or skip our scoring and export transcripts to your own evals.
Yes. Pull transcripts over MCP or the REST API into whatever grader or eval framework you already use, or stream every run as OpenTelemetry (OTLP) traces straight into your observability and eval stack — Langfuse, Phoenix, Datadog, or your own collector. Our scoring is one option, not a requirement.
Cohorts that score meaningfully below the overall score are flagged, so you can see where the agent quietly underperforms instead of having it hidden behind an average.
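As a rough illustration of that idea (not the product's actual flagging rule, which isn't specified here), the sketch below groups per-conversation results by one attribute and flags any cohort whose pass rate falls more than a chosen margin below the overall rate. The record shape (`language`, `passed`), the function names, and the 10-point margin are all assumptions made for the example.

```python
from collections import defaultdict


def cohort_pass_rates(results, attribute):
    """Group results by one attribute and return the pass rate per cohort."""
    counts = defaultdict(lambda: [0, 0])  # cohort -> [passed, total]
    for r in results:
        bucket = counts[r[attribute]]
        bucket[0] += 1 if r["passed"] else 0
        bucket[1] += 1
    return {cohort: p / n for cohort, (p, n) in counts.items()}


def flag_cohorts(results, attribute, margin=0.10):
    """Return the overall pass rate and the cohorts more than `margin` below it.

    `margin` is a hypothetical threshold for "meaningfully below".
    """
    overall = sum(1 for r in results if r["passed"]) / len(results)
    rates = cohort_pass_rates(results, attribute)
    flagged = {c: rate for c, rate in rates.items() if rate < overall - margin}
    return overall, flagged


# Made-up data: 90 English conversations (88 pass), 10 Spanish (4 pass).
results = (
    [{"language": "en", "passed": True}] * 88
    + [{"language": "en", "passed": False}] * 2
    + [{"language": "es", "passed": True}] * 4
    + [{"language": "es", "passed": False}] * 6
)

overall, flagged = flag_cohorts(results, "language")
print(f"overall: {overall:.0%}")
for cohort, rate in flagged.items():
    print(f"flagged {cohort}: {rate:.0%}")
```

With this made-up data the overall pass rate is 92%, yet the Spanish-language cohort passes only 4 of 10 conversations, which is the kind of gap an average conceals.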
Yes. Run the population sweep from CI over the API or CLI on every change, and fail the build when a cohort drops below the bar you set.
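A minimal sketch of such a gate, assuming the sweep's per-cohort scores have already been saved to a JSON file shaped like `{"cohort name": score}` with scores between 0 and 1. The file shape, the function names, and the script name in the comment are hypothetical; the actual API and CLI commands are not shown here.

```python
import json


def failing_cohorts(cohort_scores, bar):
    """Return (cohort, score) pairs below the bar, worst first."""
    return sorted(
        ((c, s) for c, s in cohort_scores.items() if s < bar),
        key=lambda pair: pair[1],
    )


def run_gate(path, bar):
    """Read cohort scores from `path` and return a process exit code.

    Assumed file shape: {"cohort name": score in [0, 1]}.
    Returns 1 if any cohort is below `bar`, otherwise 0.
    """
    with open(path) as f:
        cohort_scores = json.load(f)
    failures = failing_cohorts(cohort_scores, bar)
    for cohort, score in failures:
        print(f"FAIL {cohort}: {score:.2f} < {bar:.2f}")
    return 1 if failures else 0


# In a CI step, a wrapper script (hypothetically cohort_gate.py) could end with:
#     import sys
#     sys.exit(run_gate("cohort_scores.json", 0.85))
```

A non-zero exit code is what most CI systems treat as a failed step, so the build stops as soon as any cohort slips under the bar, even if the overall score still looks healthy.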
See exactly which cohorts your agent fails — before your users find out.