Coverage

Find the cohort you're failing.

An aggregate score hides the people you let down. Break results down by cohort — age, language, income, life situation — and see exactly who your agent works for, and who it quietly leaves behind.

See coverage by cohort Start testing your agent

Overall

Age 18–34

Age 65+

English

Spanish-pref

Low income

By cohortscores broken down, not averaged

100%of the population, not a sample

Auto-flagfailing cohorts surfaced for you

BYO evalsour rubric or your own

Segmentation

Scores by cohort, not just an average

A 92% overall looks fine until you split it by who you're serving. Break results down by age, language preference, income and more — and the cohorts you're failing stop hiding behind the average.

Every result attributable to a real segment
Failing cohorts flagged automatically
See who your agent works for — and who it doesn't
The same cohort lens powers interview and survey reports

Overall

Age 18–34

Age 65+

English

Spanish-pref

Low income

The long tail

The edge cases you didn't test

Because the population is the whole city, you cover the people you'd never think to write a test for: the ESL caller, the 65-year-old with low digital literacy, the one who distrusts AI, the night-shift single parent.

Every language level and life situation
Behaviors that break brittle happy-path agents
Surface failures before launch, not after

Edge cases in the population

ESL · benefits65+ · low digital literacydistrusts AIcode-switchessingle parent · night shiftscreen readerrural · low bandwidthskips follow-upsdisputes politelypanics under stress

Scoring

Score your way, or bring your own

Use our model-graded rubric for task success, consistency, satisfaction and safety — or skip it entirely and export every transcript into the evals you already run. Every run also streams as OpenTelemetry (OTLP) traces, so results land straight in the observability and eval stack you already have. Scoring is one lens, never a black box.

Model-graded rubric out of the box
Export transcripts (via API or MCP) into your own evals
Stream runs as OpenTelemetry (OTLP) — Langfuse, Phoenix, Datadog
Configurable rubric — not a fixed score

Task success

Consistency

Satisfaction

Safety

or pull transcripts over MCP into your own evals

Questions

How coverage and scoring work.

What counts as a cohort?

Any segment of the population — age band, language preference, income level, education, household type, or a behavior. Because every citizen has structured attributes, results can be sliced any way you need.

How are conversations scored?

By default we use a model-graded rubric covering dimensions like task success, consistency, satisfaction and safety. You can configure the rubric, or skip our scoring and export transcripts to your own evals.

Can I use my own evals?

Yes. Pull transcripts over MCP or the REST API into whatever grader or eval framework you already use, or stream every run as OpenTelemetry (OTLP) traces straight into your observability and eval stack — Langfuse, Phoenix, Datadog, or your own collector. Our scoring is one option, not a requirement.

How are failing cohorts identified?

Cohorts that score meaningfully below the overall are flagged so you can see where the agent quietly underperforms instead of reading it off an average.

Can I run coverage in CI?

Yes — run the population sweep from CI over the API or CLI on every change, and fail the build when a cohort drops below the bar you set.

Stop shipping for the average user.

See exactly which cohorts your agent fails — before your users find out.

See coverage by cohort