Synthetic Signals
Coverage

Find the cohort you're failing.

An aggregate score hides the people you let down. Break results down by cohort — age, language, income, life situation — and see exactly who your agent works for, and who it quietly leaves behind.

Overall: 92
Age 18–34: 94
Age 65+: 64
English: 91
Spanish-pref: 71
Low income: 78
By cohort: scores broken down, not averaged
100% of the population, not a sample
Auto-flag: failing cohorts surfaced for you
BYO evals: our rubric or your own
Segmentation

Scores by cohort, not just an average

An overall score of 92 looks fine until you split it by who you're serving. Break results down by age, language preference, income and more, and the cohorts you're failing stop hiding behind the average: in the example breakdown, Age 65+ scores 64 and Spanish-pref scores 71.

  • Every result attributable to a real segment
  • Failing cohorts flagged automatically
  • See who your agent works for — and who it doesn't
  • The same cohort lens powers interview and survey reports
The long tail

The edge cases you didn't test

Because the population is the whole city, you cover the people you'd never think to write a test for: the ESL caller, the 65-year-old with low digital literacy, the one who distrusts AI, the night-shift single parent.

  • Every language level and life situation
  • Behaviors that break brittle happy-path agents
  • Surface failures before launch, not after
Edge cases in the population
  • ESL · benefits
  • 65+ · low digital literacy
  • distrusts AI
  • code-switches
  • single parent · night shift
  • screen reader
  • rural · low bandwidth
  • skips follow-ups
  • disputes politely
  • panics under stress
Scoring

Score your way, or bring your own

Use our model-graded rubric for task success, consistency, satisfaction and safety — or skip it entirely and export every transcript into the evals you already run. Every run also streams as OpenTelemetry (OTLP) traces, so results land straight in the observability and eval stack you already have. Scoring is one lens, never a black box.

  • Model-graded rubric out of the box
  • Export transcripts (via API or MCP) into your own evals
  • Stream runs as OpenTelemetry (OTLP) — Langfuse, Phoenix, Datadog
  • Configurable rubric — not a fixed score
Task success: 95
Consistency: 92
Satisfaction: 95
Safety: 88
Or pull transcripts over MCP into your own evals.
Questions

How coverage and scoring work.

What counts as a cohort?

Any segment of the population — age band, language preference, income level, education, household type, or a behavior. Because every citizen has structured attributes, results can be sliced any way you need.
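
As a rough illustration of slicing by structured attributes, the sketch below groups per-conversation scores by a single attribute and averages each group. The record shape (`attrs`, `score`), the sample values, and the helper `scores_by_cohort` are all hypothetical, made up for this example; they are not the platform's export format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-conversation results: structured attributes plus a 0-100 score.
results = [
    {"attrs": {"age_band": "18-34", "language": "English"}, "score": 96},
    {"attrs": {"age_band": "18-34", "language": "Spanish-pref"}, "score": 80},
    {"attrs": {"age_band": "65+", "language": "English"}, "score": 70},
    {"attrs": {"age_band": "65+", "language": "Spanish-pref"}, "score": 58},
]

def scores_by_cohort(results, attribute):
    """Average score for each value of one attribute."""
    buckets = defaultdict(list)
    for record in results:
        buckets[record["attrs"][attribute]].append(record["score"])
    return {cohort: mean(scores) for cohort, scores in buckets.items()}

# The same records can be sliced along any attribute they carry.
by_age = scores_by_cohort(results, "age_band")
by_language = scores_by_cohort(results, "language")
```

Because the attribute is just a key, the same function covers age band, language preference, or any other field the records carry.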

How are conversations scored?

By default we use a model-graded rubric covering dimensions like task success, consistency, satisfaction and safety. You can configure the rubric, or skip our scoring and export transcripts to your own evals.

Can I use my own evals?

Yes. Pull transcripts over MCP or the REST API into whatever grader or eval framework you already use, or stream every run as OpenTelemetry (OTLP) traces straight into your observability and eval stack — Langfuse, Phoenix, Datadog, or your own collector. Our scoring is one option, not a requirement.

How are failing cohorts identified?

Cohorts that score meaningfully below the overall score are flagged automatically, so you see where the agent quietly underperforms instead of having it hidden inside an average.
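
One way to picture the flagging step is the sketch below, which uses the example scores from this page. The 10-point `margin` and the helper `flag_failing` are assumptions for illustration only; the platform's actual definition of "meaningfully below" is not specified here.

```python
# Scores taken from the example breakdown on this page.
overall = 92
cohort_scores = {
    "Age 18-34": 94,
    "Age 65+": 64,
    "English": 91,
    "Spanish-pref": 71,
    "Low income": 78,
}

def flag_failing(cohort_scores, overall, margin=10):
    """Return cohorts scoring more than `margin` points below overall, worst first.

    `margin` is an assumed cut-off for illustration, not the platform's rule.
    """
    failing = {c: s for c, s in cohort_scores.items() if overall - s > margin}
    return sorted(failing.items(), key=lambda item: item[1])

flagged = flag_failing(cohort_scores, overall)
```

With these numbers, Age 65+, Spanish-pref and Low income fall more than 10 points below the overall score, while English and Age 18–34 do not.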

Can I run coverage in CI?

Yes — run the population sweep from CI over the API or CLI on every change, and fail the build when a cohort drops below the bar you set.
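
A CI gate along these lines could look like the sketch below. It covers only the pass/fail decision: how the sweep is launched and how scores are fetched depends on the API or CLI, so the scores here are placeholders, and the function names and the bar of 75 are hypothetical.

```python
import sys

def cohorts_below_bar(cohort_scores, bar):
    """Return (cohort, score) pairs that fall below the bar, sorted by cohort name."""
    return [(c, s) for c, s in sorted(cohort_scores.items()) if s < bar]

def gate(cohort_scores, bar):
    """Report failing cohorts on stderr; return an exit code (0 = pass, 1 = fail)."""
    failures = cohorts_below_bar(cohort_scores, bar)
    for cohort, score in failures:
        print(f"FAIL {cohort}: {score} < {bar}", file=sys.stderr)
    return 1 if failures else 0

# Placeholder scores; in CI these would come from the sweep's results.
exit_code = gate({"Age 65+": 64, "English": 91, "Spanish-pref": 71}, bar=75)
# A real CI step would end with: sys.exit(exit_code)
```

A non-zero exit code is what most CI systems treat as a failed step, so returning 1 when any cohort is under the bar is enough to block the build.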

Stop shipping for the average user.

See exactly which cohorts your agent fails — before your users find out.