Q21 of 21 · Testing AI systems

How do you scale eval coverage without re-running every prompt against every model change?

Testing AI systems · Lead · testing-ai-systems, evaluation, scalability, eval-set, leadership, efficiency

Short answer

Maintain a tiered eval set: a small fast tier for every change, a medium tier for pre-release, and the full set for major model or architecture changes. Use change-impact tagging to run only the eval cases relevant to a given change.

Detail

Running a 10,000-example eval set on every PR is impractical: it is too slow and too expensive. Running only 20 examples, however, misses important regressions. A tiered eval strategy, combined with change-aware selection of which cases to run, resolves the tension.

Tiered evaluation:

  • PR tier (20–50 examples, under 2 minutes): core happy paths and known critical failure cases. Runs on every change. Blocks merge on failure.
  • Pre-release tier (200–500 examples, 10–20 minutes): stratified sample across all input types, edge cases, and known failure modes. Runs before any deployment. Blocks release on statistically significant regression.
  • Full audit (1,000–10,000 examples): runs when the model version, architecture, or major prompt template changes. Triggered manually or on a weekly schedule.
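
The tiers above can be sketched as a small lookup plus a release gate. This is a minimal illustration, not a definitive implementation: the trigger names (`pull_request`, `pre_release`, `model_change`, `weekly_schedule`) and the helpers `tier_for` and `regression_p_value` are hypothetical, and the one-sided two-proportion z-test is just one plausible reading of "statistically significant regression". It treats the baseline and candidate runs as independent samples; if both runs score the same examples, a paired test may be a better fit.

```python
from dataclasses import dataclass
from math import erf, sqrt
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    max_examples: int
    blocks: Optional[str]  # what a failure blocks; None where the text names no gate


# Sizes mirror the upper bounds quoted in the list above.
# Trigger names are assumptions made for this sketch.
TIERS = {
    "pull_request": Tier("pr", 50, "merge"),
    "pre_release": Tier("pre-release", 500, "release"),
    "model_change": Tier("full-audit", 10_000, None),
    "weekly_schedule": Tier("full-audit", 10_000, None),
}


def tier_for(trigger: str) -> Tier:
    """Map a pipeline trigger to the eval tier it should run."""
    return TIERS[trigger]


def regression_p_value(base_pass: int, base_n: int, new_pass: int, new_n: int) -> float:
    """One-sided two-proportion z-test: is the new pass rate lower than the baseline?

    Returns P(Z <= z); a small value suggests a real drop rather than noise.
    """
    pooled = (base_pass + new_pass) / (base_n + new_n)
    se = sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / new_n))
    if se == 0:
        return 1.0  # both runs all-pass or all-fail: no evidence of regression
    z = (new_pass / new_n - base_pass / base_n) / se
    return 0.5 * (1 + erf(z / sqrt(2)))


# Example gate for the pre-release tier: block when the drop is significant at 5%.
if __name__ == "__main__":
    p = regression_p_value(base_pass=450, base_n=500, new_pass=400, new_n=500)
    print("block release" if p < 0.05 else "ok to release")
```

Keeping the gate as a p-value rather than a fixed pass-rate threshold means a 500-example tier does not block on a one- or two-example wobble, which is the point of sizing the pre-release tier larger than the PR tier.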

Change-impact tagging: tag each eval example with the feature area it covers (retrieval, summarisation, classification). When a change only affects the retrieval layer, run only retrieval-tagged examples. This can cut eval cost by roughly 60–80% for focused changes, but it requires discipline in keeping tags current: a stale or missing tag silently drops a relevant case from the run.
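
One way to implement tag-based selection is to map changed file paths to feature-area tags and filter the eval set by those tags. The sketch below assumes a hypothetical repository layout (`src/retrieval/`, `src/summarise/`, `src/classify/`) and a hypothetical example shape (`{"id": ..., "tags": [...]}`). It also makes two conservative choices that the text does not prescribe: an unmapped path falls back to the whole set, and untagged examples always run.

```python
from fnmatch import fnmatch

# Hypothetical mapping from path patterns to feature-area tags.
PATH_TAGS = {
    "src/retrieval/*": {"retrieval"},
    "src/summarise/*": {"summarisation"},
    "src/classify/*": {"classification"},
}


def affected_tags(changed_files):
    """Return the tags touched by a change, or None if any path is unmapped."""
    tags = set()
    for path in changed_files:
        matched = [t for pattern, t in PATH_TAGS.items() if fnmatch(path, pattern)]
        if not matched:
            return None  # unknown blast radius: caller should run everything
        for tag_set in matched:
            tags |= tag_set
    return tags


def select_examples(examples, changed_files):
    """Keep examples whose tags overlap the change; untagged examples always run."""
    tags = affected_tags(changed_files)
    if tags is None:
        return list(examples)
    return [ex for ex in examples if not ex["tags"] or tags & set(ex["tags"])]


if __name__ == "__main__":
    examples = [
        {"id": "a", "tags": ["retrieval"]},
        {"id": "b", "tags": ["summarisation"]},
        {"id": "c", "tags": []},
    ]
    picked = select_examples(examples, ["src/retrieval/index.py"])
    print([ex["id"] for ex in picked])  # ['a', 'c']
```

Failing open (running more rather than less when the mapping is incomplete) trades some cost for safety, which limits the damage when tags drift out of date.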

Coverage accounting: track which input areas the eval set covers, just as you would track code coverage for deterministic software, and actively add examples for the areas that are uncovered or thinly covered. See Eval platforms and tooling.
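
A rough way to make coverage accounting concrete is to count eval examples per input area and report the areas that fall below a floor. The helper `coverage_gaps`, the `min_per_area` threshold, and the example shape are assumptions made for illustration. Counting examples per tag is only a proxy for coverage, much as line coverage is only a proxy for test quality.

```python
from collections import Counter


def coverage_gaps(examples, required_areas, min_per_area=5):
    """Return {area: current_count} for every area below the minimum."""
    counts = Counter(tag for ex in examples for tag in ex["tags"])
    return {
        area: counts[area]  # Counter yields 0 for areas with no examples
        for area in required_areas
        if counts[area] < min_per_area
    }


if __name__ == "__main__":
    examples = [
        {"id": 1, "tags": ["retrieval"]},
        {"id": 2, "tags": ["retrieval"]},
        {"id": 3, "tags": ["summarisation"]},
    ]
    areas = ["retrieval", "summarisation", "classification"]
    print(coverage_gaps(examples, areas, min_per_area=2))
    # {'summarisation': 1, 'classification': 0}
```

The output doubles as a backlog: each listed area is a place where new eval examples should be written next.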

// WHAT INTERVIEWERS LOOK FOR

Three-tier model with different triggers. Change-impact tagging to run affected examples only. Coverage accounting for the eval set. Speed/confidence trade-off at each tier.