Q21 of 21 · Testing AI systems
How do you scale eval coverage without re-running every prompt against every model change?
Short answer
Maintain a tiered eval set: a small, fast tier for every change, a medium tier for pre-release, and the full set for major model or architecture changes. Use change-impact tagging to run only the eval cases relevant to a given change.
Detail
Running a 10,000-example eval set on every PR is impractical: it is too slow and too expensive. Running only 20 examples, on the other hand, misses important regressions. Tiering the eval set, and sampling each tier deliberately, resolves this trade-off.
Tiered evaluation:
- PR tier (20–50 examples, under 2 minutes): core happy paths and known critical failure cases. Runs on every change. Blocks merge on failure.
- Pre-release tier (200–500 examples, 10–20 minutes): stratified sample across all input types, edge cases, and known failure modes. Runs before any deployment. Blocks release on statistically significant regression.
- Full audit (1,000–10,000 examples): runs when the model version, architecture, or major prompt template changes. Triggered manually or on a weekly schedule.
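The pre-release gate depends on what counts as a "statistically significant regression". One way to make that concrete is a one-sided two-proportion z-test on pass rates, sketched below. The function names, the pass/fail scoring, and the 0.05 threshold are illustrative assumptions, not something the tiers above prescribe.

```python
import math


def regression_p_value(base_pass: int, base_n: int,
                       cand_pass: int, cand_n: int) -> float:
    """One-sided p-value for 'candidate pass rate is lower than baseline'.

    Uses a pooled two-proportion z-test (normal approximation), which is
    reasonable at pre-release tier sizes of a few hundred examples.
    """
    if base_n <= 0 or cand_n <= 0:
        raise ValueError("both runs need at least one example")
    p_base = base_pass / base_n
    p_cand = cand_pass / cand_n
    pooled = (base_pass + cand_pass) / (base_n + cand_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cand_n))
    if se == 0:
        # Both runs passed everything or failed everything: no evidence of change.
        return 1.0
    z = (p_base - p_cand) / se
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def blocks_release(base_pass: int, base_n: int,
                   cand_pass: int, cand_n: int,
                   alpha: float = 0.05) -> bool:
    """Assumed gate: block when the drop in pass rate is significant at alpha."""
    return regression_p_value(base_pass, base_n, cand_pass, cand_n) < alpha


if __name__ == "__main__":
    # 90% -> 80% on 500 examples: large enough to block.
    print(blocks_release(450, 500, 400, 500))
    # 90% -> 89% on 500 examples: within noise at this sample size.
    print(blocks_release(450, 500, 445, 500))
```

Because both runs usually score the same examples, a paired test such as McNemar's would typically detect smaller regressions. The unpaired version is shown only because it needs nothing beyond aggregate pass counts.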
Change-impact tagging: tag each eval example with the feature area it covers (for example retrieval, summarisation, or classification). When a change affects only the retrieval layer, run only the retrieval-tagged examples. This can reduce eval cost by 60–80% for focused changes, but it requires discipline in keeping tags current: a stale or missing tag silently skips the cases that would have caught the regression.
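A minimal sketch of tag-based selection follows. The `EvalCase` type, the `PATH_TO_TAGS` mapping, and the `src/...` path prefixes are hypothetical and exist only for illustration. The sketch also assumes that a change touching any unmapped path should fall back to the full set rather than skip cases.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    tags: frozenset[str]


# Hypothetical mapping from source path prefixes to feature-area tags.
PATH_TO_TAGS: dict[str, set[str]] = {
    "src/retrieval/": {"retrieval"},
    "src/summarise/": {"summarisation"},
    "src/classify/": {"classification"},
}


def impacted_tags(changed_paths: list[str]) -> set[str] | None:
    """Return the tags touched by a change, or None if impact is unknown."""
    tags: set[str] = set()
    for path in changed_paths:
        matched = False
        for prefix, area_tags in PATH_TO_TAGS.items():
            if path.startswith(prefix):
                tags |= area_tags
                matched = True
        if not matched:
            # Unmapped file: impact cannot be bounded, so do not filter.
            return None
    return tags


def select_cases(cases: list[EvalCase], changed_paths: list[str]) -> list[EvalCase]:
    """Keep only cases sharing a tag with the change; run all if impact is unknown."""
    tags = impacted_tags(changed_paths)
    if tags is None:
        return list(cases)
    return [case for case in cases if case.tags & tags]


if __name__ == "__main__":
    suite = [
        EvalCase("r1", frozenset({"retrieval"})),
        EvalCase("s1", frozenset({"summarisation"})),
        EvalCase("rs1", frozenset({"retrieval", "summarisation"})),
    ]
    picked = select_cases(suite, ["src/retrieval/index.py"])
    print([case.case_id for case in picked])
```

Falling back to the full set for unmapped paths trades some cost for safety: a new directory that nobody has tagged yet triggers more evals, not fewer.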
Coverage accounting: track which input areas the eval set covers, just as you would track code coverage for deterministic software, and actively add examples for the areas that remain uncovered. See Eval platforms and tooling.
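Coverage accounting can start as a simple count per input area. The sketch below assumes that both eval examples and a sample of real inputs carry an area label, and it treats an area with fewer than `min_cases` examples as under-covered. The function name, the threshold, and the use of a traffic sample are all illustrative assumptions.

```python
from collections import Counter


def coverage_gaps(eval_areas: list[str],
                  traffic_areas: list[str],
                  min_cases: int = 5) -> list[tuple[str, float, int]]:
    """List under-covered areas as (area, share_of_traffic, eval_case_count).

    Areas are ordered by how often they appear in the traffic sample,
    so the most impactful gaps come first.
    """
    eval_counts = Counter(eval_areas)
    traffic_counts = Counter(traffic_areas)
    total = sum(traffic_counts.values())
    gaps = []
    for area, seen in traffic_counts.most_common():
        have = eval_counts.get(area, 0)
        if have < min_cases:
            gaps.append((area, seen / total, have))
    return gaps


if __name__ == "__main__":
    eval_areas = ["retrieval"] * 6 + ["summarisation"] * 2
    traffic_areas = (["retrieval"] * 50
                     + ["summarisation"] * 30
                     + ["translation"] * 20)
    for area, share, have in coverage_gaps(eval_areas, traffic_areas):
        print(f"{area}: {share:.0%} of sampled inputs, {have} eval cases")
```

Sorting by traffic share gives a rough priority order for which areas to write new eval examples for first.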