How do you scale eval coverage without re-running every prompt against every model change?

Question

Accepted Answer

Maintain a tiered eval set: a small fast tier for every change, a medium tier for pre-release, and the full set for major model or architecture changes. Use change-impact tagging to run only the eval cases relevant to a given change.

Running a 10,000-example eval set on every PR is impractical: it is too slow and too expensive. But running only 20 examples misses important regressions. A tiered eval strategy with intelligent sampling solves this.

Tiered evaluation:

PR tier (20–50 examples, under 2 minutes): core happy paths and known critical failure cases. Runs on every change. Blocks merge on failure.

Pre-release tier (200–500 examples, 10–20 minutes): stratified sample across all input types, edge cases, and known failure modes. Runs before any deployment. Blocks release on statistically significant regression.

Full audit (1,000–10,000 examples): runs when the model version, architecture, or major prompt template changes. Triggered manually or on a weekly schedule.

Change-impact tagging:
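One way tier selection and change-impact tagging could fit together is sketched below. This is a minimal illustration, not a prescribed implementation: the tier names, the tag vocabulary, the `EvalCase` type, and the `select_cases` helper are all hypothetical, and always keeping PR-tier cases regardless of tags is an assumption made here as a safety net.

```python
from dataclasses import dataclass, field

# Hypothetical tier ordering: a higher tier also runs every lower-tier case.
TIER_ORDER = {"pr": 0, "pre_release": 1, "full": 2}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    tier: str  # lowest tier in which this case runs
    tags: frozenset = field(default_factory=frozenset)  # components the case exercises


def select_cases(cases, tier, changed_tags=None):
    """Return the cases to run for a tier, optionally narrowed by change tags."""
    level = TIER_ORDER[tier]
    in_tier = [c for c in cases if TIER_ORDER[c.tier] <= level]
    if changed_tags is None:
        return in_tier
    changed = frozenset(changed_tags)
    # Assumption: PR-tier cases always run; other cases run only if a tag overlaps.
    return [c for c in in_tier if c.tier == "pr" or c.tags & changed]


# Illustrative cases only; the IDs and tags are made up for this sketch.
CASES = [
    EvalCase("greeting-basic", "pr", frozenset({"system_prompt"})),
    EvalCase("refund-policy", "pre_release", frozenset({"retrieval"})),
    EvalCase("json-output", "pre_release", frozenset({"output_format"})),
    EvalCase("long-context", "full", frozenset({"retrieval", "model"})),
]

selected = select_cases(CASES, "pre_release", changed_tags={"retrieval"})
print([c.case_id for c in selected])  # ['greeting-basic', 'refund-policy']
```

In this sketch a change tagged `retrieval` at the pre-release tier skips `json-output`, which exercises only output formatting, while the PR-tier case still runs. Calling `select_cases(CASES, "full")` with no tags returns every case, which matches the full audit described above.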

// WHAT INTERVIEWERS LOOK FOR