Q14 of 21 · Testing AI systems
What is statistically significant regression in an eval context and how do you detect it?
Short answer
A statistically significant regression is a drop in quality score that is unlikely to be explained by sampling noise alone. Use a two-proportion z-test (for pass/fail labels) or a paired t-test (for numeric scores) to check whether the difference between two runs is larger than chance would produce, conventionally at p < 0.05.
Detail
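Without significance testing, eval scores produce both false alarms and false confidence. A drop from 87% to 85% on 50 examples may well be noise: at that size each example is worth two points, so one or two borderline cases scored differently by the judge can account for it. The same drop on 2,000 examples is 40 examples and is much harder to attribute to noise, although with two independent runs it is still only borderline (two-sided p ≈ 0.07 under a two-proportion z-test), so run the test rather than judging the numbers by eye.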
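To make the noise argument concrete, here is a minimal sketch of the standard error of a pass rate under simple binomial sampling. The function name `pass_rate_standard_error` is illustrative and not part of any eval framework, and the calculation assumes examples are scored independently.

```python
import math


def pass_rate_standard_error(pass_rate: float, n_examples: int) -> float:
    """Standard error of an observed pass rate, assuming independent examples."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_examples)


# Compare the expected run-to-run wobble at the two eval sizes discussed above.
for n in (50, 2000):
    se = pass_rate_standard_error(0.87, n)
    print(f"n={n}: standard error is about {se * 100:.1f} percentage points")
```

At 50 examples the standard error is close to 5 points, so a 2-point drop sits well inside the noise; at 2,000 examples it is under 1 point.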
Two-proportion z-test: for binary quality labels (pass/fail per example), compare the pass rate of the baseline run with that of the current run. If p < 0.05, treat the drop as statistically significant. When both runs score the same examples, McNemar's test on the examples whose label changed is the paired equivalent and is usually more sensitive.
Paired t-test: for numeric quality scores (for example 1–5 per example), run a paired t-test on the per-example score differences between the baseline and the current run. This requires that both runs score the same examples.
Effect size: a statistically significant drop of 0.1 points on a 5-point scale may not be practically meaningful. Report both the p-value and the effect size (Cohen's d) so the team can make an informed decision.
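The tests and the effect size described above can be sketched with the Python standard library alone. The function names are hypothetical, and the paired test's p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for a few hundred examples or more; `scipy.stats.ttest_rel` gives the exact value if SciPy is available. Cohen's d here is the paired variant (mean difference divided by the standard deviation of the differences).

```python
import math
from statistics import NormalDist, mean, stdev


def two_proportion_z_test(passes_base, n_base, passes_new, n_new):
    """Two-sided z-test for a difference between two independent pass rates."""
    p_base = passes_base / n_base
    p_new = passes_new / n_new
    pooled = (passes_base + passes_new) / (n_base + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_new))
    if se == 0:
        return 0.0, 1.0  # all pass or all fail in both runs: no evidence of change
    z = (p_new - p_base) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def paired_score_test(base_scores, new_scores):
    """Paired t statistic, approximate two-sided p-value, and Cohen's d."""
    if len(base_scores) != len(new_scores):
        raise ValueError("both runs must score the same examples")
    diffs = [new - base for base, new in zip(base_scores, new_scores)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)
    if sd_diff == 0:
        raise ValueError("all differences are identical; the t statistic is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    # Assumption: normal approximation to the t distribution (large n only).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    cohens_d = mean_diff / sd_diff
    return t_stat, p_value, cohens_d


# Example: 87% vs 85% pass rate on 2,000 examples per run.
z, p = two_proportion_z_test(1740, 2000, 1700, 2000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A negative z or t means the current run scored lower than the baseline. Reporting `cohens_d` next to the p-value keeps a tiny but significant drop from being mistaken for an important one.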
For small eval sets (under 100 examples), statistical power is low, so real regressions of a few points will often go undetected. Grow the eval set to at least 200 examples before relying on significance testing. See Evaluation methods.
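As a rough guide to how eval-set size relates to power, the sketch below uses the standard normal-approximation sample-size formula for comparing two independent proportions. The function name `examples_needed` and the defaults (two-sided alpha of 0.05, 80% power) are assumptions chosen for illustration; a paired design on the same examples typically needs fewer.

```python
import math
from statistics import NormalDist


def examples_needed(p_base, p_new, alpha=0.05, power=0.80):
    """Approximate examples per run needed to detect p_base -> p_new."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_new) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    ) ** 2
    return math.ceil(numerator / (p_base - p_new) ** 2)


# How many examples per run to reliably catch a 10-point vs a 2-point drop?
for new_rate in (0.77, 0.85):
    print(f"0.87 -> {new_rate}: about {examples_needed(0.87, new_rate)} examples per run")
```

Under these assumptions, a set of a couple of hundred examples is enough to catch drops of around ten points, while reliably catching a two-point drop takes several thousand.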