What is a statistically significant regression in an eval context, and how do you detect it?

Question

Accepted Answer

A statistically significant regression is a drop in quality score that is unlikely to be explained by sampling noise. To detect it, run a significance test (a two-proportion z-test for pass/fail labels, a paired t-test for continuous scores) and check whether the difference between two runs exceeds what chance would produce, conventionally at p < 0.05.

Without significance testing, eval scores produce both false alarms and false confidence. A drop from 87% to 85% on 50 examples might be noise: at that sample size a 2-point gap amounts to a single borderline case scored differently by the judge. The same drop on 2,000 examples is much harder to attribute to noise, although an unpaired two-sided test still puts it only near the threshold (p ≈ 0.07), so sample size alone does not settle the question and the test has to be run.

Two-proportion z-test: for binary quality labels (pass/fail per example), compare the pass rate of the baseline run with that of the current run. If p < 0.05, the drop is significant. When both runs score the same examples, a paired test such as McNemar's test matches the data more closely.

Paired t-test: for continuous quality scores (1–5 per example), run a paired t-test on the per-example score differences between the baseline and current runs.

Effect size matters too: a statistically significant drop of 0.1 points on a 5-point scale may not be practically meaningful. Rep
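The two tests described in this answer can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: the function names and the sample counts are hypothetical, the z-test treats the two runs as independent samples, and the p-value for the paired test uses a normal approximation to the t distribution, which is only reasonable for large n. For an exact paired t-test, a library routine such as `scipy.stats.ttest_rel` is the usual choice.

```python
import math
from statistics import NormalDist, mean, stdev


def two_proportion_z_test(pass_base, n_base, pass_cur, n_cur):
    """Two-sided two-proportion z-test on pass rates (unpaired, pooled SE)."""
    p_base = pass_base / n_base
    p_cur = pass_cur / n_cur
    pooled = (pass_base + pass_cur) / (n_base + n_cur)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_cur))
    if se == 0:
        raise ValueError("pass rates are all 0 or all 1; the test is undefined")
    z = (p_cur - p_base) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def paired_score_test(base_scores, cur_scores):
    """Paired t statistic on per-example differences, plus an effect size.

    Returns (mean_diff, t_stat, p_approx, cohens_dz). p_approx is a normal
    approximation to the t-distribution p-value (assumption: large n).
    """
    if len(base_scores) != len(cur_scores):
        raise ValueError("runs must score the same examples in the same order")
    diffs = [c - b for b, c in zip(base_scores, cur_scores)]
    sd_diff = stdev(diffs)
    if sd_diff == 0:
        raise ValueError("all differences are identical; the test is undefined")
    mean_diff = mean(diffs)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    cohens_dz = mean_diff / sd_diff  # standardized size of the mean change
    return mean_diff, t_stat, p_approx, cohens_dz


if __name__ == "__main__":
    # Hypothetical counts: the same 87% -> 85% drop at two sample sizes.
    print(two_proportion_z_test(87, 100, 85, 100))
    print(two_proportion_z_test(1740, 2000, 1700, 2000))

    # Hypothetical 1-5 judge scores for the same six examples in both runs.
    baseline = [4, 5, 3, 4, 5, 4]
    current = [3, 5, 3, 3, 4, 4]
    print(paired_score_test(baseline, current))
```

Reporting `cohens_dz` or the raw mean difference next to the p-value keeps the practical size of the change visible, so a tiny but significant shift is not mistaken for a meaningful regression.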

// WHAT INTERVIEWERS LOOK FOR