
How do you build a golden eval set for an LLM feature and use it to detect regressions?

Testing AI systems · Mid · testing-ai-systems, eval-set, regression, golden-set, benchmark, llm

Short answer

Curate 50–200 representative cases, each pairing an input with its quality expectations, spanning happy paths, edge cases, and known failure modes. Baseline the current system on this set. After any model, prompt, or retrieval change, re-run it and flag the run if the aggregate quality score drops by a statistically significant margin.

Detail

A golden eval set is to LLM testing what a regression test suite is to deterministic software — a fixed set of cases that should continue to pass as the system evolves.

Building one:

  1. Collect inputs: sample real production queries stratified by type (short/long, domain, language). Include near-misses where the model has historically struggled — don't just include easy examples.
  2. Define expected quality: for each input, record which requirements the output must meet: factually accurate, in a required format, within length constraints, grounded. Store human-rated scores for a sample of cases as calibration.
  3. Baseline the current model: run the set and record the aggregate quality score for each dimension.
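
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `GoldenCase` fields, the `stratum` labels, and the `score_fn` callback (standing in for whatever grader or rubric you use) are all hypothetical names introduced for the example.

```python
import random
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class GoldenCase:
    case_id: str
    stratum: str          # hypothetical label, e.g. "short/en/billing"
    input_text: str
    checks: list[str]     # quality dimensions that apply, e.g. ["accuracy", "format"]
    human_scores: dict[str, float] = field(default_factory=dict)  # optional calibration ratings


def stratified_sample(queries, per_stratum, seed=0):
    """Take up to `per_stratum` queries from each stratum so rare types are not drowned out."""
    rng = random.Random(seed)  # fixed seed keeps the golden set reproducible
    by_stratum = defaultdict(list)
    for query in queries:
        by_stratum[query["stratum"]].append(query)
    sample = []
    for stratum in sorted(by_stratum):
        bucket = by_stratum[stratum]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample


def baseline(cases, score_fn):
    """Score every case and return the mean score per quality dimension.

    `score_fn(case)` is assumed to return {dimension: score in [0, 1]}.
    """
    per_dimension = defaultdict(list)
    for case in cases:
        scores = score_fn(case)
        for dimension in case.checks:
            per_dimension[dimension].append(scores[dimension])
    return {dimension: mean(values) for dimension, values in per_dimension.items()}
```

Keeping the per-case scores as well as the aggregates is usually worthwhile, since paired significance tests need the case-by-case results from both runs.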

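Regression detection: after any change to the system (new model version, prompt change, retrieval change), re-run the eval set. Compare against the baseline with a significance test rather than raw percentages: for binary pass/fail labels, a binomial test on the cases whose label flipped between runs (McNemar's test); for continuous scores, a paired t-test on the per-case differences. A 2% drop on 50 examples is a single case and is most likely noise; on 1,000 examples it is 20 cases and is far more likely to be signal.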
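
A rough sketch of both comparisons, using only the standard library. Two assumptions to flag: the binary test is written as an exact McNemar test (a binomial test on the cases that flipped between runs), and the continuous test is a sign-flip permutation test, swapped in here for the paired t-test so the example needs no statistics package. With SciPy available, `scipy.stats.ttest_rel` would be the direct paired t-test. The function names are illustrative.

```python
import math
import random


def mcnemar_exact(baseline_pass, candidate_pass):
    """Two-sided exact McNemar test on paired pass/fail labels.

    Only cases that flipped between runs carry information. Under the null
    hypothesis a flip is equally likely in either direction, so the number
    of regressions follows Binomial(n_flipped, 0.5).
    """
    pairs = list(zip(baseline_pass, candidate_pass))
    regressions = sum(1 for b, c in pairs if b and not c)
    improvements = sum(1 for b, c in pairs if c and not b)
    n = regressions + improvements
    if n == 0:
        return 1.0
    k = min(regressions, improvements)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_permutation_test(baseline_scores, candidate_scores, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-case score differences."""
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each difference is as likely to be positive as negative.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)
```

The paired view matters because the same net drop can come from very different flip patterns: 20 regressions with no improvements is strong evidence of a real change, whereas 60 regressions offset by 40 improvements gives the same 2% net drop on 1,000 cases with much weaker evidence.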

See also: Evaluation methods; Eval platforms and tooling.

// WHAT INTERVIEWERS LOOK FOR

Curated breadth (happy path + edge cases + known failures). Statistical significance testing rather than raw percentage comparison. Baseline before any change. Stratified sampling from real production queries.