How do you build a golden eval set for an LLM feature and use it to detect regressions?

Question

Accepted Answer

Curate 50–200 representative inputs, each paired with its quality expectations, spanning happy paths, edge cases, and known failure modes. Baseline the current model on this set, then flag any run where the aggregate quality score drops by a statistically significant margin after a model, prompt, or retrieval change.

A golden eval set is to LLM testing what a regression test suite is to deterministic software: a fixed set of cases that should continue to pass as the system evolves.

Building one:

Collect inputs: sample real production queries, stratified by type (short/long, domain, language). Include near-misses where the model has historically struggled; don't include only easy examples.

Define expected quality: for each input, record which requirements the output must meet: factual accuracy, a required format, length constraints, groundedness. Store human-rated scores for a sample of cases as calibration.

Baseline the current model: run the set and record the aggregate quality score per dimension.

Regression detection: after any model change (
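The curate, baseline, and compare workflow described above could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `GoldenCase`, `score_case`, `run_suite`, and `detect_regressions` are hypothetical, the substring and length checks stand in for whatever graders (human ratings, rubric-based scoring) a real feature would use, and the paired sign-flip permutation test with `alpha=0.05` is just one reasonable way to decide that a drop is statistically significant. The answer itself does not name a specific test.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GoldenCase:
    """One golden-set entry: an input plus its quality expectations (hypothetical schema)."""
    case_id: str
    prompt: str
    category: str        # e.g. "happy_path", "edge_case", "known_failure"
    must_contain: tuple  # crude accuracy proxy: substrings the output must include
    max_chars: int       # length constraint


def score_case(case, output):
    """Return a score in [0, 1] per quality dimension for a single output."""
    text = output.lower()
    accurate = all(s.lower() in text for s in case.must_contain)
    return {
        "accuracy": 1.0 if accurate else 0.0,
        "length": 1.0 if len(output) <= case.max_chars else 0.0,
    }


def run_suite(cases, generate):
    """Run `generate(prompt) -> str` over every case; returns {case_id: {dimension: score}}."""
    return {c.case_id: score_case(c, generate(c.prompt)) for c in cases}


def paired_permutation_pvalue(baseline, candidate, n_resamples=10_000, seed=0):
    """One-sided p-value for 'candidate scores lower than baseline' on paired scores.

    Randomly flips the sign of each per-case difference and counts how often the
    resampled mean difference is at least as negative as the observed one.
    """
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = mean(diffs)
    rng = random.Random(seed)  # fixed seed so repeated runs give the same p-value
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(flipped) <= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


def detect_regressions(baseline_run, candidate_run, alpha=0.05):
    """Return {dimension: (mean_delta, p_value)} for dimensions that dropped significantly."""
    ids = sorted(baseline_run)
    flagged = {}
    for dim in list(baseline_run[ids[0]]):
        b = [baseline_run[i][dim] for i in ids]
        c = [candidate_run[i][dim] for i in ids]
        delta = mean(c) - mean(b)
        p_value = paired_permutation_pvalue(b, c)
        if delta < 0 and p_value < alpha:
            flagged[dim] = (delta, p_value)
    return flagged
```

In use, `run_suite` would be called once against the current system to store the baseline, then again after each model, prompt, or retrieval change, with `detect_regressions(baseline, candidate)` gating the release. Comparing the same cases pairwise, rather than comparing two independent averages, tends to make small golden sets more sensitive to real drops.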

// WHAT INTERVIEWERS LOOK FOR