Evaluation Dataset

AI & LLM Testing

// Definition

A curated set of input-output pairs used to measure an LLM application's correctness, safety, or consistency. Analogous to a regression test suite for traditional software. A well-maintained eval dataset covers the golden path (expected correct outputs), known edge cases, common failure modes (refusals, hallucinations, tone violations), and adversarial inputs. Datasets degrade over time as model behaviour changes; maintaining them is an ongoing engineering task, not a one-time setup. Often called an eval set or golden dataset.

// Related terms