Eval Set
// Definition
A curated collection of input/expected-output pairs used to measure an LLM system's quality on each change — the AI equivalent of a regression suite. Because model output is non-deterministic, you score the system against the whole set (pass rate, not a single exact match), which turns "did the prompt change help?" into a measurable answer instead of a vibe.
// Why it matters
Without an eval set, prompt and model changes are tested by spot-checking a few examples — which misses regressions and invites cherry-picking. An eval set makes LLM quality a number you can track across versions, the single biggest shift from "prompt tinkering" to engineering.
// How to test
// Run the system over the whole eval set; assert an aggregate pass rate,
// not exact strings (output is non-deterministic).
const results = await Promise.all(evalSet.map(async ({ input, expected }) => {
const output = await runSystem(input)
return judge(output, expected) // LLM-as-judge or rule-based scorer → boolean
}))
const passRate = results.filter(Boolean).length / results.length
expect(passRate).to.be.gte(0.9) // quality gate, e.g. 90%// Common mistakes
- Too few examples to be statistically meaningful (a 5-item set proves nothing)
- Asserting exact-match output against a probabilistic model (flaky by design)
- Letting the eval set go stale so it stops reflecting real usage
// Related terms
Golden dataset
A curated set of inputs paired with known-correct outputs, used to evaluate an AI system's performance over time. For an LLM-backed product, a golden dataset might be 100 representative user questions plus the ideal answer for each. You run the system against the dataset on every release and compare current output to the gold answer — either with exact match, similarity scoring, or LLM-as-judge. Without a golden dataset you have vibes, not evaluation. Building and maintaining one is foundational QA work for AI products.
Evaluation Dataset
A curated set of input-output pairs used to measure an LLM application's correctness, safety, or consistency. Analogous to a regression test suite for traditional software. A well-maintained eval dataset covers the golden path (expected correct outputs), known edge cases, common failure modes (refusals, hallucinations, tone violations), and adversarial inputs. Datasets degrade over time as model behaviour changes; maintaining them is an ongoing engineering task, not a one-time setup. Often called an eval set or golden dataset.
LLM-as-judge
An evaluation pattern where one language model grades another model's output. The judge model is given the input, the output to evaluate, and a rubric — and returns a score or pass/fail verdict. Useful for evaluating qualities that are hard to test deterministically: tone, factual accuracy, helpfulness, refusal of unsafe requests. The catch is that judges are themselves LLMs with their own biases and failure modes — they need to be calibrated against human raters before you trust them at scale. Good for triage and trend-spotting; not a replacement for human eval on critical paths.
RAG Evaluation
Measuring a Retrieval-Augmented Generation system on two axes that a plain answer-check misses: retrieval quality (did it fetch the right context?) and faithfulness (is the answer grounded in that context, or hallucinated despite it?). A RAG system can retrieve perfectly and still hallucinate, or answer correctly from the wrong source — so both must be scored separately.