Eval Set

AI & LLM Testingintermediateaka Evaluation Set

// Definition

A curated collection of input/expected-output pairs used to measure an LLM system's quality on each change — the AI equivalent of a regression suite. Because model output is non-deterministic, you score the system against the whole set (pass rate, not a single exact match), which turns "did the prompt change help?" into a measurable answer instead of a vibe.

// Why it matters

Without an eval set, prompt and model changes are tested by spot-checking a few examples — which misses regressions and invites cherry-picking. An eval set makes LLM quality a number you can track across versions, the single biggest shift from "prompt tinkering" to engineering.

// How to test

// Run the system over the whole eval set; assert an aggregate pass rate,
// not exact strings (output is non-deterministic).
const results = await Promise.all(evalSet.map(async ({ input, expected }) => {
  const output = await runSystem(input)
  return judge(output, expected) // LLM-as-judge or rule-based scorer → boolean
}))
const passRate = results.filter(Boolean).length / results.length
expect(passRate).to.be.gte(0.9) // quality gate, e.g. 90%

// Common mistakes

  • Too few examples to be statistically meaningful (a 5-item set proves nothing)
  • Asserting exact-match output against a probabilistic model (flaky by design)
  • Letting the eval set go stale so it stops reflecting real usage

// Related terms