Evaluation Dataset

AI & LLM Testing

// Definition

A curated set of input-output pairs used to measure an LLM application's correctness, safety, or consistency. Analogous to a regression test suite for traditional software. A well-maintained eval dataset covers the golden path (expected correct outputs), known edge cases, common failure modes (refusals, hallucinations, tone violations), and adversarial inputs. Datasets degrade over time as model behaviour changes; maintaining them is an ongoing engineering task, not a one-time setup. Often called an eval set or golden dataset.

// Related terms

Large Language Model (LLM)
A neural network trained on massive text datasets to predict the next word in a sequence. Modern LLMs like Claude, GPT-4, and Gemini can answer questions, write code, summarise documents, and follow multi-step instructions — but they don't 'know' anything, they predict plausible continuations from patterns in training data. This is why they sometimes produce confident-sounding falsehoods (hallucinations) and why prompt design matters so much. In QA, LLMs are useful for generating test scaffolding, summarising bug reports, and drafting documentation — but their output always needs human review before it ships.
Hallucination
When an AI model generates output that is fluent, confident, and completely wrong. In QA work this often looks like an LLM inventing a method that doesn't exist on a real API, citing a documentation page that was never written, or producing a test assertion that doesn't actually verify the behaviour described in the prompt. Hallucinations aren't a bug — they're a consequence of how language models work, predicting likely text rather than retrieving facts. The mitigations are: ground the model in real context (paste the actual API spec, not its name), verify generated code by running it, and treat any AI-produced reference (URLs, function names, citations) as untrusted until checked.
Regression Test
A test that verifies previously fixed bugs haven't returned and existing features still work after new changes. Forms the safety net for refactoring and feature work.
Safety Testing (LLM)
Verifying that an LLM application refuses to generate harmful, illegal, or policy-violating content and resists adversarial attempts to elicit such content. Distinct from functional testing (does the feature work?) and performance testing. Covers: jailbreaking attempts, prompt injection payloads, outputs that violate content policies (PII leakage, instructions for illegal activity), and over-refusal (the model refusing legitimate requests to the point of being useless). A safety eval suite should run on every model upgrade and before production release.