Golden dataset

AI & LLM Testing

// Definition

A curated set of inputs paired with known-correct outputs, used to evaluate an AI system's performance over time. For an LLM-backed product, a golden dataset might be 100 representative user questions, each paired with its ideal answer. You run the system against the dataset on every release and compare the current output to the gold answer using exact match, similarity scoring, or LLM-as-judge. Without a golden dataset you have vibes, not evaluation. Building and maintaining one is foundational QA work for AI products.
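
As a rough illustration of the run-and-compare loop described above, here is a minimal Python sketch. Every name in it (`GoldenExample`, `evaluate`, `fake_system`, the 0.8 threshold) is hypothetical and chosen for the example. The similarity score uses the standard library's `difflib.SequenceMatcher` as a simple stand-in for whatever metric a real project would pick, and LLM-as-judge is left out because it needs a model call.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


@dataclass
class GoldenExample:
    """One entry in the golden dataset: an input and its known-correct output."""
    question: str
    gold_answer: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not count as failures.
    return " ".join(text.lower().split())


def exact_match(output: str, gold: str) -> bool:
    return normalize(output) == normalize(gold)


def similarity(output: str, gold: str) -> float:
    # Character-level similarity in [0.0, 1.0]; an assumption for this sketch,
    # not a recommendation over embedding-based or judge-based scoring.
    return SequenceMatcher(None, normalize(output), normalize(gold)).ratio()


def evaluate(
    system: Callable[[str], str],
    dataset: list[GoldenExample],
    threshold: float = 0.8,  # hypothetical pass mark for the similarity score
) -> dict:
    """Run the system over every golden example and summarize the results."""
    if not dataset:
        raise ValueError("golden dataset is empty")

    exact = 0
    passed = 0
    failures = []
    for example in dataset:
        output = system(example.question)
        if exact_match(output, example.gold_answer):
            exact += 1
        if similarity(output, example.gold_answer) >= threshold:
            passed += 1
        else:
            failures.append(example.question)

    total = len(dataset)
    return {
        "exact_match_rate": exact / total,
        "pass_rate": passed / total,
        "failures": failures,
    }


# A tiny illustrative dataset and a canned stand-in for the real system.
GOLDEN = [
    GoldenExample("What is the capital of France?", "Paris"),
    GoldenExample("How many days are in a week?", "7"),
]


def fake_system(question: str) -> str:
    canned = {
        "What is the capital of France?": "paris",
        "How many days are in a week?": "There are seven days.",
    }
    return canned.get(question, "")


report = evaluate(fake_system, GOLDEN)
print(report)
```

In this sketch the second answer is arguably correct but fails both checks, because "There are seven days." shares no characters with the gold answer "7". That gap is the usual reason teams move from string comparison toward semantic similarity or LLM-as-judge. Saving the report from each release and comparing it with the previous one is what turns a single score into a measure of performance over time.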

// Related terms