Prompt regression

AI & LLM Testing

// Definition

A silent degradation in the quality of outputs your product depends on, caused by a prompt change or by a model update underneath an unchanged prompt. Prompt regressions are particularly nasty because they don't throw errors and typically don't fail integration tests; the system keeps responding, just worse. The defence is a regression eval suite: a versioned set of test inputs with known-good outputs, run on every prompt change and every model upgrade, with scores tracked over time. Without this, a model provider's quiet behind-the-scenes update can degrade your product's quality and you won't notice until a user complains.
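As a rough illustration of "scores tracked over time", the check below compares per-case scores from a current run against a stored baseline and flags any case that dropped by more than a tolerance. This is a minimal sketch, not a reference implementation: the function name `detect_regressions`, the 0.05 tolerance, and the example scores are all assumptions made up for illustration.

```python
from typing import Dict, List


def detect_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,  # assumed threshold; tune per product
) -> List[str]:
    """Return ids of test cases whose score dropped by more than `tolerance`.

    `baseline` and `current` map a test-case id to a score in [0, 1].
    Cases missing from `current` are reported too, since a silently
    dropped case can hide a regression just as well as a low score.
    """
    regressed = []
    for case_id, old_score in baseline.items():
        new_score = current.get(case_id)
        if new_score is None or old_score - new_score > tolerance:
            regressed.append(case_id)
    return sorted(regressed)


# Hypothetical scores from two runs of the same suite.
baseline_scores = {"refund-policy": 0.92, "reset-password": 0.88, "tone-check": 0.75}
current_scores = {"refund-policy": 0.91, "reset-password": 0.60, "tone-check": 0.74}

regressions = detect_regressions(baseline_scores, current_scores)
print(regressions)
```

In CI, a non-empty result would typically fail the build or block the prompt change until someone reviews the flagged cases.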

// Related terms

Eval harness
Software that runs an LLM-backed system against a dataset of inputs, scores the outputs against criteria (exact match, similarity, LLM-as-judge, custom rubric), and tracks how scores change across model versions, prompts, or code changes. Eval harnesses are to AI features what test runners are to deterministic code: the place CI calls into, the place regressions get caught, the place quality is measured rather than asserted. The 2026 ecosystem has fragmented rather than consolidated: Braintrust is eval-first, Langfuse is prompt-first (acquired by ClickHouse in January), Laminar is built for agent debugging, and Arize Phoenix is OpenTelemetry-native. Most teams pick one platform per workflow rather than expecting one tool to cover everything.
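The core loop of a harness is small enough to sketch. The example below is a hypothetical, tool-agnostic version and does not reflect the API of any platform named above: `run_eval`, the two scorers, and `fake_system` (a stand-in for a real LLM call) are all invented for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# A scorer takes (output, expected) and returns a score in [0, 1].
Scorer = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


def contains_expected(output: str, expected: str) -> float:
    return 1.0 if expected.lower() in output.lower() else 0.0


def run_eval(
    system: Callable[[str], str],
    dataset: List[Tuple[str, str]],
    scorers: Dict[str, Scorer],
) -> Dict[str, float]:
    """Run `system` on every input and return the mean score per scorer."""
    per_scorer: Dict[str, List[float]] = {name: [] for name in scorers}
    for prompt, expected in dataset:
        output = system(prompt)
        for name, scorer in scorers.items():
            per_scorer[name].append(scorer(output, expected))
    return {name: mean(scores) for name, scores in per_scorer.items()}


def fake_system(prompt: str) -> str:
    # Stand-in for an LLM-backed call; a real harness would hit the model here.
    answers = {"capital of France?": "The capital is Paris.", "2 + 2?": "4"}
    return answers.get(prompt, "")


dataset = [("capital of France?", "Paris"), ("2 + 2?", "4")]
report = run_eval(
    fake_system,
    dataset,
    {"exact_match": exact_match, "contains": contains_expected},
)
print(report)
```

Running several scorers side by side shows why the choice of criterion matters: the first answer is correct but fails exact match because of the surrounding sentence.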
Golden dataset
A curated set of inputs paired with known-correct outputs, used to evaluate an AI system's performance over time. For an LLM-backed product, a golden dataset might be 100 representative user questions plus the ideal answer for each. You run the system against the dataset on every release and compare the current output to the gold answer using exact match, similarity scoring, or LLM-as-judge. Without a golden dataset you have vibes, not evaluation. Building and maintaining one is foundational QA work for AI products.
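One plausible way to store and use such a dataset is sketched below, assuming one JSON record per line and a simple character-level similarity from Python's standard `difflib`. The record fields (`id`, `input`, `gold`), the sample questions, and the candidate outputs are hypothetical; a production suite would more likely use embedding similarity or a judge model than `SequenceMatcher`.

```python
import json
from difflib import SequenceMatcher
from typing import Dict, List

# Hypothetical golden records, one JSON object per line (JSONL).
GOLDEN_JSONL = """\
{"id": "q1", "input": "How do I reset my password?", "gold": "Use the 'Forgot password' link on the login page."}
{"id": "q2", "input": "What is the refund window?", "gold": "Refunds are available within 30 days of purchase."}
"""


def load_golden(text: str) -> List[dict]:
    """Parse JSONL text into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def similarity(output: str, gold: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, output.lower(), gold.lower()).ratio()


golden = load_golden(GOLDEN_JSONL)

# Made-up system outputs for the current release.
candidate_outputs: Dict[str, str] = {
    "q1": "Use the 'Forgot password' link on the login page.",
    "q2": "We do not offer refunds.",
}

scores = {
    rec["id"]: similarity(candidate_outputs[rec["id"]], rec["gold"])
    for rec in golden
}
for case_id, score in scores.items():
    print(case_id, round(score, 2))
```

Keeping the dataset in a text format like this makes it easy to version alongside prompts, so a score change can be traced to either a prompt edit or a dataset edit.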
LLM-as-judge
An evaluation pattern where one language model grades another model's output. The judge model is given the input, the output to evaluate, and a rubric — and returns a score or pass/fail verdict. Useful for evaluating qualities that are hard to test deterministically: tone, factual accuracy, helpfulness, refusal of unsafe requests. The catch is that judges are themselves LLMs with their own biases and failure modes — they need to be calibrated against human raters before you trust them at scale. Good for triage and trend-spotting; not a replacement for human eval on critical paths.
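Calibration against human raters can be checked with ordinary agreement statistics. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) between a judge's pass/fail verdicts and human labels on the same items. The verdict lists are invented for illustration, and the handling of the degenerate case where kappa is undefined is a convention chosen here, not a standard.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of items where the judge and the human gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary pass/fail verdicts."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    p_judge = sum(judge) / n  # share of items the judge passed
    p_human = sum(human) / n  # share of items the human passed
    # Probability that both raters agree purely by chance.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave one identical label to every item; kappa is
        # undefined, so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on the same eight outputs (True = pass).
judge_verdicts = [True, True, False, True, False, True, True, False]
human_verdicts = [True, False, False, True, False, True, True, True]

rate = agreement_rate(judge_verdicts, human_verdicts)
kappa = cohens_kappa(judge_verdicts, human_verdicts)
print(f"agreement={rate:.2f} kappa={kappa:.2f}")
```

Raw agreement alone can flatter a judge when most outputs pass anyway, which is why a chance-corrected measure is worth tracking next to it.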