Eval harness

AI & LLM Testing

// Definition

Software that runs an LLM-backed system against a dataset of inputs, scores the outputs against criteria (exact match, similarity, LLM-as-judge, custom rubric), and tracks how scores move across model versions, prompt revisions, and code changes. Eval harnesses are to AI features what test runners are to deterministic code: the place CI calls into, the place regressions get caught, the place quality is measured rather than asserted. The 2026 ecosystem has fragmented rather than consolidated: Braintrust is eval-first, Langfuse is prompt-first (acquired by ClickHouse in January), Laminar is built for agent debugging, and Arize Phoenix is OpenTelemetry-native. Most teams pick one platform per workflow rather than expecting one tool to cover everything.
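As a rough illustration of the run, score, and compare loop described above, here is a minimal sketch in plain Python. It is not how any of the named platforms work internally; `fake_system`, `DATASET`, and the `tolerance` parameter are hypothetical stand-ins, and a real harness would call an actual model and persist scores between runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the normalised strings are identical, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    system: Callable[[str], str],
    dataset: list[Case],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the system on every case and return the mean score."""
    scores = [scorer(system(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores)


def check_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Return True when the current score has not dropped below baseline minus tolerance."""
    return current >= baseline - tolerance


# Hypothetical stand-in for the LLM-backed system under test.
def fake_system(prompt: str) -> str:
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "I don't know")


# Hypothetical three-case dataset; the last case is one the system gets wrong.
DATASET = [
    Case("capital of France?", "paris"),
    Case("2 + 2?", "4"),
    Case("largest planet?", "Jupiter"),
]

score = run_eval(fake_system, DATASET, exact_match)
print(f"mean score: {score:.3f}")
print("passes against baseline 0.90:", check_regression(score, baseline=0.90))
```

In CI, the boolean from `check_regression` is what would fail the build; the baseline would normally come from the last accepted run rather than a hard-coded number.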

// Related terms

LLM-as-judge
An evaluation pattern where one language model grades another model's output. The judge model is given the input, the output to evaluate, and a rubric — and returns a score or pass/fail verdict. Useful for evaluating qualities that are hard to test deterministically: tone, factual accuracy, helpfulness, refusal of unsafe requests. The catch is that judges are themselves LLMs with their own biases and failure modes — they need to be calibrated against human raters before you trust them at scale. Good for triage and trend-spotting; not a replacement for human eval on critical paths.
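To make the calibration step concrete, the sketch below measures how often a judge agrees with human verdicts on a small labelled sample before it is trusted. Everything here is an assumption for illustration: `stub_judge` is a deliberately crude word-count heuristic standing in for a real model call, and `RUBRIC`, `LABELLED`, and the 0.9 agreement threshold are hypothetical.

```python
from typing import Callable

# Hypothetical rubric that a real judge model would be prompted with.
RUBRIC = "Pass only if the answer directly and usefully addresses the question."

Judge = Callable[[str, str, str], bool]


def stub_judge(question: str, answer: str, rubric: str) -> bool:
    """Stand-in for an LLM call: passes any answer with at least four words.

    A real judge would send the question, answer, and rubric to a model
    and parse its verdict. This heuristic exists only so the sketch runs.
    """
    return len(answer.split()) >= 4


# (question, answer, human verdict) triples; hypothetical data.
LABELLED = [
    ("How do I reset my password?", "Open Settings, choose Security, then click Reset.", True),
    ("How do I reset my password?", "No idea.", False),
    ("Is the API free?", "Yes.", True),
    ("Is the API free?", "Maybe.", False),
]


def agreement_rate(judge: Judge, labelled: list[tuple[str, str, bool]]) -> float:
    """Fraction of samples where the judge's verdict matches the human's."""
    matches = sum(
        judge(question, answer, RUBRIC) == human
        for question, answer, human in labelled
    )
    return matches / len(labelled)


rate = agreement_rate(stub_judge, LABELLED)
trusted = rate >= 0.9  # hypothetical bar for using the judge at scale
print(f"agreement with humans: {rate:.2f}, trusted: {trusted}")
```

The stub disagrees with the human on the short but correct answer "Yes.", which is the kind of systematic bias (here, favouring length) that calibration is meant to surface.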
Golden dataset
A curated set of inputs paired with known-correct outputs, used to evaluate an AI system's performance over time. For an LLM-backed product, a golden dataset might be 100 representative user questions plus the ideal answer for each. You run the system against the dataset on every release and compare current output to the gold answer using exact match, similarity scoring, or LLM-as-judge. Without a golden dataset you have vibes, not evaluation. Building and maintaining one is foundational QA work for AI products.
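A per-release comparison against gold answers might look like the sketch below, which uses similarity scoring from Python's standard-library `difflib`. The dataset contents, the `release_candidate` function, and the 0.8 pass threshold are hypothetical; character-level similarity is only one possible scorer and is a poor fit for answers that can be correct in many phrasings.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical golden dataset: each input is paired with its ideal answer.
GOLDEN = [
    {"id": "q1", "input": "What is the capital of France?",
     "gold": "Paris is the capital of France."},
    {"id": "q2", "input": "At what temperature does water boil?",
     "gold": "Water boils at 100 degrees Celsius at sea level."},
    {"id": "q3", "input": "Who wrote Hamlet?",
     "gold": "William Shakespeare wrote Hamlet."},
]


def similarity(output: str, gold: str) -> float:
    """Character-level similarity in [0, 1] after trimming and lowercasing."""
    return SequenceMatcher(None, output.strip().lower(), gold.strip().lower()).ratio()


def evaluate(system: Callable[[str], str], golden: list[dict], threshold: float = 0.8) -> list[dict]:
    """Score every golden case and mark it passed when similarity meets the threshold."""
    results = []
    for case in golden:
        score = similarity(system(case["input"]), case["gold"])
        results.append({"id": case["id"], "score": score, "passed": score >= threshold})
    return results


# Hypothetical stand-in for the release being evaluated.
def release_candidate(question: str) -> str:
    canned = {
        "What is the capital of France?": "Paris is the capital of France.",
        "At what temperature does water boil?": "Water boils at 100 degrees Celsius at sea level",
    }
    return canned.get(question, "I am not sure.")


results = evaluate(release_candidate, GOLDEN)
failed = [r["id"] for r in results if not r["passed"]]
print("failed cases:", failed)
```

Reporting failures by case id, rather than only an aggregate score, makes it easier to see which gold answers a release has drifted away from.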
Large Language Model (LLM)
A neural network trained on massive text datasets to predict the next token (roughly, the next word or word fragment) in a sequence. Modern LLMs like Claude, GPT-4, and Gemini can answer questions, write code, summarise documents, and follow multi-step instructions, but they don't 'know' anything: they predict plausible continuations from patterns in training data. This is why they sometimes produce confident-sounding falsehoods (hallucinations) and why prompt design matters so much. In QA, LLMs are useful for generating test scaffolding, summarising bug reports, and drafting documentation, but their output always needs human review before it ships.