The new test pyramid

8 min read · Reviewed May 2026 · framing

Testing deterministic software and testing LLM-backed software are different jobs. The pyramid metaphor still applies, but the layers are different — unit tests shrink, evaluation grows into its own discipline, and the definition of "correct" needs rethinking from the ground up. Get the mental model right before reaching for tools, or you will spend months automating the wrong things.

How testing LLM products differs from deterministic ones

The single-answer assumption built into most test frameworks does not hold for LLM output, and everything downstream of that assumption breaks with it.

Classical test automation rests on one implicit assumption: the same input produces the same output, every time. Unit tests, integration tests, and end-to-end suites all assert that given state X, the system produces state Y. When you introduce an LLM into the product, that assumption quietly collapses. The same prompt to the same model, even at temperature zero, can produce different tokens depending on hardware, batching, and numerical precision. At any non-zero temperature, variance is by design.

This breaks unit tests in the traditional sense. You cannot write an assertion that checks the exact string the model produced, because next Tuesday the string will be different. What you can assert is the plumbing around the model — the API calls, the context assembly, the retrieval pipeline, the post-processing. Those layers remain deterministic and are still worth covering. The mistake most teams make is trying to cover the model output the same way, and then watching their CI suite turn flaky and red.
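
As a minimal sketch of what "testing the plumbing" can look like, the example below swaps the model for a stub so the context assembly and post-processing can be asserted exactly. The names `Document`, `build_prompt`, and `answer` are hypothetical and chosen for illustration; your own pipeline will have different seams.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    doc_id: str
    text: str


def build_prompt(question: str, docs: List[Document], max_docs: int = 2) -> str:
    """Assemble context deterministically: same inputs, same prompt."""
    selected = docs[:max_docs]
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in selected)
    return f"Context:\n{context}\n\nQuestion: {question.strip()}"


def answer(question: str, docs: List[Document],
           call_model: Callable[[str], str]) -> str:
    """Plumbing around the model: build the prompt, call the model, tidy up."""
    raw = call_model(build_prompt(question, docs))
    return raw.strip()


def test_prompt_contains_only_top_docs() -> None:
    docs = [
        Document("a", "Refunds take 5 days."),
        Document("b", "Shipping is free."),
        Document("c", "Stores close at 6pm."),
    ]
    prompt = build_prompt("  How long do refunds take? ", docs)
    assert prompt == (
        "Context:\n[a] Refunds take 5 days.\n[b] Shipping is free.\n\n"
        "Question: How long do refunds take?"
    )


def test_answer_passes_prompt_and_strips_output() -> None:
    seen = []

    def fake_model(prompt: str) -> str:
        # Stub stands in for the real model so the test stays deterministic.
        seen.append(prompt)
        return "  Five days.\n"

    result = answer("Q?", [Document("a", "x")], fake_model)
    assert result == "Five days."
    assert seen == ["Context:\n[a] x\n\nQuestion: Q?"]


test_prompt_contains_only_top_docs()
test_answer_passes_prompt_and_strips_output()
```

Nothing here asserts what a real model says. The exact-string assertions are safe only because the stub's output is fixed.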

The pyramid changes shape. At the bottom, you still have deterministic unit and integration tests — they cover the non-LLM code. Above that, you have evaluation: testing over distributions of inputs and expected quality criteria, not individual input/output pairs. At the top, you have adversarial and observability layers — red-teaming, failure-mode coverage, and production quality signals. The middle layer, evaluation, is where most of the interesting testing work now lives, and it is a distinct engineering discipline most QA engineers have never had to build before.

Behaviour tests grow to compensate. Instead of asserting a string, you assert a property: the response is grounded in the retrieved context, the response does not contain personally identifiable information, the response addresses the user's intent. These properties can be checked with LLM-as-judge, with rule-based filters, with semantic similarity — none of which look like the assertEqual calls your existing framework was built around.
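
To make the shift from strings to properties concrete, here is a rough sketch of two rule-based property checks. Both are deliberately crude stand-ins: the regular expressions are illustrative rather than a complete PII detector, and "every number in the response appears in the context" is only a proxy for groundedness. All function names are hypothetical.

```python
import re
from typing import Dict

# Illustrative patterns only; real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def contains_pii(text: str) -> bool:
    """Rule-based filter: flag anything that looks like an email or phone number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def numbers_are_grounded(response: str, context: str) -> bool:
    """Crude groundedness proxy: each number in the response occurs in the context."""
    context_numbers = set(NUMBER_RE.findall(context))
    return all(n in context_numbers for n in NUMBER_RE.findall(response))


def check_properties(response: str, context: str) -> Dict[str, bool]:
    """Assert properties of the response instead of its exact wording."""
    return {
        "no_pii": not contains_pii(response),
        "numbers_grounded": numbers_are_grounded(response, context),
    }


context = "Refunds are processed within 5 business days."

# Two different phrasings satisfy the same properties.
for reply in ("You should get your refund within 5 business days.",
              "Refunds arrive in 5 days."):
    assert all(check_properties(reply, context).values())

# A reply that invents a number and leaks an address fails both.
bad = check_properties("Refunds take 10 days; email jo@example.com.", context)
assert bad == {"no_pii": False, "numbers_grounded": False}
```

Checks like "addresses the user's intent" do not reduce to a regular expression; that is where an LLM judge or semantic similarity would take over from rules of this kind.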

Defining 'correct' when output varies

Correctness for LLM output is a range and a rubric, not a string — and the decision of how wide to set the range is itself a product decision.

Exact-match correctness is the wrong starting point for LLM output. The question is not "did the model produce this exact response?" but "is the response within the acceptable range of quality for this input?" That reframing changes what your test infrastructure looks like entirely. You need a definition of correctness that handles legitimate variance without treating every semantically equivalent paraphrase as a failure.

Three methods cover most cases. Exact-match is appropriate for structured outputs: JSON with a defined schema, a classification into a fixed set of labels, a specific piece of extracted data. Similarity-based correctness (embedding similarity, or n-gram overlap metrics such as BLEU) works for free text where multiple phrasings are equally valid, but be careful: a confident wrong answer can score high on similarity to a confident right one, because the two may differ by a single word or number. Rubric-based correctness is the most powerful and the hardest to implement: you define evaluation criteria (groundedness, completeness, tone, factual accuracy) and score outputs against those criteria, either manually or with an LLM judge.
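
The three methods can be sketched side by side. This is an illustration under stated assumptions, not a recommended implementation: `token_jaccard` is a toy lexical metric standing in for embedding similarity, and the rubric scores are supplied by hand where a human rater or LLM judge would normally produce them. All function names are hypothetical.

```python
import json
from typing import Dict


def exact_match_json(output: str, expected: dict) -> bool:
    """Structured output: parse first, then compare values, not raw strings."""
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False


def token_jaccard(a: str, b: str) -> float:
    """Toy similarity: shared tokens divided by all tokens across both texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def rubric_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-criterion scores, each in the range 0 to 1."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total


# Exact match: key order and whitespace do not matter once parsed.
assert exact_match_json('{"label": "refund", "priority": 2}',
                        {"priority": 2, "label": "refund"})

# Similarity pitfall: the wrong answer outscores the correct paraphrase.
reference = "refunds are processed within 5 business days"
wrong = "refunds are processed within 50 business days"
paraphrase = "you will get your money back in 5 working days"
assert token_jaccard(reference, wrong) > token_jaccard(reference, paraphrase)

# Rubric: groundedness weighted double (hand-assigned scores for illustration).
score = rubric_score(
    {"groundedness": 1.0, "completeness": 0.5, "tone": 1.0},
    {"groundedness": 2, "completeness": 1, "tone": 1},
)
assert score == 0.875
```

The middle assertion is the warning from the paragraph above in miniature: one changed digit barely moves the similarity score, while a correct answer in different words scores poorly.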

Tolerance budgets are the practical expression of this. For a given feature, you define the acceptable failure rate: the percentage of outputs that can fall below the quality threshold before you consider the feature broken. This is not lowering the bar; it is being honest about the distributional nature of the output and making a product decision about acceptable risk. A customer-facing summarisation feature might tolerate 2% low-quality outputs. A medical triage feature might tolerate zero. Those are different numbers, and they should be explicit decisions, not implicit ones left to chance.
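
One way to make the budget explicit is to write it down as configuration and gate on it. The sketch below assumes each output has already been given a quality score between 0 and 1 by some upstream evaluator; the feature names, the 0.7 threshold, and the function names are hypothetical, while the 2% and zero budgets echo the examples above.

```python
from typing import Sequence

# Maximum acceptable share of low-quality outputs, per feature.
TOLERANCE_BUDGETS = {
    "summarisation": 0.02,
    "medical_triage": 0.0,
}


def failure_rate(scores: Sequence[float], quality_threshold: float) -> float:
    """Share of outputs whose quality score falls below the threshold."""
    if not scores:
        raise ValueError("need at least one scored output")
    failures = sum(1 for s in scores if s < quality_threshold)
    return failures / len(scores)


def within_budget(feature: str, scores: Sequence[float],
                  quality_threshold: float = 0.7) -> bool:
    """True when the observed failure rate does not exceed the feature's budget."""
    return failure_rate(scores, quality_threshold) <= TOLERANCE_BUDGETS[feature]


# 100 scored outputs, one of them below the threshold: a 1% failure rate.
scores = [0.9] * 99 + [0.4]

assert within_budget("summarisation", scores)       # 1% is inside a 2% budget
assert not within_budget("medical_triage", scores)  # any failure breaks a zero budget
```

The same set of outputs passes for one feature and fails for the other, which is the point: the budget is a product decision recorded in one place, not a property of the model. Note that passing a zero budget on a finite sample only shows that no failures were observed in that sample.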

The golden dataset is where this becomes concrete. You curate a set of real production inputs with documented expected quality: not exact expected outputs, but quality criteria. You run every model or prompt change against that dataset and measure what share of outputs fall below the quality threshold. If that share goes up, and especially if it exceeds your tolerance budget, something regressed. This is the equivalent of a failing unit test, expressed in the language of distributions.
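
A golden-dataset run can be sketched in a few lines. In this illustration each case stores criteria rather than an expected string, and two dictionaries stand in for the outputs of a baseline prompt and a candidate prompt; a real run would call the model and would likely use richer checks than substring matching. `GoldenCase`, `meets_criteria`, and `run_golden` are hypothetical names, and the two cases are invented for the example rather than drawn from production.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    case_id: str
    user_input: str
    must_mention: List[str]  # quality criteria, not an expected output string


def meets_criteria(output: str, case: GoldenCase) -> bool:
    """Simplistic check: every required term appears somewhere in the output."""
    lowered = output.lower()
    return all(term.lower() in lowered for term in case.must_mention)


def run_golden(cases: List[GoldenCase], generate: Callable[[str], str]) -> float:
    """Return the share of golden cases whose output misses its criteria."""
    failed = [c.case_id for c in cases
              if not meets_criteria(generate(c.user_input), c)]
    return len(failed) / len(cases)


GOLDEN = [
    GoldenCase("refund-1", "How long do refunds take?", ["5 business days"]),
    GoldenCase("ship-1", "Is shipping free?", ["free"]),
]

# Canned outputs standing in for two prompt versions.
BASELINE = {
    "How long do refunds take?": "Refunds arrive within 5 business days.",
    "Is shipping free?": "Yes, standard shipping is free.",
}
CANDIDATE = {
    "How long do refunds take?": "Refunds arrive soon.",
    "Is shipping free?": "Shipping is free on all orders.",
}

baseline_rate = run_golden(GOLDEN, BASELINE.__getitem__)
candidate_rate = run_golden(GOLDEN, CANDIDATE.__getitem__)

assert baseline_rate == 0.0
assert candidate_rate == 0.5
assert candidate_rate > baseline_rate  # the candidate prompt regressed
```

The candidate's shipping answer is worded differently from the baseline's and still passes; only the refund answer, which dropped the required detail, counts as a failure.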
