Q9 of 21 · Testing AI systems
What does a layered test strategy look like for an AI system?
Short answer
Use unit tests for deterministic logic, component tests for each LLM call (mock the model, test the surrounding code), integration tests that run the full pipeline against a golden eval set, and production monitoring with property checks on sampled live traffic. Each layer makes a different trade-off between speed, cost, and confidence.
Detail
The classic test pyramid does not map cleanly onto AI systems because the LLM itself is a black box with non-deterministic output. The adapted pyramid:
Layer 1 — Unit tests: all deterministic code (parsers, formatters, routers, validators) tested in isolation. Fast, cheap, fully reliable.
Layer 2 — Component tests with mocked LLM: test each component that calls the LLM using a pre-recorded or mocked response. Verifies that your prompt template, response parser, and error handling work correctly for known outputs. Fast and deterministic.
Layer 3 — Integration / eval tests with real LLM: run the golden eval set against the full pipeline with a live model. Slower, and each run incurs API cost. Run before each release, not on every PR.
Layer 4 — Production monitoring: property checks on live sampled traffic. LLM-as-judge scoring on a daily sample. Alert on aggregate score drops.
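As a rough illustration of how Layers 1, 2, and 4 can share code, here is a minimal Python sketch. Everything in it is hypothetical: the ticket-classification task, the names `parse_label`, `classify_ticket`, and `check_properties`, and the assumption that the model is reached through an injected `call_llm` function (prompt in, raw text out). It is one possible shape under those assumptions, not a prescribed implementation.

```python
import json
from typing import Callable

# Assumed interface for the model call: prompt in, raw text out.
# Injecting it lets tests replace the real model with a fake.
LLMFn = Callable[[str], str]

ALLOWED_LABELS = {"billing", "bug", "other"}  # hypothetical label set


def parse_label(raw: str) -> str:
    """Layer 1 target: deterministic parsing of the model's JSON reply."""
    try:
        label = json.loads(raw)["label"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "other"  # fall back instead of crashing on malformed output
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return "other"


def classify_ticket(text: str, call_llm: LLMFn) -> str:
    """Layer 2 target: prompt template + model call + response parsing."""
    prompt = (
        "Classify this ticket as billing, bug, or other. "
        'Reply as JSON {"label": ...}.\n\n' + text
    )
    return parse_label(call_llm(prompt))


def check_properties(label: str) -> list[str]:
    """Layer 4 target: invariants that should hold for any sampled live output."""
    violations = []
    if label not in ALLOWED_LABELS:
        violations.append(f"unexpected label: {label!r}")
    return violations


def test_parse_label_handles_malformed_output():
    # Layer 1: no model involved at all.
    assert parse_label('{"label": "bug"}') == "bug"
    assert parse_label("not json") == "other"
    assert parse_label('{"label": "spam"}') == "other"


def test_classify_ticket_with_mocked_llm():
    # Layer 2: a canned reply stands in for the model, so the test is
    # fast and deterministic and can also inspect the rendered prompt.
    seen_prompts = []

    def fake_llm(prompt: str) -> str:
        seen_prompts.append(prompt)
        return '{"label": "billing"}'

    assert classify_ticket("I was charged twice", fake_llm) == "billing"
    assert "I was charged twice" in seen_prompts[0]
```

The two `test_` functions follow pytest naming conventions but use only plain `assert`, so they can also be called directly. Layer 3 would run the same `classify_ticket` with a real model client over the golden eval set, and Layer 4 would apply `check_properties` to a sample of production outputs and alert when the violation rate rises.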
See "New test pyramid for AI" for the full model.