Q2 of 21 · Testing AI systems
How would you test a feature powered by an LLM, given the same input can produce different outputs?
Short answer
Clarify the feature's job first, then layer the tests: deterministic parts get normal unit tests; LLM outputs get property checks (schema, format, banned content, groundedness); an LLM-as-judge scores quality against a rubric; a golden eval set tracks regressions. Manage risk statistically, not as single-run pass/fail.
Detail
Clarify first: what is this feature supposed to do? Is there a single right answer, or a range of acceptable ones? What's the cost of a bad output? A customer-facing summary where hallucinated facts could mislead users sets a higher bar than an internal draft that a human reviews before publishing.
Layer the tests:
- Deterministic parts (input parsing, output formatting, routing logic) get normal unit tests.
- Per-call output quality: property checks — required fields present, length within bounds, no PII in the response, claims grounded in the source.
- Rubric-based quality: LLM-as-judge evaluates helpfulness, accuracy, and tone, sampled and spot-checked against humans.
- Regression: a golden eval set of representative inputs, re-run on every prompt or model change. Flag statistically significant drops in quality score, not single-run noise.
- Adversarial: prompt injection attempts, jailbreak probes, inputs designed to trigger hallucination.
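The property-check layer can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes the feature returns JSON with hypothetical `summary` and `citations` fields, uses an email regex as a stand-in for real PII detection, and treats "every citation appears verbatim in the source" as a crude proxy for groundedness. The function name, field names, and length limit are all assumptions.

```python
import json
import re

# Assumptions for illustration: field names, length limit, and the email-only PII check.
REQUIRED_FIELDS = {"summary", "citations"}
MAX_SUMMARY_CHARS = 600
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_output(raw: str, source: str) -> list[str]:
    """Return a list of property violations; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures = []

    # Schema: required fields present.
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")

    # Format: summary is a string within the length bound.
    summary = data.get("summary", "")
    if not isinstance(summary, str):
        failures.append("summary is not a string")
        summary = ""
    if len(summary) > MAX_SUMMARY_CHARS:
        failures.append("summary too long")

    # Banned content: a very rough PII check (email addresses only).
    if EMAIL_RE.search(summary):
        failures.append("possible PII: email address")

    # Groundedness proxy: each cited quote must appear verbatim in the source.
    citations = data.get("citations", [])
    if not isinstance(citations, list):
        failures.append("citations is not a list")
        citations = []
    for quote in citations:
        if not isinstance(quote, str) or quote not in source:
            failures.append(f"ungrounded citation: {quote!r}")

    return failures
```

Because these checks hold for any acceptable output, they can run on every call regardless of how the wording varies between runs.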
Close: the eval harness is a first-class deliverable. Keep a human in the loop for high-stakes outputs. Manage risk statistically, not as a binary pass/fail. See Evaluation methods for the full taxonomy.
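One way to make "statistically, not pass/fail" concrete is to compare pass counts on the golden eval set between a baseline and a candidate build. The sketch below uses a one-sided two-proportion z-test with a normal approximation. It assumes each eval item is an independent pass/fail trial and that the sample is large enough for the approximation to hold; the function name and the 0.05 threshold are illustrative choices, not a prescribed method.

```python
import math


def pass_rate_drop_is_significant(
    base_pass: int, base_n: int, cand_pass: int, cand_n: int, alpha: float = 0.05
) -> bool:
    """One-sided two-proportion z-test: is the candidate pass rate lower than baseline?"""
    p_base = base_pass / base_n
    p_cand = cand_pass / cand_n

    # Pooled pass rate under the null hypothesis that both builds are equally good.
    pooled = (base_pass + cand_pass) / (base_n + cand_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cand_n))
    if se == 0:
        # Both builds passed everything (or nothing): no evidence of a drop.
        return False

    z = (p_base - p_cand) / se
    # Upper-tail probability P(Z >= z) for a standard normal variable.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return p_value < alpha


# A 90% -> 88% change on 200 items is within noise; 90% -> 75% is not.
print(pass_rate_drop_is_significant(180, 200, 176, 200))  # False
print(pass_rate_drop_is_significant(180, 200, 150, 200))  # True
```

A gate like this flags the regression only when the drop is unlikely to be run-to-run variation, which is the distinction a single failing sample cannot make.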