How would you test a feature powered by an LLM, given the same input can produce different outputs?

Question

Accepted Answer

Clarify the feature's job first, then layer the tests: deterministic parts get normal unit tests; LLM outputs get property checks (schema, format, banned content, groundedness); LLM-as-judge handles quality rubrics; a golden eval set tracks regressions. Manage risk statistically, not as single-run pass/fail.

Clarify first: what is this feature supposed to do? Is there one right answer, or a range of acceptable ones? What is the cost of a bad output? A customer-facing summary where hallucinated facts could mislead users sets a higher bar than an internal draft that a human reviews before publishing.

Layer the tests:

Deterministic parts (input parsing, output formatting, routing logic) get normal unit tests.
Per-call output quality gets property checks: required fields present, length within bounds, no PII in the response, claims grounded in the source.
Rubric-based quality goes to an LLM-as-judge that scores helpfulness, accuracy, and tone, with its scores sampled and spot-checked against human ratings.
Regression is tracked with a golden eval set of representative inputs.
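The property checks and the "statistical, not single-run" idea can be sketched roughly as below. This is a minimal illustration, not a definitive harness: the JSON shape with a `summary` field, the helper names `check_properties`, `pass_rate`, and `make_stub_model`, the email-only PII regex, and the "every number must appear in the source" groundedness proxy are all assumptions made for the example. A real feature would call its actual model client where the stub is used.

```python
import json
import re
from typing import Callable

# Assumption: PII detection is reduced to a simple email pattern for illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def check_properties(raw_output: str, source: str, max_chars: int = 400) -> list[str]:
    """Return a list of property violations; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary:
        return ["missing 'summary' field"]

    failures = []
    if len(summary) > max_chars:
        failures.append("summary too long")
    if EMAIL_RE.search(summary):
        failures.append("possible PII (email address)")
    # Crude groundedness proxy: every number in the summary must occur in the source.
    source_numbers = set(NUMBER_RE.findall(source))
    for number in NUMBER_RE.findall(summary):
        if number not in source_numbers:
            failures.append(f"ungrounded number: {number}")
    return failures


def pass_rate(generate: Callable[[str], str], source: str, runs: int = 20) -> float:
    """Call the model `runs` times on the same input and return the fraction that pass."""
    passed = sum(1 for _ in range(runs) if not check_properties(generate(source), source))
    return passed / runs


def make_stub_model() -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: every fifth answer hallucinates a figure."""
    calls = 0

    def generate(source: str) -> str:
        nonlocal calls
        calls += 1
        if calls % 5 == 0:
            return json.dumps({"summary": "Revenue grew 15% in 2023."})
        return json.dumps({"summary": "Revenue grew 12% in 2023."})

    return generate


if __name__ == "__main__":
    source_text = "Revenue grew 12% in 2023."
    rate = pass_rate(make_stub_model(), source_text, runs=20)
    # Gate on a threshold over many runs instead of a single pass/fail.
    assert rate >= 0.75, f"pass rate {rate:.2f} is below the threshold"
```

The threshold (0.75 here) is a placeholder; it would normally follow from the cost of a bad output, so a customer-facing summary would warrant a stricter gate than an internal draft.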


// WHAT INTERVIEWERS LOOK FOR