How do you test a feature whose output isn't deterministic — the same input can produce different text each time?
JuniorStop asserting equality. Assert properties the output must hold (format, grounding, contains required facts, excludes forbidden content) and evaluate against a graded rubric rather than a fixed string.
// What interviewers look for
The mental shift from exact-match assertions to property- and eval-based checks: you test characteristics of a good answer, not one canonical answer.
Common pitfall
Trying to assert the output equals a fixed expected string, producing a flaky test that fails on every valid rewording.
Model answer
The core shift is that there's no single correct output, so exact-match assertions are wrong by construction — they'd flag every valid rephrasing as a failure. Instead I assert properties a good answer must satisfy: structural ones (valid JSON, required fields, length bounds), content ones (mentions the required facts, stays grounded in the provided context, doesn't include forbidden or unsafe content), and behavioural ones (refuses out-of-scope requests). For subjective quality I use a graded rubric — often an LLM-as-judge or human rating against criteria — and assert a score threshold across an eval set rather than pass/fail on one example. I'd run multiple samples per input to characterise the distribution, since one input has a range of outputs, and watch variance, not just the mean. I'd also pin what should be deterministic (temperature 0 for extraction tasks) so I'm only tolerating non-determinism where it's inherent. The principle is testing the qualities of a correct answer, not a canonical answer.