Q19 of 38 · Test design
How does test design change for an AI/ML model output vs deterministic code?
Short answer
Deterministic code: assert exact outputs, apply equivalence partitioning and boundary value analysis (EP/BVA) to inputs, and measure branch coverage. ML models: assert *distributions* and *invariants* (output stays in a valid range, is monotonic in the expected direction, is robust to small perturbations), monitor drift in production, and rely on property-based testing more than example-based testing.
Detail
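Testing an ML model's output is fundamentally different because the model is a learned approximation rather than a hand-written specification. Even when inference itself is repeatable, the right answer for a given input is usually probabilistic, so correctness is judged statistically rather than case by case.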
What changes:
Assertions become invariants, not equalities. Deterministic: `assert classify(image) == "cat"`. ML: `assert classify(image).confidence > 0.5` for the obvious case; `assert classify(rotated_image).top_class == classify(image).top_class` for invariance under rotation.
Test data becomes the test suite. A deterministic suite has 50 test cases. An ML test suite has hundreds or thousands of input-output pairs (a labelled dataset), and the metric is aggregate (accuracy, F1, precision/recall by class), not per-case.
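To make the aggregate-metric point concrete, here is a minimal sketch of a release gate that scores a model over a labelled set instead of asserting each case. The `predict` function, the six-row dataset, and the 0.6 thresholds are all hypothetical stand-ins chosen for illustration; a real suite would load a versioned dataset and call the actual model.

```python
from collections import Counter

def predict(text: str) -> str:
    """Hypothetical stand-in for a model: a keyword rule, for illustration only."""
    return "spam" if "free" in text.lower() else "ham"

# Hypothetical labelled dataset: (input, expected label) pairs.
LABELLED = [
    ("Free money now", "spam"),
    ("FREE tickets inside", "spam"),
    ("Win a prize today", "spam"),      # the stand-in model misses this one
    ("Lunch at noon?", "ham"),
    ("Meeting notes attached", "ham"),
    ("Feel free to call me", "ham"),    # the stand-in model flags this wrongly
]

def evaluate(dataset):
    """Return accuracy plus precision and recall for the 'spam' class."""
    counts = Counter()
    for text, label in dataset:
        pred = predict(text)
        counts["correct"] += pred == label
        counts["tp"] += pred == "spam" and label == "spam"
        counts["fp"] += pred == "spam" and label == "ham"
        counts["fn"] += pred == "ham" and label == "spam"
    return {
        "accuracy": counts["correct"] / len(dataset),
        "precision": counts["tp"] / (counts["tp"] + counts["fp"]),
        "recall": counts["tp"] / (counts["tp"] + counts["fn"]),
    }

def test_model_meets_release_bar():
    """The suite passes on aggregate quality even though two rows are wrong."""
    metrics = evaluate(LABELLED)
    assert metrics["accuracy"] >= 0.6
    assert metrics["recall"] >= 0.6

test_model_meets_release_bar()
```

Individual misclassifications do not fail the build here; only a drop in the aggregate below the agreed bar does.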
EP/BVA become slice-based testing. Instead of "test one value per equivalence class", you test slices of the input distribution: model performance on rare classes, on minority demographic groups (fairness), on out-of-distribution inputs. Each slice has its own metrics.
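Slice-based testing can be sketched as a per-slice metric with its own floor. The slice names, the records, and the 0.6 floor below are hypothetical; the point is only that a healthy overall number can hide a failing slice.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """records: iterable of (slice_name, y_true, y_pred). Returns {slice: accuracy}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        totals[slice_name] += 1
        hits[slice_name] += int(y_true == y_pred)
    return {name: hits[name] / totals[name] for name in totals}

def failing_slices(records, min_accuracy):
    """Names of slices whose accuracy falls below the floor, sorted for stable output."""
    scores = accuracy_by_slice(records)
    return sorted(name for name, acc in scores.items() if acc < min_accuracy)

# Hypothetical evaluation records: (slice, true label, predicted label).
RECORDS = [
    ("common_class", 1, 1), ("common_class", 0, 0),
    ("common_class", 1, 1), ("common_class", 0, 0),
    ("rare_class", 1, 0), ("rare_class", 1, 1),
    ("out_of_distribution", 0, 1), ("out_of_distribution", 1, 1),
    ("out_of_distribution", 0, 1),
]

overall = sum(int(t == p) for _, t, p in RECORDS) / len(RECORDS)
print(f"overall accuracy: {overall:.2f}")
print("slices below floor:", failing_slices(RECORDS, min_accuracy=0.6))
```

In this toy data the overall accuracy clears the floor while two slices do not, which is the situation slice metrics exist to expose.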
Property-based and metamorphic testing dominate. Properties: "the output should be monotonic in price." Metamorphic: "if I add a benign augmentation (resize, mild rotation), the output should not change drastically." These are testable invariants without knowing the exact correct answer.
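Both ideas can be checked without knowing any exact expected output. The sketch below uses only the standard library and seeded random sampling; `predict_demand` is a hypothetical stand-in for a trained model, and the 1% nudge and 5% tolerance are illustrative assumptions. A dedicated property-based testing library would typically replace the hand-rolled sampling loops.

```python
import math
import random

def predict_demand(price: float, ad_spend: float) -> float:
    """Hypothetical model stand-in: demand falls with price, rises with ad spend."""
    return 100.0 * math.exp(-0.05 * price) + 2.0 * math.log1p(ad_spend)

def check_monotonic_in_price(model, trials=200, seed=0):
    """Property: raising the price must never raise predicted demand."""
    rng = random.Random(seed)
    for _ in range(trials):
        price = rng.uniform(1.0, 100.0)
        bump = rng.uniform(0.01, 10.0)
        ad = rng.uniform(0.0, 50.0)
        if model(price + bump, ad) > model(price, ad):
            return False
    return True

def check_stable_under_small_change(model, rel_tol=0.05, trials=200, seed=1):
    """Metamorphic relation: a 1% change in ad spend moves the output by at most rel_tol."""
    rng = random.Random(seed)
    for _ in range(trials):
        price = rng.uniform(1.0, 100.0)
        ad = rng.uniform(1.0, 50.0)
        base = model(price, ad)
        nudged = model(price, ad * 1.01)
        if abs(nudged - base) > rel_tol * abs(base):
            return False
    return True

assert check_monotonic_in_price(predict_demand)
assert check_stable_under_small_change(predict_demand)

# A deliberately wrong model (demand rising with price) is caught by the property.
assert not check_monotonic_in_price(lambda price, ad: price)
```

Seeding the generator keeps a failing case reproducible, which matters once these checks run in CI.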
Robustness testing is first-class. Adversarial inputs: tiny perturbations to the input should not flip the output. This concern is largely specific to ML: conventional code rarely flips its result on an imperceptible input change, so there is seldom an equivalent property to test.
Production monitoring substitutes for some test coverage. Drift detection: are the input distribution and output distribution today different from training? Performance monitoring: is the model's accuracy in production stable? Many ML failures aren't catchable by pre-deployment tests; they only manifest under data drift.
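One way to sketch drift detection is the Population Stability Index (PSI), which compares binned proportions of a feature between training data and recent production data. PSI is only one of several possible statistics, and the bin count, the epsilon floor, and the 0.2 alert threshold below are assumptions to tune, not fixed rules. The data is synthetic.

```python
import math
import random

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bins are equal-width over the range of `expected`; out-of-range values in
    `actual` are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Synthetic feature values: training, a same-distribution day, and a shifted day.
rng = random.Random(42)
training = [rng.gauss(0.0, 1.0) for _ in range(2000)]
stable_day = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted_day = [rng.gauss(1.5, 1.0) for _ in range(2000)]

ALERT_THRESHOLD = 0.2  # assumed alert level; tune per feature
for name, sample in [("stable", stable_day), ("shifted", shifted_day)]:
    score = psi(training, sample)
    print(f"{name}: PSI={score:.3f} alert={score > ALERT_THRESHOLD}")
```

The same function can be pointed at the model's output scores to watch for output drift as well as input drift.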
Test sets need refreshing. The world changes; a 3-year-old test set may not reflect current production data.
A senior interview answer also acknowledges: uncertainty as a test target (calibrated confidences); latency, cost, and model size as part of the contract; and reproducibility (random seeds, data versioning, pipeline determinism).
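Calibration can itself be a test target. The sketch below computes an expected calibration error (ECE): the weighted average gap between stated confidence and observed accuracy across confidence bins. The bin count and the sample predictions are hypothetical, and an acceptable ECE ceiling would have to be agreed per product.

```python
def expected_calibration_error(confidences, correct, bins=5):
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    n = len(confidences)
    buckets = [[] for _ in range(bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * bins), bins - 1)  # confidence 1.0 goes in the last bin
        buckets[idx].append((conf, ok))
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / n * abs(mean_conf - accuracy)
    return ece

# Hypothetical predictions: ten answers, all reported at 90% confidence.
confidences = [0.9] * 10
well_calibrated = [True] * 9 + [False]       # 9 of 10 right, matching the 0.9 claim
overconfident = [True] * 5 + [False] * 5     # only 5 of 10 right

ece_good = expected_calibration_error(confidences, well_calibrated)
ece_bad = expected_calibration_error(confidences, overconfident)
assert ece_good < 0.05
assert ece_bad > 0.3
```

Both models report identical confidences; only the comparison against outcomes separates the calibrated one from the overconfident one.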