What does a layered test strategy look like for an AI system?

Question

Accepted Answer

Unit tests for deterministic logic, component tests per LLM call (mock the model, test the surrounding code), integration tests for the full pipeline against a golden eval set, and production monitoring with property checks on live sampled traffic. Each layer has different speed, cost, and confidence trade-offs.

The classic test pyramid does not map cleanly onto AI systems because the LLM itself is a black box with non-deterministic output. The adapted pyramid:

Layer 1 (unit tests): all deterministic code (parsers, formatters, routers, validators) tested in isolation. Fast, cheap, fully reliable.

Layer 2 (component tests with mocked LLM): test each component that calls the LLM using a pre-recorded or mocked response. Verifies that your prompt template, response parser, and error handling work correctly for known outputs. Fast and deterministic.

Layer 3 (integration / eval tests with real LLM): run the golden eval set against the full pipeline with a live model. Slower and carries API c
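To make the first three layers concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the ticket-classification task, the names `build_prompt`, `parse_label`, `classify_ticket`, and `golden_pass_rate`, and the convention that the model is any callable taking a prompt string and returning text. It is not tied to a specific provider SDK.

```python
import json
from unittest.mock import Mock

ALLOWED_LABELS = ("billing", "bug", "other")  # hypothetical label set


def build_prompt(ticket_text: str) -> str:
    """Deterministic prompt template."""
    return f"Classify this ticket. Reply as JSON with a 'label' key.\n{ticket_text}"


def parse_label(raw: str) -> str:
    """Deterministic parser; malformed or unexpected output becomes 'unknown'."""
    try:
        label = json.loads(raw)["label"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "unknown"
    return label if label in ALLOWED_LABELS else "unknown"


def classify_ticket(ticket_text: str, llm) -> str:
    """Component under test; `llm` is an assumed callable: prompt -> raw text."""
    try:
        raw = llm(build_prompt(ticket_text))
    except TimeoutError:
        return "unknown"
    return parse_label(raw)


def golden_pass_rate(cases, llm) -> float:
    """Layer 3 sketch: fraction of (text, expected_label) cases that match."""
    hits = sum(classify_ticket(text, llm) == expected for text, expected in cases)
    return hits / len(cases)


# Layer 1: unit tests on deterministic code, no model involved.
def test_parse_label_handles_good_and_bad_output():
    assert parse_label('{"label": "bug"}') == "bug"
    assert parse_label("not json at all") == "unknown"
    assert parse_label('{"label": "refund"}') == "unknown"


# Layer 2: component tests with the model replaced by a mock.
def test_classify_ticket_with_mocked_llm():
    fake_llm = Mock(return_value='{"label": "billing"}')
    assert classify_ticket("I was charged twice", fake_llm) == "billing"
    fake_llm.assert_called_once()
    # The prompt template should carry the ticket text through to the model.
    assert "I was charged twice" in fake_llm.call_args.args[0]


def test_classify_ticket_survives_timeout():
    failing_llm = Mock(side_effect=TimeoutError)
    assert classify_ticket("anything", failing_llm) == "unknown"
```

The test functions can be collected by pytest or called directly. In a real Layer 3 run, `golden_pass_rate` would receive a callable that wraps the live model, and the resulting pass rate would be compared against a threshold you choose rather than asserted case by case, since individual outputs can vary between runs.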

// WHAT INTERVIEWERS LOOK FOR