Q11 of 21 · Testing AI systems

How do you test for model or prompt-version drift when the underlying LLM changes?

Testing AI systems · Senior · testing-ai-systems · model-drift · version-drift · regression · evaluation · llm

Short answer

Re-run your golden eval set against the new model or prompt version and compare per-dimension quality scores against the baseline with a significance test rather than by eye. A statistically significant drop in any dimension flags a regression before the change reaches production.

Detail

Drift here means a change in output quality, style, or behaviour caused either by the LLM provider updating the model or by a change to your own prompt template. Neither change is inherently a regression, but either can break your feature in unexpected ways.

Testing workflow for a model change:

  1. Run the full golden eval set against the current model and record baseline scores.
  2. Switch to the new model in a shadow environment.
  3. Run the same eval set against the new model.
  4. Test the score difference for significance per dimension (accuracy, format compliance, length, groundedness). Both runs score the same cases, so use a paired test on the per-case differences.
  5. Review cases where the score changed most sharply — both regressions and unexpected improvements, which may indicate the rubric is incomplete.
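
Steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: it assumes each run yields one numeric score per eval case per dimension, in the same case order, and it uses a sign-flip permutation test as one possible paired significance test. The function names and the report shape are hypothetical.

```python
import random


def paired_permutation_pvalue(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-case score differences."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must score the same eval cases")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)  # fixed seed so the gate is reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / n) >= observed:
            hits += 1
    # The +1 correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


def compare_runs(baseline, candidate):
    """Per-dimension mean score change and p-value (step 4).

    Both arguments map dimension name -> list of per-case scores.
    """
    report = {}
    for dim, base_scores in baseline.items():
        cand_scores = candidate[dim]
        delta = sum(cand_scores) / len(cand_scores) - sum(base_scores) / len(base_scores)
        report[dim] = {
            "delta": delta,
            "p_value": paired_permutation_pvalue(base_scores, cand_scores),
        }
    return report


def largest_shifts(baseline, candidate, k=5):
    """Indices of the k cases whose score moved most, in either direction (step 5)."""
    diffs = [c - b for b, c in zip(baseline, candidate)]
    return sorted(range(len(diffs)), key=lambda i: abs(diffs[i]), reverse=True)[:k]
```

When several dimensions are tested at once, a stricter per-dimension threshold (for example a Bonferroni-style alpha divided by the number of dimensions) is one way to limit false alarms.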

For prompt changes, use the same process. Even a minor wording change can shift the output distribution significantly. If the eval set is too small to detect the expected effect size, supplement it with an A/B test or shadowed production traffic.
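
Whether the eval set is "too small" can be estimated before running anything. The sketch below assumes a pass/fail metric and uses the standard two-proportion sample-size formula, treating the two runs as independent samples. That is an approximation: a paired design on the same cases usually needs fewer cases when per-case outcomes are positively correlated, so treat the result as a rough upper bound. The function name and default thresholds are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_needed(p_baseline, p_candidate, alpha=0.05, power=0.80):
    """Approximate eval cases per run needed to detect a pass-rate change.

    Two-sided two-proportion z-test, runs treated as independent samples.
    """
    if p_baseline == p_candidate:
        raise ValueError("effect size must be non-zero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(power)          # quantile for the desired power
    p_bar = (p_baseline + p_candidate) / 2
    pooled = sqrt(2 * p_bar * (1 - p_bar))
    unpooled = sqrt(
        p_baseline * (1 - p_baseline) + p_candidate * (1 - p_candidate)
    )
    n = ((z_alpha * pooled + z_power * unpooled) / (p_baseline - p_candidate)) ** 2
    return ceil(n)


# How many cases would it take to notice a drop from a 90% to an 85% pass rate?
print(cases_needed(0.90, 0.85))
```

If the number printed is far larger than the golden set, the eval set alone is unlikely to detect that effect, which is the situation where live-traffic comparison becomes worthwhile.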

Automate this as a pre-release gate. See Evaluation methods and Eval platform decision.
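
A pre-release gate can be as small as a script whose exit code fails the pipeline. The sketch below assumes an earlier step has produced a per-dimension report of the form `{dimension: {"delta": ..., "p_value": ...}}`. That shape, the function names, and the thresholds are all hypothetical, not the output format of any particular eval platform.

```python
def failing_dimensions(report, alpha=0.05, tolerance=0.0):
    """Dimensions whose score dropped by more than `tolerance` with p < alpha."""
    return sorted(
        dim
        for dim, result in report.items()
        if result["delta"] < -tolerance and result["p_value"] < alpha
    )


def run_gate(report, alpha=0.05, tolerance=0.0):
    """Print each regression and return a process exit code (0 = pass, 1 = block)."""
    failures = failing_dimensions(report, alpha, tolerance)
    for dim in failures:
        result = report[dim]
        print(
            f"REGRESSION {dim}: delta={result['delta']:+.3f}, "
            f"p={result['p_value']:.4f}"
        )
    return 1 if failures else 0


# Illustrative report with made-up numbers; in CI it would come from the eval run.
example_report = {
    "accuracy": {"delta": -0.06, "p_value": 0.003},
    "format_compliance": {"delta": 0.01, "p_value": 0.410},
    "groundedness": {"delta": -0.01, "p_value": 0.620},
}
exit_code = run_gate(example_report)  # a CI entry point would pass this to sys.exit
```

Only statistically significant drops block the release here. Improvements and non-significant changes pass, which matches the point that a change in scores is not automatically a regression.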

// WHAT INTERVIEWERS LOOK FOR

Eval set as the regression mechanism, not manual QA. Statistical significance testing, not raw score comparison. Shadow environment for model comparison. Both prompt and model version as drift sources.