How do you test for model or prompt-version drift when the underlying LLM changes?

Question

Accepted Answer

Re-run your golden eval set against the new model or prompt version and compare aggregate quality scores using a significance test. A statistically significant drop in any dimension flags a regression before the change reaches production.

Model drift is the change in output quality, style, or behaviour when the LLM provider updates the model or when you change your prompt template. Neither change is inherently a regression, but either can break your feature in unexpected ways.

Testing workflow for a model change:

1. Run the full golden eval set against the current model and record baseline scores.
2. Switch to the new model in a shadow environment.
3. Run the same eval set against the new model.
4. Compute the significance of the score difference per dimension (accuracy, format compliance, length, groundedness).
5. Review cases where the score changed most sharply: both regressions and unexpected improvements, which may indicate the rubric is incomplete.

For prompt changes: use the same process. Ev
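
The per-dimension significance check and the review of the sharpest score changes described above can be sketched roughly as follows. This is a minimal illustration, not a prescribed method: the answer does not name a specific test, so a paired sign-flip permutation test is used here as one reasonable choice. The function names (`paired_permutation_pvalue`, `compare_runs`, `largest_changes`), the score layout (one list of per-case scores per dimension, in the same case order for both runs), the 0.05 threshold, and the "higher is better" convention are all assumptions made for the example.

```python
import math
import random


def paired_permutation_pvalue(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-case scores."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("scores must be non-empty and paired, one entry per eval case")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = abs(math.fsum(diffs))
    rng = random.Random(seed)  # fixed seed so reruns give the same p-value
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis (no real change), each per-case
        # difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(math.fsum(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def compare_runs(baseline_scores, candidate_scores, alpha=0.05):
    """Compare two eval runs dimension by dimension.

    Both arguments map a dimension name to a list of per-case scores.
    Assumption: higher scores are better in every dimension.
    """
    report = {}
    for dim, base in baseline_scores.items():
        cand = candidate_scores[dim]
        p_value = paired_permutation_pvalue(base, cand)
        delta = math.fsum(cand) / len(cand) - math.fsum(base) / len(base)
        report[dim] = {
            "delta": delta,
            "p_value": p_value,
            # Flag only significant drops; significant gains still deserve review.
            "regression": delta < 0 and p_value < alpha,
        }
    return report


def largest_changes(baseline, candidate, k=5):
    """Indices of the k eval cases whose score moved the most, in either direction."""
    order = sorted(
        range(len(baseline)),
        key=lambda i: abs(candidate[i] - baseline[i]),
        reverse=True,
    )
    return order[:k]
```

Calling `compare_runs(baseline_scores, candidate_scores)` returns one entry per dimension with the mean score change, the estimated p-value, and a regression flag. `largest_changes` then picks out the individual cases worth reading by hand. A dimension such as length, where a higher score is not obviously better, would need its own direction rule in place of the `delta < 0` check.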

// WHAT INTERVIEWERS LOOK FOR