How do you validate the reliability of an LLM-as-judge setup?
Short answer
Calibrate the judge against a human-rated sample of 100–200 examples. Measure agreement using Cohen's kappa (for categorical labels) or Spearman correlation (for ordinal scores). A judge that agrees with humans at least 80% of the time is usable; below 70%, the judge is unreliable and will mislead your evaluation.
Detail
Using an LLM judge without calibrating it against humans gives you false measurement. The judge may have systematic biases, such as preferring verbose responses or being lenient on models from its own family, that make its scores look precise while being consistently wrong.
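One way to probe for a verbosity bias is to check whether the gap between judge and human scores grows with response length. The sketch below is an illustration, not a standard procedure: the function names, the 1–5 score scale, and the toy data are all assumptions, and Spearman correlation is implemented by hand so the example needs only the standard library.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: word counts and 1-5 scores for six responses.
response_lengths = [40, 55, 80, 120, 200, 310]
human_scores = [4, 3, 4, 3, 4, 3]
judge_scores = [3, 3, 4, 4, 5, 5]

# Positive gap means the judge scored higher than the humans did.
score_gap = [j - h for j, h in zip(judge_scores, human_scores)]
length_bias = spearman(response_lengths, score_gap)
print(f"Spearman(length, judge - human) = {length_bias:.2f}")
```

A strongly positive value on a real calibration set would suggest the judge rewards length beyond what humans do; a value near zero suggests no length effect on this sample.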
Calibration process:
- Assemble 100–200 diverse model outputs spanning the quality spectrum (clearly good, clearly bad, ambiguous).
- Have 2–3 human evaluators rate each output on your rubric independently.
- Compute inter-rater agreement among humans first — if humans disagree significantly, the rubric is underspecified.
- Have the judge rate the same set.
- Compute agreement between the judge and the human consensus.
- Identify where the judge systematically diverges (always lenient on length, always strict on format) and add rubric corrections.
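The agreement steps above can be sketched as follows, assuming a simple pass/fail rubric. Everything named here (`raw_agreement`, `cohens_kappa`, `majority_label`, the labels, and the toy ratings) is hypothetical; majority vote is only one possible way to form a human consensus, and Cohen's kappa is written out by hand to keep the example dependency-free.

```python
from collections import Counter


def raw_agreement(a, b):
    """Fraction of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters on categorical labels."""
    n = len(a)
    p_observed = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if each rater labelled at random with their own label frequencies.
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:
        # Kappa is undefined when both raters use one identical label throughout;
        # returning 1.0 is a convention of this sketch.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def majority_label(votes):
    """Most common label among the votes; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical ratings for eight outputs from three humans and the judge.
human_1 = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
human_2 = ["pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass"]
human_3 = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "fail", "pass", "pass", "pass", "pass"]

# Check the humans against each other first.
print(f"human 1 vs human 2 kappa: {cohens_kappa(human_1, human_2):.2f}")

# Then compare the judge with the human consensus.
consensus = [majority_label(votes) for votes in zip(human_1, human_2, human_3)]
print(f"judge vs consensus agreement: {raw_agreement(consensus, judge):.3f}")
print(f"judge vs consensus kappa: {cohens_kappa(consensus, judge):.2f}")
```

On this toy data the judge matches the consensus on 7 of 8 items, yet kappa comes out near 0.71, because kappa discounts the agreement two raters would reach by chance. It is worth deciding up front which of the two numbers your usability threshold refers to.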
Ongoing: re-calibrate when the judge model changes, when the rubric changes, or quarterly. Drift in judge reliability is a silent quality threat: you will think your product quality is stable while the judge's scoring has shifted. See Evaluating AI models.
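A recurring drift check could look like the sketch below: keep the human-labelled calibration set frozen, re-run the judge on it, and compare the new agreement with the figure recorded at calibration time. The function names, the `max_drop` tolerance of 0.05, and the toy labels are assumptions for illustration; the 0.80 floor is the usability threshold given in the short answer.

```python
def raw_agreement(human, judge):
    """Fraction of items where the judge matches the human consensus."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def check_drift(baseline, current, floor=0.80, max_drop=0.05):
    """Return a list of reasons to re-calibrate; an empty list means no alarm."""
    reasons = []
    if current < floor:
        reasons.append(f"agreement {current:.2f} is below the floor {floor:.2f}")
    if baseline - current > max_drop:
        reasons.append(f"agreement fell by {baseline - current:.2f} since the baseline")
    return reasons


# Hypothetical frozen calibration set: human consensus plus two judge runs.
human_consensus = ["pass", "fail", "pass", "pass", "fail",
                   "pass", "fail", "pass", "pass", "fail"]
judge_at_calibration = ["pass", "fail", "pass", "pass", "fail",
                        "pass", "fail", "pass", "pass", "pass"]
judge_today = ["pass", "pass", "pass", "pass", "pass",
               "pass", "fail", "pass", "pass", "pass"]

baseline = raw_agreement(human_consensus, judge_at_calibration)
current = raw_agreement(human_consensus, judge_today)
reasons = check_drift(baseline, current)

for reason in reasons:
    print("re-calibrate:", reason)
```

Checking both an absolute floor and a relative drop catches a judge that is still nominally usable but has shifted noticeably since it was last calibrated.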