How do you validate the reliability of an LLM-as-judge setup?

Testing AI systems · Senior · testing-ai-systems, llm-as-judge, calibration, inter-rater-agreement, evaluation

Short answer

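Calibrate the judge against a human-rated sample of 100–200 examples. Measure agreement using Cohen's kappa (for categorical labels) or Spearman correlation (for ordinal scores), alongside raw percent agreement. A judge that agrees with humans at least 80% of the time is usable; below 70%, the judge is unreliable and will mislead your evaluation. These thresholds refer to raw agreement; kappa corrects for chance and comes out lower at the same raw agreement, so report both.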

Detail

Using an LLM judge without calibrating it against humans gives a false sense of measurement. The judge may have systematic biases (preferring verbose responses, being lenient on outputs from models in its own family) that make its scores look precise while being consistently wrong.
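One way to make a verbosity bias visible, assuming you already have human and judge scores for the same outputs, is to compare the signed gap (judge minus human) for short and long responses. The sketch below is illustrative only: the record fields `human`, `judge`, and `length`, the function name `bias_report`, and the sample numbers are all hypothetical.

```python
from statistics import mean, median

def bias_report(records):
    """Summarize judge-minus-human score gaps, split by response length.

    records: list of dicts with hypothetical keys 'human' and 'judge'
    (numeric scores on the same scale) and 'length' (e.g. token count).
    """
    gaps = [r["judge"] - r["human"] for r in records]
    cutoff = median(r["length"] for r in records)
    short = [g for g, r in zip(gaps, records) if r["length"] <= cutoff]
    long_ = [g for g, r in zip(gaps, records) if r["length"] > cutoff]
    return {
        "mean_gap": mean(gaps),              # > 0 means the judge is lenient overall
        "mean_gap_short": mean(short),
        # No "long" group exists if every response has the same length.
        "mean_gap_long": mean(long_) if long_ else None,
    }

# Made-up sample: the judge matches humans on short outputs, inflates long ones.
records = [
    {"human": 4, "judge": 4, "length": 80},
    {"human": 2, "judge": 2, "length": 120},
    {"human": 3, "judge": 3, "length": 150},
    {"human": 3, "judge": 4, "length": 600},
    {"human": 2, "judge": 4, "length": 900},
    {"human": 4, "judge": 5, "length": 750},
]
report = bias_report(records)
print(report)
```

A gap near zero for short responses and clearly positive for long ones suggests the judge rewards length rather than quality. The same split can be applied to any other suspected factor, such as which model family produced the output.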

Calibration process:

  1. Assemble 100–200 diverse model outputs spanning the quality spectrum (clearly good, clearly bad, ambiguous).
  2. Have 2–3 human evaluators rate each output on your rubric independently.
  3. Compute inter-rater agreement among humans first — if humans disagree significantly, the rubric is underspecified.
  4. Have the judge rate the same set.
  5. Compute agreement between the judge and the human consensus.
  6. Identify where the judge systematically diverges (always lenient on length, always strict on format) and add rubric corrections.
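Steps 3 to 5 above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: it assumes categorical labels (`pass`/`fail`), takes a majority vote as the human consensus, and uses made-up ratings. The function names are hypothetical.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: both raters pick the same label independently.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

def majority_vote(ratings_per_item):
    """Human consensus per item; ties resolve to the label seen first."""
    return [Counter(r).most_common(1)[0][0] for r in ratings_per_item]

# Made-up data: three human raters per output, one judge label per output.
human_ratings = [
    ["pass", "pass", "pass"], ["pass", "fail", "pass"],
    ["fail", "fail", "fail"], ["fail", "fail", "pass"],
    ["pass", "pass", "pass"], ["fail", "fail", "fail"],
    ["pass", "pass", "fail"], ["fail", "fail", "fail"],
    ["pass", "pass", "pass"], ["fail", "pass", "fail"],
]
judge_labels = ["pass", "pass", "fail", "pass", "pass",
                "fail", "pass", "fail", "pass", "pass"]

consensus = majority_vote(human_ratings)
agreement = percent_agreement(consensus, judge_labels)
kappa = cohens_kappa(consensus, judge_labels)
print(f"raw agreement: {agreement:.2f}, kappa: {kappa:.2f}")
# prints "raw agreement: 0.80, kappa: 0.60"
```

In this toy sample, 80% raw agreement corresponds to a kappa of 0.60, which is why it helps to state which metric a threshold refers to. The same `cohens_kappa` function can be applied to pairs of human raters for step 3. For ordinal scores, a rank correlation such as Spearman would replace kappa.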

Ongoing: re-calibrate when the judge model changes, when the rubric changes, or quarterly. A drift in judge reliability is a silent quality threat — you will think your product quality is stable while the judge's scoring has shifted. See Evaluating AI models.
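A re-calibration run can be reduced to a simple check against a frozen, human-labeled calibration set. The sketch below assumes such a set exists and that you recorded the judge's agreement at the last calibration. The 0.80 and 0.70 cut-offs come from the thresholds above; the `max_drop` tolerance of 0.05, the status names, and the function name are hypothetical choices for illustration.

```python
def check_judge_drift(consensus, judge_labels, baseline_agreement, max_drop=0.05):
    """Compare current judge agreement with the last calibration run.

    consensus: human consensus labels for the frozen calibration set.
    judge_labels: fresh judge labels for the same items.
    baseline_agreement: raw agreement recorded at the previous calibration.
    max_drop: assumed tolerance for an acceptable decrease (hypothetical).
    """
    if len(consensus) != len(judge_labels) or not consensus:
        raise ValueError("label lists must be non-empty and of equal length")
    agreement = sum(h == j for h, j in zip(consensus, judge_labels)) / len(consensus)
    if agreement < 0.70:
        status = "unreliable"   # below the 70% floor
    elif baseline_agreement - agreement > max_drop:
        status = "drifted"      # still above the floor, but fell since last run
    elif agreement < 0.80:
        status = "borderline"   # between the two thresholds
    else:
        status = "ok"
    return agreement, status

# Made-up example: the judge used to agree on 9 of 10 items, now on 7 of 10.
consensus = ["pass"] * 5 + ["fail"] * 5
judge_now = ["pass"] * 5 + ["fail", "fail", "pass", "pass", "pass"]
drift_result = check_judge_drift(consensus, judge_now, baseline_agreement=0.9)
print(drift_result)
```

Running this on a schedule, and on every judge model or rubric change, turns silent drift into an explicit signal you can alert on.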

// WHAT INTERVIEWERS LOOK FOR

Calibration against humans as the mandatory step. Specific agreement thresholds (80% usable, below 70% unreliable). Systematic bias identification. Quarterly re-calibration.