How do you validate the reliability of an LLM-as-judge setup?

Question

Accepted Answer

Calibrate the judge against a human-rated sample of 100–200 examples. Measure agreement using Cohen's kappa (for categorical labels) or Spearman correlation (for ordinal or numeric scores). As a rule of thumb, a judge that agrees with humans at least 80% of the time is usable; below 70%, the judge is unreliable and will mislead your evaluation. These thresholds are stated as raw percent agreement, which is not the same quantity as kappa: kappa corrects for agreement expected by chance, so it is generally lower on the same data.

Using an LLM judge without calibrating it against humans is false measurement. The judge may have systematic biases, such as preferring verbose responses or being lenient toward models from its own family, that make its scores look precise while being consistently wrong.

Calibration process:

1. Assemble 100–200 diverse model outputs spanning the quality spectrum (clearly good, clearly bad, ambiguous).
2. Have 2–3 human evaluators rate each output on your rubric independently.
3. Compute inter-rater agreement among humans first. If humans disagree significantly, the rubric is underspecified.
4. Have the judge rate the same set.
5. Compute agreement between the judge and the human consensus.
6. Identify where the judge systematically diverges (always
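
The agreement step of the calibration process can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the ratings are made-up toy data on a 1–5 scale, the human consensus is taken as the lower median of three raters (one possible choice; the answer does not specify how consensus is formed), and the helper names `cohens_kappa`, `raw_agreement`, `consensus`, and `mean_signed_diff` are hypothetical.

```python
from collections import Counter
from statistics import mean, median_low


def raw_agreement(a, b):
    """Fraction of items on which two raters gave the identical label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick label k independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1 - expected)


def consensus(ratings_per_item):
    """Assumed consensus rule: lower median of each item's human ratings."""
    return [median_low(r) for r in ratings_per_item]


def mean_signed_diff(judge, humans):
    """Positive values mean the judge scores higher than humans on average."""
    return mean(j - h for j, h in zip(judge, humans))


# Toy data (hypothetical): six outputs, three human raters each, 1-5 scale.
human_ratings = [[5, 4, 5], [1, 2, 1], [3, 3, 4], [2, 2, 2], [4, 5, 4], [1, 1, 2]]
judge_ratings = [5, 1, 4, 2, 4, 2]

human_consensus = consensus(human_ratings)

print("raw agreement:", round(raw_agreement(human_consensus, judge_ratings), 3))
print("Cohen's kappa:", round(cohens_kappa(human_consensus, judge_ratings), 3))
print("judge bias:   ", round(mean_signed_diff(judge_ratings, human_consensus), 3))
```

The same `cohens_kappa` function can be applied to pairs of human raters to check inter-rater agreement before the judge is involved. A consistently positive or negative `mean_signed_diff` is one simple signal of systematic divergence; breaking it down by response length or by source model would be a natural next step for the biases mentioned above.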

// WHAT INTERVIEWERS LOOK FOR