
LLM Evaluation

You can't assert output === expected on an LLM: with sampling enabled, the same prompt can produce different wording on every run. Evaluation means scoring outputs along defined dimensions, using methods that tolerate non-determinism. This sheet is the quick reference; for the wider testing approach see Testing AI Systems and AI in Testing (linked below).

Dimensions to score

| Dimension | Asks | Fail signal |
| --- | --- | --- |
| Accuracy / correctness | Is the answer factually right? | Wrong facts |
| Groundedness / faithfulness | Supported by the provided context? | Unsupported claims (hallucination) |
| Relevance | Does it answer the actual question? | Off-topic, padding |
| Completeness | Covers what was asked? | Missing key parts |
| Safety | Free of harmful/biased/PII output? | Toxic, leaks, bias |
| Format | Obeys the required structure/JSON? | Schema/format break |
| Consistency | Stable across reruns? | Wild variation |
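
As a rough illustration of the Consistency row, the sketch below measures how often reruns of the same prompt agree. It assumes short, comparable outputs such as labels or extracted values; `normalize` and `consistency` are hypothetical helper names, and free-form text would need a similarity measure rather than string equality.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def consistency(outputs: list[str]) -> float:
    """Fraction of reruns that match the most common normalized output."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(normalize(o) for o in outputs)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(outputs)


# Four reruns of one prompt; three agree after normalization.
reruns = ["Paris", "paris", " Paris ", "Lyon"]
score = consistency(reruns)
```

A low score on a prompt that should have one right answer is the "wild variation" fail signal; what counts as low is a project decision.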

Evaluation methods

| Method | How | Best for |
| --- | --- | --- |
| Reference-based | Compare to a gold answer (exact, embedding similarity) | Tasks with known answers |
| LLM-as-judge | A model scores the output against a rubric | Open-ended quality at scale |
| Rule / assertion | Regex, schema, must-contain/avoid | Format, safety keywords |
| Human review | People rate on a rubric | Ground truth, calibration |
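
Two of these methods can be sketched with the standard library alone. The example below is illustrative, not a recommended implementation: `token_f1` is a crude token-overlap stand-in for the embedding similarity named in the table, and `rule_check` combines a JSON structure check with a must-avoid regex. The function names, the required keys, and the simplified email pattern are all assumptions made for the example.

```python
import json
import re
from collections import Counter


def token_f1(output: str, gold: str) -> float:
    """Reference-based: token-overlap F1 between an output and a gold answer."""
    out_tokens = re.findall(r"\w+", output.lower())
    gold_tokens = re.findall(r"\w+", gold.lower())
    if not out_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(out_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Simplified pattern for illustration; real PII detection needs more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def rule_check(output: str, required_keys: set[str]) -> list[str]:
    """Rule / assertion: return violations; an empty list means the output passed."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    failures = []
    missing = required_keys - set(data)
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    if EMAIL.search(output):
        failures.append("contains an email address")
    return failures
```

Token overlap ignores meaning, so a correct paraphrase can score low; that gap is the reason the table lists embedding similarity and LLM-as-judge as alternatives.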

Making it reliable

  • Build a golden / evaluation dataset of inputs + expected behaviour.
  • Score over many samples; report pass rate, not one run.
  • Set a threshold (e.g. ≥90% faithful) and gate CI on it.
  • Calibrate LLM-as-judge against human labels; it's not infallible.
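
The points above can be sketched as a small harness. This is a minimal sketch under stated assumptions: `generate` stands for whatever function calls your model, `check` for whichever scoring method you chose, and the golden-case fields (`input`, `must_contain`) are hypothetical. The 0.90 threshold mirrors the example figure above and is not a recommendation.

```python
from typing import Callable


def pass_rate(
    cases: list[dict],
    generate: Callable[[str], str],
    check: Callable[[str, dict], bool],
    samples_per_case: int = 5,
) -> float:
    """Run every golden case several times; return the fraction of passing samples."""
    if not cases or samples_per_case < 1:
        raise ValueError("need at least one case and one sample per case")
    results = [
        check(generate(case["input"]), case)
        for case in cases
        for _ in range(samples_per_case)
    ]
    return sum(results) / len(results)


def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of items where the LLM judge and a human reviewer gave the same verdict."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def gate(rate: float, threshold: float = 0.90) -> None:
    """Raise when the pass rate is below threshold, so a CI step fails."""
    if rate < threshold:
        raise AssertionError(f"pass rate {rate:.1%} is below threshold {threshold:.0%}")


# Hypothetical golden dataset: inputs plus expected behaviour, not expected wording.
golden = [
    {"input": "What is the capital of France?", "must_contain": "paris"},
    {"input": "What is the capital of Italy?", "must_contain": "rome"},
]


def contains_check(output: str, case: dict) -> bool:
    """Example check: the output must mention the expected term."""
    return case["must_contain"] in output.lower()
```

In CI, calling `gate(pass_rate(golden, generate, contains_check))` would fail the job on a regression. Raw agreement is the simplest calibration number; it can look inflated when one label dominates, so a chance-corrected statistic may be worth adding.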

Common mistakes

  • Exact-match assertions on free-form text.
  • Judging one sample and calling it passed (ignores non-determinism).
  • LLM-as-judge with a vague rubric → noisy scores.
  • No golden dataset, so "better" is a vibe, not a number.
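
To make the vague-rubric mistake concrete, here is one way a tighter judge could look. It is a sketch, not a reference implementation: the rubric wording, the binary score, and the `call_model` parameter (any function that sends a prompt to a model and returns its text reply) are assumptions for illustration. No specific provider API is implied.

```python
import json
from typing import Callable

# An explicit rubric: one dimension, defined pass/fail conditions, fixed reply format.
RUBRIC = """You are grading an answer for faithfulness to the provided context.
Score 1 if every claim in the answer is supported by the context.
Score 0 if any claim is unsupported by or contradicts the context.
Reply with JSON only: {"score": 0 or 1, "reason": "<one sentence>"}"""


def judge_faithfulness(
    context: str, answer: str, call_model: Callable[[str], str]
) -> dict:
    """Ask a judge model for a verdict and reject replies that break the format."""
    prompt = f"{RUBRIC}\n\nContext:\n{context}\n\nAnswer:\n{answer}"
    raw = call_model(prompt)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply was not JSON: {raw!r}") from exc
    if not isinstance(verdict, dict) or verdict.get("score") not in (0, 1):
        raise ValueError(f"judge reply has no valid score: {raw!r}")
    return verdict
```

Rejecting malformed replies instead of guessing a score keeps judge failures visible rather than folding them into the pass rate.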