Reference · Advanced · 5-7 min reference

LLM Evaluation

You can't assert output === expected on an LLM: the same prompt can produce different wording on every run. Evaluation means scoring outputs along several dimensions, using methods that tolerate non-determinism. This sheet is the quick reference; for the wider testing approach, see Testing AI Systems and AI in Testing (linked below).

Dimensions to score

Dimension	Asks	Fail signal
Accuracy / correctness	Is the answer factually right?	Wrong facts
Groundedness / faithfulness	Supported by the provided context?	Unsupported claims (hallucination)
Relevance	Does it answer the actual question?	Off-topic, padding
Completeness	Covers what was asked?	Missing key parts
Safety	Free of harmful/biased/PII output?	Toxic, leaks, bias
Format	Obeys the required structure/JSON?	Schema/format break
Consistency	Stable across reruns?	Wild variation
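
Consistency is the one dimension that can only be measured by rerunning the same prompt. A minimal sketch, assuming a hypothetical generate(prompt) callable that wraps your model. It uses difflib.SequenceMatcher from the Python standard library as a crude surface-similarity measure; an embedding similarity would likely handle paraphrases better.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable


def consistency_score(generate: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Mean pairwise similarity (0.0 to 1.0) across repeated runs of one prompt."""
    if runs < 2:
        raise ValueError("need at least 2 runs to compare outputs")
    # `generate` is a hypothetical stand-in for your own model call.
    outputs = [generate(prompt) for _ in range(runs)]
    ratios = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(outputs, 2)
    ]
    return sum(ratios) / len(ratios)
```

A score near 1.0 means the reruns are almost identical in wording; a low score is the "wild variation" fail signal. Where to draw the line depends on the task, since creative prompts are expected to vary more than factual ones.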

Evaluation methods

Method	How	Best for
Reference-based	Compare to a gold answer (exact match, embedding similarity)	Tasks with known answers
LLM-as-judge	A model scores the output against a rubric	Open-ended quality at scale
Rule / assertion	Regex, schema, must-contain/avoid	Format, safety keywords
Human review	People rate on a rubric	Ground truth, calibration
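
Rule / assertion checks are usually the cheapest method to automate. A minimal sketch using only the Python standard library; the required keys and the banned patterns below are illustrative assumptions, not a recommended safety list.

```python
import json
import re

# Illustrative assumptions: adapt both to your own output contract.
REQUIRED_KEYS = {"answer", "sources"}
BANNED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN-shaped number
    re.compile(r"as an ai language model", re.I),  # boilerplate to avoid
]


def check_output(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    failures = []
    missing = REQUIRED_KEYS - set(data)
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    for pattern in BANNED_PATTERNS:
        if pattern.search(raw):
            failures.append(f"banned pattern: {pattern.pattern}")
    return failures
```

Returning the list of violations, rather than a bare boolean, keeps the failure reason available for the report. Checks like these cover Format and keyword-level Safety only; they say nothing about whether the answer is correct.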

Making it reliable

Build a golden / evaluation dataset of inputs + expected behaviour.
Score over many samples; report pass rate, not one run.
Set a threshold (e.g. ≥90% faithful) and gate CI on it.
Calibrate LLM-as-judge against human labels; it's not infallible.
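
The four points above can be sketched as a small gate. This assumes a hypothetical generate(prompt) callable for the model and a per-case check(output) predicate (a rule, a similarity cut-off, or a judge verdict). The names EvalCase, pass_rate, gate, and judge_agreement are made up for illustration, and the 0.90 default mirrors the example threshold above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # True if the output is acceptable


def pass_rate(cases: list[EvalCase], generate: Callable[[str], str], samples: int = 5) -> float:
    """Fraction of passing outputs over every case and every sample."""
    if not cases or samples < 1:
        raise ValueError("need at least one case and one sample")
    results = [
        bool(case.check(generate(case.prompt)))
        for case in cases
        for _ in range(samples)
    ]
    return sum(results) / len(results)


def gate(rate: float, threshold: float = 0.90) -> None:
    """Exit with an error so a CI job fails when the pass rate is under the threshold."""
    if rate < threshold:
        raise SystemExit(f"eval gate failed: {rate:.1%} < {threshold:.0%}")


def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of items where the LLM judge and the human reviewer agree."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Raw agreement is the simplest calibration number. When one label dominates, a chance-corrected statistic such as Cohen's kappa is a common next step.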

Common mistakes

Exact-match assertions on free-form text.
Judging one sample and calling it passed (ignores non-determinism).
LLM-as-judge with a vague rubric → noisy scores.
No golden dataset, so "better" is a vibe, not a number.
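
The first mistake has a direct fix: swap exact match for a tolerant reference-based score. A minimal sketch of token-overlap F1 against a gold answer, using only the standard library; the tokenizer here is a deliberately simple assumption, and because the score ignores word order it is best paired with other checks.

```python
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(output: str, gold: str) -> float:
    """Token-overlap F1 between a model output and a gold answer (0.0 to 1.0)."""
    out, ref = Counter(_tokens(output)), Counter(_tokens(gold))
    overlap = sum((out & ref).values())  # tokens shared by both, with counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


gold = "Paris is the capital of France."
output = "The capital of France is Paris."
print(output == gold)          # exact match rejects a correctly worded answer
print(token_f1(output, gold))  # overlap score still credits the shared tokens
```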

Related resources

Glossary

Related cheat sheets