Reference · Advanced · 5-7 min reference
LLM Evaluation
You can't assert output === expected on an LLM: the same prompt can produce different wording on every run. Evaluation means scoring outputs on defined dimensions, using methods that tolerate non-determinism. This sheet is the quick reference; for the wider testing approach see Testing AI Systems and AI in Testing (linked below).
Dimensions to score
| Dimension | Asks | Fail signal |
|---|---|---|
| Accuracy / correctness | Is the answer factually right? | Wrong facts |
| Groundedness / faithfulness | Supported by the provided context? | Unsupported claims (hallucination) |
| Relevance | Does it answer the actual question? | Off-topic, padding |
| Completeness | Covers what was asked? | Missing key parts |
| Safety | Free of harmful/biased/PII output? | Toxic, leaks, bias |
| Format | Obeys the required structure/JSON? | Schema/format break |
| Consistency | Stable across reruns? | Wild variation |
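The consistency dimension can be turned into a number by rerunning one prompt and measuring how often the outputs agree. The sketch below is one possible measure, not a standard metric: the names `normalize` and `consistency` are made up for illustration, and comparing normalized strings only suits short answers. Longer free-form outputs would need a similarity score or a judge instead.

```python
from collections import Counter
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def consistency(outputs: list[str]) -> float:
    """Fraction of reruns that match the most common normalized output."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(normalize(o) for o in outputs)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(outputs)


# Hypothetical outputs from four reruns of the same prompt.
runs = ["Paris.", "paris", "Paris", "Lyon"]
score = consistency(runs)  # three of the four runs agree after normalization
```

A low score here is the "wild variation" fail signal from the table; what counts as low is a per-task judgment.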
Evaluation methods
| Method | How | Best for |
|---|---|---|
| Reference-based | Compare to a gold answer (exact match, embedding similarity) | Tasks with known answers |
| LLM-as-judge | A model scores the output against a rubric | Open-ended quality at scale |
| Rule / assertion | Regex, schema, must-contain/avoid | Format, safety keywords |
| Human review | People rate on a rubric | Ground truth, calibration |
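As an illustration of the rule / assertion row, the sketch below checks a format rule (valid JSON with required keys) and a must-avoid rule (no email address in the answer). The required keys, the email pattern, and the `check_output` name are assumptions chosen for the example; real rules depend on the task and the output schema.

```python
import json
import re

# Hypothetical rules for illustration only.
REQUIRED_KEYS = {"answer", "sources"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # rough pattern, not a full validator


def check_output(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures = []
    missing = REQUIRED_KEYS - set(data)
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    if EMAIL.search(str(data.get("answer", ""))):
        failures.append("answer contains an email address")
    return failures
```

Checks like this are deterministic and cheap, so they can run on every sampled output. They say nothing about whether the answer is correct or grounded.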
Making it reliable
- Build a golden / evaluation dataset of inputs + expected behaviour.
- Score over many samples; report pass rate, not one run.
- Set a threshold (e.g. ≥90% faithful) and gate CI on it.
- Calibrate LLM-as-judge against human labels; it's not infallible.
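The "score over many samples, then gate on a threshold" steps above can be sketched as follows. This is a minimal sketch under stated assumptions: `pass_rate` and `gate` are hypothetical helper names, the sample outputs are invented, and the substring check stands in for whatever scorer is actually used (a rule, a similarity score, or a judge verdict).

```python
from typing import Callable


def pass_rate(outputs: list[str], passes: Callable[[str], bool]) -> float:
    """Fraction of sampled outputs that satisfy the check."""
    if not outputs:
        raise ValueError("no outputs to score")
    return sum(passes(o) for o in outputs) / len(outputs)


def gate(rate: float, threshold: float = 0.90) -> None:
    """Raise if the pass rate is below the threshold, which fails the CI job."""
    if rate < threshold:
        raise AssertionError(f"pass rate {rate:.0%} is below {threshold:.0%}")


# Hypothetical outputs sampled for one golden-dataset input.
samples = [
    "Refunds are accepted within 30 days.",
    "You have 30 days to request a refund.",
    "Refunds take 14 days.",
    "The refund window is 30 days.",
]
rate = pass_rate(samples, lambda o: "30 days" in o)  # stand-in check
```

Calling `gate(rate)` in a test or CI step raises when the rate is under the threshold. With only a handful of samples the rate is coarse, so a larger sample gives a steadier number.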
Common mistakes
- Exact-match assertions on free-form text.
- Judging one sample and calling it passed (ignores non-determinism).
- LLM-as-judge with a vague rubric → noisy scores.
- No golden dataset, so "better" is a vibe, not a number.
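To make the first mistake concrete, the sketch below compares an exact-match check with a tolerant reference-based score on the same pair of strings. Token-overlap F1 is used here as a simple stand-in for the embedding similarity mentioned earlier, because it needs no model; the function names and example strings are illustrative assumptions.

```python
from collections import Counter
import re


def tokens(text: str) -> list[str]:
    """Lowercased word tokens, punctuation ignored."""
    return re.findall(r"\w+", text.lower())


def token_f1(output: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    out, ref = Counter(tokens(output)), Counter(tokens(reference))
    overlap = sum((out & ref).values())  # tokens shared by both, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The capital of France is Paris."
output = "Paris is the capital of France."

exact = output == reference          # fails: the wording differs
score = token_f1(output, reference)  # tolerant: every token is shared
```

Token overlap ignores word order and meaning, so a negated or scrambled answer can still score high. It is a cheap first filter, not a replacement for a judge or human review.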
// Related resources