On this page10 sections
ConceptsAdvanced8-10 min reference

Testing AI Systems

Testing an LLM or AI feature is not like testing deterministic software: the same input can produce different outputs, "correct" is a spectrum, and a passing example proves little. This sheet covers how to evaluate AI systems on their own terms. It's the inverse of AI in Testing (using AI to test) — here the AI is the system under test.

Why testing AI is different

  • Non-determinism — the same prompt can yield different outputs (non-determinism); a single run tells you almost nothing. Evaluate over a dataset and look at rates.
  • No single right answer — quality is graded (relevant/faithful/safe), not pass/fail. You score outputs, not assert equality.
  • Hallucination — models state false things confidently (hallucination); plausibility ≠ correctness.
  • Prompt sensitivity — small wording changes shift behaviour; guard against prompt regression when you edit prompts.
  • New attack surfaceprompt injection and jailbreaks are security issues unique to LLM apps.

What to test

DimensionQuestion
Quality / correctnessAre answers relevant, accurate, complete?
Faithfulness (RAG)Is the answer grounded in the retrieved context, not invented?
SafetyDoes it refuse harmful requests and resist jailbreaks?
RobustnessDoes it hold up across paraphrases, edge cases, adversarial input?
ConsistencySimilar inputs → similar quality, run to run?
Cost / latencyTokens and response time within budget?

Evaluation methods

MethodHowWhen
Reference-basedCompare to a known-good answer (exact, similarity, metrics)You have ground-truth labels
LLM-as-judgeA model scores outputs against a rubricScaling subjective quality without humans
Human evaluationPeople rate outputsGround truth, calibration, high-stakes
Assertion / rule-basedRegex, schema, must-contain/​must-notFormat, safety keywords, structure

Build a labelled eval dataset of representative inputs (incl. edge and adversarial cases), run the system over it, and score with one or more methods. The dataset is your test suite; grow it from real failures.

Testing RAG systems

Retrieval-augmented generation has two failure points — retrieval and generation — so test both:

  • Retrieval: are the right documents fetched? (context precision/recall)
  • Faithfulness: is the answer supported by the retrieved context, or invented?
  • Answer relevance: does it actually address the question?

Tools like Ragas and DeepEval provide these RAG metrics out of the box.

Testing agents

Agents (multi-step, tool-using) add behavioural testing on top of output testing:

  • Does it pick the right tool and call it with valid arguments?
  • Does it recover from a tool error or bad result?
  • Does it terminate (no infinite loops) and stay within step/cost budgets?
  • Trace the full run — agent bugs hide in the steps, not just the final answer.

Safety and red-teaming

  • Red-team the system with adversarial prompts: jailbreaks, prompt injection, data-exfiltration attempts, harmful-content requests.
  • Assert it refuses what it should and doesn't leak its system prompt or tools.
  • Treat safety testing as a first-class, regression-guarded suite — not a one-off.

Observability for LLM apps

Production LLM behaviour drifts and surprises, so trace it: capture prompts, completions, tokens, latency, tool calls and user feedback. Agent observability tools (Langfuse, LangSmith, Phoenix, Laminar) let you debug real failures and mine production traffic for new eval cases.

Evals in CI

Treat evals like tests: run the eval dataset on every prompt/model/code change and gate on score thresholds and regression (don't let a prompt edit silently drop faithfulness). Because runs vary, gate on aggregate scores over the dataset, not a single output. Pin model versions so a provider-side model update doesn't quietly change results.

The tool landscape

NeedTools
LLM eval frameworksDeepEval, Ragas, PromptFoo, OpenAI Evals, TruLens, Giskard
Eval + observability platformsLangSmith, Langfuse, Arize Phoenix, Laminar, Braintrust
App frameworksLangChain, LlamaIndex
ML lifecycle / experiment trackingMLflow, Weights & Biases, Great Expectations

Quick AI-testing checklist

  • A labelled eval dataset of representative + edge + adversarial inputs
  • Evaluation method chosen per dimension (reference / LLM-judge / human / rules)
  • Scored over the dataset as rates, not judged from single runs
  • RAG: retrieval, faithfulness and answer-relevance tested separately
  • Agents: tool choice, error recovery, termination and traces checked
  • Safety/red-team suite for jailbreaks and prompt injection
  • Tracing/observability capturing prompts, outputs, tokens, feedback
  • Evals run in CI, gating on thresholds + regression
  • Model versions pinned so provider updates don't silently change results
  • Eval set grows from real production failures