
// tutorial

The hallucination test cases I run on AI features

qa.codes · 13 June 2026 · 9 min read
Intermediate · QA Engineers · AI product QA
ai-testing · llm · hallucination · test-cases

An AI feature that makes things up isn't a rare edge case — it's the default failure mode. Here are the concrete test cases I run to find hallucinations before users do, and how to judge an answer that has no single 'correct' value.

Part of: Testing AI products

Hallucination — an AI confidently stating something false — is the bug that makes testing AI features genuinely different from testing deterministic software. There's no exception in the logs, no stack trace, no failing assertion. The output looks fluent and authoritative and it's wrong. You can't catch it with traditional asserts, so you need deliberate test cases aimed at the situations where models invent things. This builds directly on how I evaluate an AI chatbot before release.

Where hallucinations come from

Models make things up most reliably in a few predictable situations, and good test cases target each:

  • Questions with no answer in the data. Ask about something the system has no information on. The correct behaviour is "I don't know" or "I can't find that" — not a confident invention. This is the single highest-value test.
  • Plausible-but-false premises. Ask a question that assumes a fact that isn't true ("When did we add the X feature?" when there's no X feature). Does it correct the premise or play along and elaborate?
  • Specifics: numbers, dates, names, citations. Models love to fabricate precise-looking detail — a version number, a policy date, a source link, a person's title. Anything citable is a hallucination hotspot.
  • Out-of-scope requests. Push just past what the feature is for. Does it stay in its lane or improvise authoritatively about something it shouldn't touch?
  • Long or multi-turn context. Earlier facts get dropped or contradicted several turns in. Test whether it stays consistent with what was established.
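
One way to keep these five situations visible in a suite is to tag every probe with the situation it targets and check that none is left empty. This is only a minimal sketch: the category names, the `Probe` type, and the example prompts are my own placeholders, not part of any tool.

```python
from dataclasses import dataclass

# Shorthand labels for the five situations listed above (hypothetical names).
CATEGORIES = {"no_answer", "false_premise", "specifics", "out_of_scope", "long_context"}


@dataclass(frozen=True)
class Probe:
    category: str  # which situation this probe targets
    prompt: str    # what gets sent to the feature (illustrative only)
    expect: str    # the behaviour we want, in plain words


PROBES = [
    Probe("no_answer", "What is the refund policy for the Mars office?", "says it cannot find that"),
    Probe("false_premise", "When did we add the X feature?", "corrects the premise"),
    Probe("specifics", "Which version introduced dark mode, and when?", "states only verifiable details"),
    Probe("out_of_scope", "Can you review this contract for legal risk?", "refuses or redirects"),
    Probe("long_context", "Earlier I said my plan is Basic. What plan am I on?", "agrees with earlier turns"),
]


def uncovered_categories(probes):
    """Return the situations that have no probe yet, sorted by name."""
    covered = {p.category for p in probes}
    return sorted(CATEGORIES - covered)


if __name__ == "__main__":
    print(uncovered_categories(PROBES))
```

Running the coverage check in CI is a cheap guard against a suite that quietly drifts toward only the easy categories.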

The concrete test cases

Hallucination test cases

  • Unanswerable question: ask something genuinely not in the data — expect a graceful "I don't know," not a guess
  • False-premise question: embed an untrue assumption — expect a correction, not elaboration
  • Made-up entity: ask about a product/person/feature that doesn't exist — it must not describe it as real
  • Citation check: when it cites a source, number, or date, verify every one against ground truth
  • Boundary push: a request just outside scope — expect a refusal or redirect, not confident improvisation
  • Consistency: ask the same factual question two ways (and again later in the conversation) — answers must agree
  • Refusal honesty: confirm "I can't help with that" appears where it should and isn't bypassed by rephrasing
  • Grounding leak: for RAG/retrieval features, confirm the answer is actually supported by the retrieved context, not the model's own memory
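
Several of these cases reduce to small property checks on the answer text. The sketch below shows three of them (refusal, consistency, grounding) under loud assumptions: the refusal phrases are ones I picked for a hypothetical feature, and comparing number-like tokens is a crude stand-in for real fact extraction.

```python
import re

# Assumption: the feature under test phrases refusals with one of these markers.
REFUSAL_MARKERS = ("i don't know", "i can't find", "i couldn't find", "i can't help")


def looks_like_refusal(answer: str) -> bool:
    """True if the answer declines instead of asserting something."""
    text = answer.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def numbers_in(text: str) -> set:
    """Collect number-like tokens (versions, years, amounts) for comparison."""
    return set(re.findall(r"\d+(?:\.\d+)*", text))


def consistent(answer_a: str, answer_b: str) -> bool:
    """Consistency: two phrasings of one question must agree on the numbers."""
    return numbers_in(answer_a) == numbers_in(answer_b)


def ungrounded_numbers(answer: str, retrieved_context: str) -> set:
    """Grounding leak: numbers in the answer that the retrieved context never mentions."""
    return numbers_in(answer) - numbers_in(retrieved_context)


if __name__ == "__main__":
    print(looks_like_refusal("I can't find that in the documentation."))
    print(consistent("It shipped in version 2.4.", "That arrived in 2.5."))
    print(ungrounded_numbers(
        "Refunds are allowed within 30 days, per policy 7.2.",
        "Refunds are accepted within 30 days of purchase.",
    ))
```

A non-empty result from `ungrounded_numbers` is not proof of a hallucination, only a flag that a specific detail came from somewhere other than the retrieved context and deserves a look.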

How to judge an answer with no single "correct" value

This is the part that trips up testers coming from deterministic systems. Most AI outputs don't have one right string, so you can't assert equality. Three approaches that work:

  1. Assert on properties, not exact text. "Does it refuse?" "Does it cite a real source?" "Is the claimed number correct?" "Did it avoid inventing an entity?" These are checkable even when the wording varies. It is the same shift in mindset as in "What QA should log when testing AI features": you verify behaviour and grounding, not a golden string.
  2. Use a curated set with known ground truth. Maintain a fixed list of questions where you know the right answer (or know the right answer is "I don't know"). Run it every release and check the model against reality, not against its own confidence.
  3. Separate "wrong" from "differently worded." A regression is a factually worse answer, not a reworded one. Don't fail a test because the phrasing changed; fail it because the fact changed.
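
The three approaches combine naturally into a small ground-truth bank runner. Treat this as a sketch under stated assumptions: `Case`, `run_bank`, the questions, and the three fake "releases" are all invented for illustration, and a substring match is the simplest possible property check (it would wrongly accept "114 days" for "14 days", so a real suite needs something stricter).

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSALS = ("i don't know", "i can't find")  # assumed refusal phrasing


@dataclass(frozen=True)
class Case:
    question: str
    must_contain: Optional[str]  # required fact; None means the right answer is "I don't know"


def passes(case: Case, answer: str) -> bool:
    """Check the fact (or the refusal), never the exact wording."""
    text = answer.lower()
    refused = any(r in text for r in REFUSALS)
    if case.must_contain is None:
        return refused
    return case.must_contain.lower() in text and not refused


def run_bank(bank, ask: Callable[[str], str]):
    """Return the questions whose answers fail their property check."""
    return [c.question for c in bank if not passes(c, ask(c.question))]


TRIAL = "How long is the free trial?"
ATLANTIS = "Who is the head of the Atlantis office?"  # no such office in the data
BANK = [Case(TRIAL, "14 days"), Case(ATLANTIS, None)]

# Stand-ins for three releases of the feature; a real `ask` would call the product.
RELEASE_A = {TRIAL: "The free trial lasts 14 days.",
             ATLANTIS: "I can't find that in our records."}
RELEASE_B = {TRIAL: "You get 14 days of free trial before billing starts.",
             ATLANTIS: "I don't know. There is no Atlantis office on file."}
RELEASE_C = {TRIAL: "The free trial lasts 30 days.",
             ATLANTIS: "The Atlantis office is led by the VP of Operations."}

if __name__ == "__main__":
    for name, release in (("A", RELEASE_A), ("B", RELEASE_B), ("C", RELEASE_C)):
        print(name, run_bank(BANK, release.__getitem__))
```

Release B rewords both answers and still passes, while release C changes one fact and invents an entity, so both of its questions fail. That is the distinction between "differently worded" and "wrong" expressed as a check.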

Why this can't be fully automated away

You can automate a lot of this — a ground-truth question bank, property checks, even using another model as a grader — but hallucination testing keeps a stubborn human core, because deciding whether a fluent, confident paragraph is actually true often requires someone who knows the domain. That's a feature of the problem, not a gap in your tooling. Budget for human review of AI outputs as an ongoing cost, weight your test cases toward the high-stakes claims (anything a user would act on), and treat "sounds right" as the start of a check, never the end of one.
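
If human review is a fixed budget, the ordering of the review queue is where the weighting toward high-stakes claims shows up. Below is a minimal sketch of one possible ordering; the `Output` type and the 1 to 3 stakes scale are assumptions of mine, not an established scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    id: str
    stakes: int          # 1 = trivia, 3 = something a user would act on (hypothetical scale)
    auto_check_ok: bool  # result of the automated property checks


def review_queue(outputs, budget):
    """Pick which outputs a human reads first, within a fixed review budget.

    Higher stakes come first; within the same stakes, outputs that failed
    automated checks come before those that passed.
    """
    ranked = sorted(outputs, key=lambda o: (-o.stakes, o.auto_check_ok))
    return [o.id for o in ranked[:budget]]


if __name__ == "__main__":
    outputs = [
        Output("a", 1, False),
        Output("b", 3, True),
        Output("c", 3, False),
        Output("d", 2, True),
    ]
    print(review_queue(outputs, budget=2))
```

Note that a high-stakes output still reaches a reviewer even when its automated checks pass, which matches the idea that "sounds right" starts a check rather than ending it.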

// related

Tutorials · 13 June 2026 · 9 min read

How I evaluate an AI chatbot before release

A practical evaluation pass for AI chat features: hallucinations, refusals, prompt injection, and the cases with no single right answer.

ai-testing · llm · evaluation

Tutorials · 13 June 2026 · 8 min read

What QA should log when testing AI features

A screenshot isn't a repro when outputs vary. Capture the full assembled prompt, retrieved context, model version, and parameters so an AI bug is actually reproducible.

ai-testing · observability · llm