// tutorial

What QA should log when testing AI features

qa.codes · 13 June 2026 · 8 min read
Intermediate · AI QA · QA Engineers
ai-testing · observability · llm

AI outputs vary, so a screenshot isn't a reproduction. Here's what to capture so an AI bug is reproducible instead of 'it said something weird once'.

Part of: Testing AI products

Every QA habit for reproducing bugs assumes determinism: same steps, same result, screenshot proves it. AI features break that assumption. Run the same prompt twice and you get two different answers — so a screenshot of a bad one proves nothing, and "the AI said something weird" is an unactionable ticket that gets closed as not-reproducible. Testing AI well is as much about what you capture as what you check. This is the observability companion to evaluating an AI chatbot and prompt injection testing.

Why a screenshot isn't enough

With a deterministic bug, the inputs are implied by the steps. With an AI feature, the same visible input can produce wildly different outputs depending on things you can't see — the hidden system prompt, retrieved context, model version, temperature, and prior conversation. Capture only the visible output and you've recorded an effect with no recoverable cause. Reproduction requires capturing the whole input the model actually received, not just the part the user typed.
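One way to make the hidden part of the input visible is to fingerprint everything the model received. The sketch below is illustrative only: the `fingerprint` function and its four arguments are hypothetical names, not part of any particular framework. It shows two requests with identical user text producing different fingerprints because the retrieved context differs.

```python
import hashlib
import json


def fingerprint(system_prompt, context_docs, history, user_text):
    """Hash everything the model received, not only what the user typed."""
    payload = {
        "system_prompt": system_prompt,
        "context_docs": context_docs,
        "history": history,
        "user_text": user_text,
    }
    # sort_keys and fixed separators keep the serialisation stable across runs.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


user_text = "What is your refund policy?"
run_a = fingerprint("You are a support bot.", ["Refunds within 30 days."], [], user_text)
run_b = fingerprint("You are a support bot.", ["Refunds within 14 days."], [], user_text)

# Same visible input, different model input.
print(run_a == run_b)  # False
```

A fingerprint like this does not replace logging the input itself, but attached to a ticket it quickly shows whether two "identical" runs really were identical.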

The minimum to make an AI bug reproducible

Capture these together, every time:

  • The full input to the model — not just the user's text, but the assembled prompt: the system prompt, any injected/retrieved context (RAG documents), and the conversation history. This is the single most-skipped, most-important item.
  • The exact output — verbatim text, copied not screenshotted, including any structured/tool-call parts.
  • Model and version: gpt-4o-2024-xx, claude-…, your fine-tune ID. "The AI" is not a version; behaviour changes across versions and a bug may be model-specific.
  • Generation parameters: temperature, max tokens, top-p, and especially whether the run was non-deterministic. A temperature-0 bug is far more reproducible than a temperature-0.9 one, so note which one you hit.
  • Retrieval/tool trace — for RAG or agents: which documents were fetched, which tools were called with what arguments. Often the bug is in retrieval (wrong/irrelevant context), not the model.
  • Timestamp + session/request ID — to correlate with backend logs and the provider's own logs.

With those, "it said something weird" becomes "given this exact prompt + this context on model X at temp 0, it produced this — expected that," which a developer can actually act on.
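The list above maps naturally onto a single record that travels with the ticket. This is a minimal sketch, assuming a Python test harness; the class name `AIBugRecord`, its field names, and every value in the example (including the model string, which is a placeholder) are assumptions for illustration rather than an established schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AIBugRecord:
    # Full assembled input, not only the user's text.
    system_prompt: str
    retrieved_context: list
    conversation_history: list
    user_text: str
    # Exact output, verbatim.
    output_text: str
    # Model name and version (or fine-tune ID).
    model: str
    # Generation parameters.
    temperature: float
    max_tokens: int
    top_p: float
    # Correlation with backend and provider logs.
    request_id: str
    # Structured parts of the output and the retrieval/tool trace.
    tool_calls: list = field(default_factory=list)
    retrieval_trace: list = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self):
        """Serialise the record so it can be attached to a bug ticket."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


record = AIBugRecord(
    system_prompt="You are a support bot.",
    retrieved_context=["Refunds within 30 days."],
    conversation_history=[],
    user_text="What is your refund policy?",
    output_text="We do not offer refunds.",
    model="example-model-2026-01",  # placeholder, not a real model ID
    temperature=0.0,
    max_tokens=256,
    top_p=1.0,
    request_id="req-123",  # placeholder request ID
    retrieval_trace=[{"doc_id": "refund-policy", "score": 0.91}],
)
print(record.to_json())
```

Making the input fields required means the harness cannot construct a record without the assembled prompt, which is the item most often skipped.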

Distinguish the three failure layers

Good logging lets you locate where it broke, which is half the fix:

  • Retrieval layer: did it fetch the right context? (Garbage in → garbage out isn't a model bug.)
  • Model layer: given correct context, did the model still answer wrong (hallucination, ignored instruction)?
  • Application layer: did the app mangle a fine model response — bad parsing, truncation, wrong rendering, dropped formatting?

A logged trace tells you which layer; a screenshot tells you none.
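As a rough illustration of how a logged trace narrows things down, the sketch below checks the three layers in order and reports the first one that fails. The `trace` dictionary and its keys are hypothetical. The substring match is a deliberately crude stand-in for judging whether an answer is correct, and the raw-versus-rendered comparison assumes the app is meant to show the model's text unchanged.

```python
def locate_failure_layer(trace):
    """Return the first layer whose check fails, or None if all pass.

    `trace` is a hypothetical dict with the keys: expected_doc_ids,
    retrieved_doc_ids, expected_phrase, model_output, rendered_output.
    """
    # Retrieval layer: was every document the answer depends on fetched?
    if not set(trace["expected_doc_ids"]) <= set(trace["retrieved_doc_ids"]):
        return "retrieval"
    # Model layer: given the right context, does the raw output state the fact?
    if trace["expected_phrase"].lower() not in trace["model_output"].lower():
        return "model"
    # Application layer: did the app change the raw output before display?
    if trace["rendered_output"] != trace["model_output"]:
        return "application"
    return None


trace = {
    "expected_doc_ids": ["refund-policy"],
    "retrieved_doc_ids": ["refund-policy", "shipping-faq"],
    "expected_phrase": "30 days",
    "model_output": "Refunds are accepted within 30 days of purchase.",
    "rendered_output": "Refunds are accepted within 30 da",  # truncated by the UI
}
print(locate_failure_layer(trace))  # application
```

The order matters: a retrieval failure makes the model-layer check meaningless, so it is reported first.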

Build it into the product, not just the test

The strongest move: argue for this logging to exist in the product, not just your test harness. If production logs the full prompt, context, model version, and output for AI interactions (with appropriate privacy handling), then production AI failures become reproducible too — and you can evaluate real traffic, not just your test cases. QA flagging "we can't debug AI incidents without this" is a high-value, early intervention. (Mind the privacy angle — prompts and context can contain personal data; logging needs the same care as any sensitive store.)
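The privacy handling mentioned above could start with redaction before a log entry is written. The following is a sketch under a strong simplifying assumption: it masks email addresses only, using a basic pattern, and real personal data takes many more forms. The `redact_value` helper and the `[EMAIL]` placeholder are illustrative names.

```python
import re

# Simple email pattern for illustration; it is not a full address validator.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_value(value):
    """Recursively mask email addresses in strings, lists, and dicts."""
    if isinstance(value, str):
        return EMAIL.sub("[EMAIL]", value)
    if isinstance(value, list):
        return [redact_value(item) for item in value]
    if isinstance(value, dict):
        return {key: redact_value(item) for key, item in value.items()}
    # Numbers, booleans, and None pass through unchanged.
    return value


entry = {
    "user_text": "My email is jane@example.com, where is my order?",
    "history": ["Contact me at jane@example.com"],
    "temperature": 0.0,
}
safe_entry = redact_value(entry)
print(safe_entry["user_text"])  # My email is [EMAIL], where is my order?
```

Redaction trades away some reproducibility: if the bug depends on the masked text, the redacted log alone will not reproduce it, so access-controlled storage of the unredacted record may still be needed.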

Where this fits

This is the observability backbone under the whole testing-AI-products series — it's what makes chatbot evaluation and prompt injection findings reproducible and actionable. The AI for QA hub and prompt library cover the broader toolkit.

What to capture for an AI bug

  • The FULL assembled input: system prompt + retrieved context + conversation history, not just user text
  • The exact output, copied verbatim (incl. tool-call/structured parts)
  • Model name + version + your fine-tune/config ID
  • Generation parameters (temperature etc.) and whether the run was non-deterministic
  • Retrieval/tool trace: documents fetched, tools called with arguments
  • Timestamp + session/request ID to correlate with backend logs
  • Push for this logging in the product, not just the test harness (with privacy handling)
