RAG, agents, and observability
Most production AI products are RAG systems, agent workflows, or both. Each architecture has its own testing surface and its own failure modes. All of them need observability that can turn "the AI is acting weird" into a specific, fixable signal — because without it you are debugging a black box with user complaints as your only telemetry.
Testing RAG systems
RAG has two failure surfaces, retrieval quality and generation faithfulness, and conflating them wastes weeks chasing the wrong cause.
A RAG system fails in two distinct ways that look identical to a user: the response is wrong. The causes, however, are entirely different. Retrieval failure means the relevant context was never fetched: wrong chunks, low precision, or poor recall in the top K. Generation failure means the context was fetched correctly but the model ignored it, misused it, or hallucinated on top of it. If you only test the final output, you can spend weeks improving retrieval when the actual problem is a prompt that does not instruct the model to use the retrieved context.
Test retrieval independently with standard information retrieval metrics. Precision at K is the fraction of the top-K retrieved chunks that are relevant. Recall at K is the fraction of all relevant chunks that appear in the top K. MRR (mean reciprocal rank) averages, across queries, the reciprocal of the rank at which the first relevant chunk appears, so it tells you whether a relevant chunk lands high enough to influence the response. You need a test set of queries with documented ground-truth relevant documents, built from real user traffic, to measure these accurately. Without that test set, you are optimising blind.
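The three retrieval metrics can be sketched in a few lines of plain Python. This is a minimal illustration, not a replacement for an evaluation library: the function names and the document IDs are hypothetical, and precision here divides by K even when fewer than K chunks come back, which is one common convention.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by a relevant chunk."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    ranks = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
    return sum(ranks) / len(ranks) if ranks else 0.0
```

Each query in the test set contributes one ranked list of retrieved IDs and one set of ground-truth IDs. The numbers only mean something if the ground truth was labelled independently of the retriever being tested.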
Test faithfulness separately from retrieval quality. Faithfulness asks: given this set of retrieved chunks, does the generated response actually use them? A model can produce a correct-sounding response that has nothing to do with the retrieved context — it simply draws on training data instead. Testing faithfulness requires checking that specific claims in the response are entailed by specific retrieved chunks, not just that the response sounds plausible given the query. This is an LLM-as-judge task: instruct a judge to identify which claims in the response are grounded in which retrieved chunks, and flag any that are not.
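One way to structure the faithfulness check is to keep the judge behind a small interface, so the bookkeeping can be tested without calling a model. The sketch below assumes the response has already been split into claims, and that a judge takes a claim plus the retrieved chunks and returns the indices of the chunks that support it. The names `Judge`, `check_faithfulness`, and `keyword_stub_judge` are hypothetical, and the stub judge is a substring match used only to make the example runnable; a real judge would be an LLM or entailment model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# A judge maps (claim, chunks) to the indices of the chunks that entail the claim.
Judge = Callable[[str, Sequence[str]], List[int]]


@dataclass
class FaithfulnessReport:
    grounded: Dict[str, List[int]]  # claim -> indices of supporting chunks
    ungrounded: List[str]           # claims with no supporting chunk

    @property
    def score(self) -> float:
        total = len(self.grounded) + len(self.ungrounded)
        return len(self.grounded) / total if total else 1.0


def check_faithfulness(
    claims: Sequence[str], chunks: Sequence[str], judge: Judge
) -> FaithfulnessReport:
    grounded: Dict[str, List[int]] = {}
    ungrounded: List[str] = []
    for claim in claims:
        # Discard indices the judge invented that do not point at a real chunk.
        support = [i for i in judge(claim, chunks) if 0 <= i < len(chunks)]
        if support:
            grounded[claim] = support
        else:
            ungrounded.append(claim)
    return FaithfulnessReport(grounded, ungrounded)


def keyword_stub_judge(claim: str, chunks: Sequence[str]) -> List[int]:
    """Stand-in for a real judge: case-insensitive substring match only."""
    return [i for i, chunk in enumerate(chunks) if claim.lower() in chunk.lower()]
```

Recording which chunk supports which claim, rather than a single pass or fail, is what makes an ungrounded claim traceable when a response is flagged.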
Chunk size, embedding model choice, and reranking configuration all have measurable effects on retrieval quality. Do not tune these by intuition — build a retrieval eval pipeline and measure. The pipeline should run automatically on every configuration change. Changes to chunk size or embedding model can silently degrade retrieval for a specific segment of queries while improving the average, which is invisible without per-query-type analysis. Segment your eval set by query category and check all segments, not just the aggregate.
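The per-segment check can be as simple as grouping scores by query category and comparing each group against its baseline. The following sketch assumes each eval result is a (segment, score) pair. The function names, the segment labels, and the 0.02 tolerance are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

Result = Tuple[str, float]  # (query segment, metric score for one query)


def per_segment_means(results: Iterable[Result]) -> Dict[str, float]:
    buckets = defaultdict(list)
    for segment, score in results:
        buckets[segment].append(score)
    return {segment: mean(scores) for segment, scores in buckets.items()}


def segment_regressions(
    baseline: Iterable[Result], candidate: Iterable[Result], tolerance: float = 0.02
) -> Dict[str, Tuple[float, float]]:
    """Segments whose mean score fell by more than `tolerance`, as (before, after)."""
    before = per_segment_means(baseline)
    after = per_segment_means(candidate)
    return {
        segment: (before[segment], after[segment])
        for segment in before
        if segment in after and after[segment] < before[segment] - tolerance
    }
```

A configuration change can raise the overall mean while one segment falls well below its baseline. The aggregate hides that case and this check surfaces it.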
Testing agent workflows
Agents take multiple steps and accumulate state. A failure mid-flow corrupts everything downstream, and reproducing it is harder because each run may take a different path.
Agent testing has a scope problem that single-turn testing does not. An agent that takes ten steps before failing has nine steps of accumulated state to consider. End-to-end tests catch that something went wrong but rarely tell you which step failed or why. Step-level evaluation — instrumenting each tool call, each reasoning step, each state transition — is the foundation of debuggable agent testing. Without it you are left running the agent again and hoping it takes the same path, which it often will not.
Tool-call argument correctness is a concrete, automatable check that most teams neglect. When an agent calls a search tool, you can verify that the query argument matches the intent extracted from the prior context. When it calls a database mutation, you can verify that the record ID is consistent with earlier retrieval. These are deterministic checks on the agent's external interface, and they do not require re-running the full end-to-end flow. Build a trace schema that captures tool-call arguments and add assertions against it.
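As a sketch of what such a trace schema and assertion might look like, the example below records each tool call and checks that every record ID passed to a mutation was returned by an earlier fetch. The `ToolCall` fields and the tool names `fetch_record` and `update_record` are hypothetical; substitute whatever your agent framework actually emits.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToolCall:
    step: int
    tool: str
    arguments: Dict[str, Any]
    result: Any = None


def check_mutation_ids(trace: List[ToolCall]) -> List[str]:
    """Return one message per update_record call whose record_id was never fetched."""
    seen_ids = set()
    violations = []
    for call in sorted(trace, key=lambda c: c.step):
        if call.tool == "fetch_record" and isinstance(call.result, dict):
            seen_ids.add(call.result.get("id"))
        elif call.tool == "update_record":
            record_id = call.arguments.get("record_id")
            if record_id not in seen_ids:
                violations.append(
                    f"step {call.step}: update_record used unseen record_id {record_id!r}"
                )
    return violations
```

Because the check runs over a recorded trace, it can be applied to stored production traces as well as test runs, without invoking the agent again.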
State corruption between steps is the failure mode that causes the most subtle production bugs. An agent that retrieves a user ID in step 2, fails to store it correctly, and then generates a response in step 8 using a stale or incorrect ID will produce output that is contextually wrong but locally plausible — the kind of bug that passes end-to-end tests but produces support tickets. Test state transitions explicitly: after each step that modifies state, assert that the state is consistent with what the step should have produced.
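Explicit state-transition checks can be sketched as a small runner that pairs each step with an invariant over the state before and after it. This assumes the agent's state can be modelled as a dictionary and its steps as plain functions, which is a simplification. All names here are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Callable[[State], State]
Invariant = Callable[[State, State], None]  # (before, after); raises AssertionError


def run_with_invariants(state: State, steps: List[Tuple[Step, Invariant]]) -> State:
    for index, (step, invariant) in enumerate(steps, start=1):
        before = dict(state)       # shallow snapshot; nested values are shared
        state = step(dict(state))  # the step works on its own shallow copy
        try:
            invariant(before, state)
        except AssertionError as exc:
            raise AssertionError(
                f"state invariant failed after step {index}: {exc}"
            ) from exc
    return state


def lookup_user(state: State) -> State:
    state["user_id"] = "u_42"
    return state


def user_id_is_set(before: State, after: State) -> None:
    assert after.get("user_id"), "user_id was not stored"


def summarise(state: State) -> State:
    state["summary"] = f"summary for {state['user_id']}"
    return state


def user_id_unchanged(before: State, after: State) -> None:
    assert after.get("user_id") == before.get("user_id"), "user_id changed unexpectedly"
```

The failure message names the step where the state first went wrong, rather than the later step where the bad value finally reached the output.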
Refusals and escape hatches are the agentic equivalent of error handling. A well-designed agent should recognise when it is uncertain, when a tool call fails, and when the task is outside its capabilities — and respond appropriately rather than charging ahead with incorrect assumptions. Test these paths deliberately: inject tool failures, provide ambiguous inputs, present the agent with tasks just outside its stated scope. An agent that fails gracefully is vastly more maintainable than one that succeeds until it doesn't.
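Fault injection for these paths can be sketched by passing the tool into the agent as a parameter, so a test can swap in one that fails. The `toy_agent` below is a deliberately trivial stand-in for a real agent, and the status values and `ToolError` type are hypothetical. The point is the shape of the test, not the agent logic.

```python
from typing import Callable, Dict, List


class ToolError(Exception):
    """Raised by a tool when its backing service fails."""


def toy_agent(question: str, search_tool: Callable[[str], List[str]]) -> Dict[str, object]:
    """Stand-in agent: answers only when the search tool returns supporting hits."""
    try:
        hits = search_tool(question)
    except ToolError:
        return {"status": "escalate", "reason": "search_unavailable"}
    if not hits:
        return {"status": "refuse", "reason": "no_supporting_context"}
    return {"status": "answer", "sources": hits}


def failing_tool(query: str) -> List[str]:
    raise ToolError("injected failure")


def empty_tool(query: str) -> List[str]:
    return []


def working_tool(query: str) -> List[str]:
    return ["doc_1"]
```

A test suite built this way asserts on the escape path taken for each injected condition, for example that a tool outage leads to escalation rather than an answer without sources.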
Production observability for AI features
Latency and error rate stay green while hallucination rate triples — production observability for AI requires a different instrument panel from standard APM.
Traditional APM tells you whether your service is up, slow, or erroring. AI features can be up, fast, and returning HTTP 200 while producing wrong answers at an increasing rate. Hallucination rate, refusal rate, fallback rate, and output quality score are the signals that matter — and none of them appear in a standard error-rate dashboard. Instrumentation for AI features requires capturing input, output, and quality signals at the LLM call level, not just at the HTTP response level.
LLM call instrumentation should capture at minimum: the full input (prompt, retrieved context, conversation history), the full output, latency, token counts, cost, and model version. This is more data than you want to store forever, so build a retention and sampling policy from the start. High-volume production features typically need 100% tracing for recent traffic and sampled tracing for historical analysis, with automatic deletion after a defined period to manage cost and compliance.
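A minimal sketch of such a record and a retention rule follows. The field names mirror the list above, while the 7-day full-retention window, 90-day maximum age, and 10% sample rate are placeholder assumptions to be replaced by your own cost and compliance requirements. Sampling hashes the trace ID so the keep-or-drop decision is stable across runs.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass(frozen=True)
class LLMCallRecord:
    trace_id: str
    timestamp: datetime
    prompt: str
    retrieved_context: Tuple[str, ...]
    conversation_history: Tuple[str, ...]
    output: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    model_version: str


def should_retain(
    record: LLMCallRecord,
    now: datetime,
    full_window: timedelta = timedelta(days=7),   # assumed: keep everything this recent
    max_age: timedelta = timedelta(days=90),      # assumed: delete everything older
    sample_rate: float = 0.1,                     # assumed: fraction kept in between
) -> bool:
    age = now - record.timestamp
    if age > max_age:
        return False
    if age <= full_window:
        return True
    # Deterministic sampling: the same trace_id always lands in the same bucket.
    bucket = int(hashlib.sha256(record.trace_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < sample_rate * 10_000
```

Running `should_retain` as a periodic cleanup job over stored traces implements full tracing for recent traffic, sampled tracing for older traffic, and deletion after the maximum age.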
Quality drift is the signal most teams miss until it has already affected users. A gradual degradation in output quality — maybe because the model provider silently updated a checkpoint, maybe because user query patterns shifted — looks like noise in any individual metric but is visible as a trend. Set rolling-window alert thresholds on quality metrics, not just point-in-time thresholds. A quality score that drops from 0.92 to 0.85 over two weeks is a production incident that started two weeks ago.
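A rolling-window check can be sketched by comparing the mean of the most recent window with the mean of an earlier baseline window. The sketch assumes one aggregate quality score per day; the function name, the 7-day window, and the 0.03 drop threshold are illustrative assumptions.

```python
from statistics import mean
from typing import Sequence


def drift_alert(daily_scores: Sequence[float], window: int = 7, max_drop: float = 0.03) -> bool:
    """True when the latest window's mean is more than `max_drop` below the first window's mean."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two full windows
    baseline = mean(daily_scores[:window])
    recent = mean(daily_scores[-window:])
    return baseline - recent > max_drop


# A steady slide from 0.92 to 0.85 over 14 days: no single day drops by even 0.01.
declining = [0.92 - 0.07 * day / 13 for day in range(14)]
```

On the `declining` series, an alert on day-over-day change would stay quiet, while the window comparison fires because the two weekly means differ by more than the threshold.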
The feedback loop from production traces to golden dataset is where observability creates compounding value. When production monitoring flags a cluster of low-quality outputs, the trace data tells you the exact inputs and context that produced them. Those become candidates for the golden dataset — real, production-sourced failure cases that will catch the regression if it recurs. Without this loop, your eval dataset slowly drifts away from the distribution that actually matters in production.
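The promotion step from flagged traces to golden-dataset candidates can be sketched as a filter with deduplication. The trace dictionary keys, the `golden_candidates` name, and the 0.7 quality threshold are assumptions for illustration. In practice a human would normally review each candidate and supply the expected output before it joins the dataset.

```python
from typing import Any, Dict, Iterable, List

Trace = Dict[str, Any]


def golden_candidates(
    traces: Iterable[Trace], golden: Iterable[Dict[str, Any]], quality_threshold: float = 0.7
) -> List[Dict[str, Any]]:
    """Low-quality production traces whose input is not already in the golden dataset."""
    known_inputs = {case["input"] for case in golden}
    candidates = []
    for trace in traces:
        if trace["quality_score"] >= quality_threshold:
            continue
        if trace["input"] in known_inputs:
            continue
        candidates.append(
            {
                "input": trace["input"],
                "context": trace["context"],
                "bad_output": trace["output"],       # kept as a known-bad example
                "source_trace": trace["trace_id"],   # link back to the production trace
            }
        )
        known_inputs.add(trace["input"])  # avoid duplicate candidates in one batch
    return candidates
```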