Agent observability

AI & LLM Testing

// Definition

Instrumentation and tooling that make the behaviour of an AI agent debuggable in production. A multi-step agent that fails mid-flow leaves a different kind of evidence than a crashed service: a tool-call trace, an LLM reasoning chain, a sequence of page snapshots, and a token-and-cost ledger. Agent observability platforms (Laminar, Langfuse, Arize Phoenix, LangSmith, Braintrust) capture this evidence and make it queryable. The distinction from regular APM is the unit of analysis: traditional observability shows you the request that failed, whereas agent observability shows you the decision that was wrong. The hardest signal to capture cleanly is whether a failure came from application flakiness or from an LLM context failure: the two look identical in a trace but require different fixes.
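
As a rough illustration of the kinds of evidence listed above, here is a minimal Python sketch of a per-run trace record. It is not the schema of any of the platforms named; the `Step` and `Trace` classes, their field names, and the sample values are all hypothetical. It shows one way to hold tool calls, LLM steps, snapshot references, and a token-and-cost ledger together, and to walk back from a failing step to the decision that preceded it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One recorded step of an agent run (hypothetical schema)."""
    kind: str                       # "llm" or "tool" (assumed labels)
    name: str                       # model role or tool name
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0
    snapshot: Optional[str] = None  # reference to a stored page snapshot
    error: Optional[str] = None     # set when the step failed


@dataclass
class Trace:
    """All steps of a single agent run, in order."""
    run_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def total_tokens(self) -> int:
        return sum(s.tokens_in + s.tokens_out for s in self.steps)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.steps)

    def first_failure(self) -> Optional[int]:
        """Index of the first step that recorded an error, if any."""
        for i, s in enumerate(self.steps):
            if s.error is not None:
                return i
        return None

    def decision_before_failure(self) -> Optional[Step]:
        """Last LLM step before the first failing step, if any."""
        idx = self.first_failure()
        if idx is None:
            return None
        for s in reversed(self.steps[:idx]):
            if s.kind == "llm":
                return s
        return None


# Example run with made-up values: plan, click, plan again, failed submit.
trace = Trace(run_id="demo-1")
trace.record(Step("llm", "planner", tokens_in=1200, tokens_out=80, cost_usd=0.004))
trace.record(Step("tool", "click", snapshot="snap-001"))
trace.record(Step("llm", "planner", tokens_in=1500, tokens_out=60, cost_usd=0.005))
trace.record(Step("tool", "submit_form", snapshot="snap-002",
                  error="element not found"))
```

Here `trace.first_failure()` points at the failed `submit_form` call, while `trace.decision_before_failure()` returns the second planner step, which is where a reviewer would start looking. The record alone does not say whether that step was a wrong decision or the application was flaky; that still has to be judged from the captured snapshots and context.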

// Related terms