Agent observability

AI & LLM Testing

// Definition

Instrumentation and tooling that make the behaviour of an AI agent debuggable in production. A multi-step agent that fails mid-flow leaves a different kind of evidence than a crashed service: a tool-call trace, an LLM reasoning chain, a sequence of page snapshots, and a token-and-cost ledger. Agent observability platforms such as Laminar, Langfuse, Arize Phoenix, LangSmith, and Braintrust capture this evidence and make it queryable. The distinction from regular APM is the unit of analysis: traditional observability shows you the request that failed; agent observability shows you the decision that was wrong. The hardest signal to capture cleanly is whether a failure came from application flakiness or from an LLM context failure: the two look identical in a trace but require different fixes.
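To make the "tool-call trace plus token-and-cost ledger" idea concrete, here is a minimal sketch in plain Python. It is not the API of any platform named above; `Step`, `AgentTrace`, the step kinds, and the example values are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str              # "llm" or "tool" (hypothetical labels)
    name: str              # which prompt or tool ran
    ok: bool               # did this step succeed?
    tokens: int = 0        # tokens consumed, if it was an LLM step
    cost_usd: float = 0.0  # estimated spend for this step
    detail: str = ""       # free-form error or note


@dataclass
class AgentTrace:
    run_id: str
    steps: list = field(default_factory=list)

    def record(self, kind, name, ok, tokens=0, cost_usd=0.0, detail=""):
        """Append one step to the trace in execution order."""
        self.steps.append(Step(kind, name, ok, tokens, cost_usd, detail))

    def total_tokens(self):
        return sum(s.tokens for s in self.steps)

    def total_cost(self):
        return sum(s.cost_usd for s in self.steps)

    def first_failure(self):
        """Return the earliest failed step, or None if every step succeeded."""
        return next((s for s in self.steps if not s.ok), None)


# Illustrative run: the values below are made up for the example.
trace = AgentTrace(run_id="demo-run")
trace.record("llm", "plan", ok=True, tokens=420, cost_usd=0.0021)
trace.record("tool", "click_checkout", ok=False, detail="element not found")
trace.record("llm", "recover", ok=True, tokens=310, cost_usd=0.0016)

failure = trace.first_failure()
```

Querying `first_failure()` returns the step where the run went wrong rather than the request that errored, which is the unit-of-analysis shift described above. A real platform would also persist the prompts, model outputs, and snapshots attached to each step.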

// Related terms

Model Context Protocol (MCP)
An open standard introduced by Anthropic in late 2024 that lets AI assistants connect to external tools and data sources through a uniform JSON-RPC interface. An MCP server exposes tools (callable functions), resources (readable data), and prompts (templates) to any MCP-compatible host, such as Claude Desktop, Claude Code, or an IDE plugin. Build a server once and any compliant client can use it: the protocol is model-agnostic, which makes integrations portable across AI providers.
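As a rough illustration of the JSON-RPC interface, the sketch below builds the kind of message a client sends to ask an MCP server to run a tool. The `tools/call` method and the `name`/`arguments` parameters follow the MCP specification as we understand it; the helper `make_tool_call`, the tool name `get_weather`, and its arguments are hypothetical, and transport and session setup are omitted.

```python
import json


def make_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request asking an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "get_weather" is a made-up tool used only for this example.
request = make_tool_call(1, "get_weather", {"city": "Berlin"})
wire = json.dumps(request)  # the serialized message sent over the transport
```

Because every server accepts the same message shape, the client code does not change when the tool behind it does, which is where the portability comes from.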
Eval harness
Software that runs an LLM-backed system against a dataset of inputs, scores the outputs against criteria (exact match, similarity, LLM-as-judge, custom rubric), and tracks how scores change across model versions, prompts, or code changes. Eval harnesses are to AI features what test runners are to deterministic code: the place CI calls into, the place regressions get caught, the place quality is measured rather than asserted. The 2026 ecosystem has fragmented rather than consolidated: Braintrust is eval-first; Langfuse is prompt-first (acquired by ClickHouse in January); Laminar is built for agent debugging; Arize Phoenix is OpenTelemetry-native. Most teams pick one platform per workflow rather than expecting one tool to cover everything.
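The run-score-track loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not any vendor's harness: `toy_system` stands in for the LLM-backed system, the dataset and the 0.9 threshold are invented, and only the exact-match criterion is shown.

```python
def exact_match(output, expected):
    """Score 1.0 when the output equals the expected answer, ignoring case and edge whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(system, dataset, scorer=exact_match):
    """Run the system on every case and return the mean score."""
    scores = [scorer(system(case["input"]), case["expected"]) for case in dataset]
    return sum(scores) / len(scores)


def toy_system(prompt):
    # Hypothetical stand-in for a model call; it answers one case wrongly on purpose.
    canned = {"Capital of France?": "Paris", "2 + 2?": "5"}
    return canned.get(prompt, "")


dataset = [
    {"input": "Capital of France?", "expected": "paris"},
    {"input": "2 + 2?", "expected": "4"},
]

mean_score = run_eval(toy_system, dataset)
THRESHOLD = 0.9                      # assumed quality bar for this example
passed = mean_score >= THRESHOLD     # a CI job would fail the build when this is False
```

Swapping `exact_match` for a similarity function or an LLM-as-judge call changes the criterion without changing the loop, and storing `mean_score` per model version or prompt is what turns a one-off check into regression tracking.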