Agentic testing

AI & LLM Testing

// Definition

A testing approach where an AI agent, not a pre-written script, drives the test session. You hand the agent a goal in plain English ("complete checkout with a guest account and verify the success modal") and it inspects the page, decides what to do next, executes the action, observes the result, and iterates until the goal is reached or it gives up. The architectural shift from deterministic automation is significant: with scripted tests you know exactly which steps will run; with agentic tests you only know the intent. That's both the appeal (resilient to UI change) and the risk (a confident agent doing the wrong thing at scale costs more than a flaky scripted test). Practitioner consensus in 2026 is that agentic testing pays off above roughly 200 stable tests with mature locator strategies; below that, integration overhead exceeds the maintenance savings.
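
The inspect, decide, act, observe loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real framework: `FakePage`, `scripted_policy`, and `run_agent` are hypothetical names, the page is a toy state machine rather than a browser, and the `decide` callable stands in for what would be a model call in an actual agent. The step budget is the "gives up" condition.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser page (a real run would wrap a browser driver)."""
    state: str = "cart"
    log: list = field(default_factory=list)

    def observe(self) -> str:
        # Inspect: report what the agent can currently "see".
        return self.state

    def act(self, action: str) -> None:
        # Toy transition table: (current state, action) -> next state.
        transitions = {
            ("cart", "click_guest_checkout"): "shipping",
            ("shipping", "submit_address"): "payment",
            ("payment", "pay"): "success_modal",
        }
        self.log.append(action)
        # Unknown actions leave the page where it was.
        self.state = transitions.get((self.state, action), self.state)


def scripted_policy(goal: str, observation: str) -> Optional[str]:
    """Placeholder for the model call: maps an observation to the next action."""
    plan = {
        "cart": "click_guest_checkout",
        "shipping": "submit_address",
        "payment": "pay",
    }
    return plan.get(observation)  # None means "no idea what to do next"


def run_agent(
    goal: str,
    page: FakePage,
    decide: Callable[[str, str], Optional[str]],
    goal_reached: Callable[[str], bool],
    max_steps: int = 10,
) -> bool:
    """Loop until the goal is reached, the agent has no action, or the budget runs out."""
    for _ in range(max_steps):
        observation = page.observe()
        if goal_reached(observation):
            return True
        action = decide(goal, observation)
        if action is None:
            return False  # the agent gives up
        page.act(action)
    return goal_reached(page.observe())


if __name__ == "__main__":
    page = FakePage()
    ok = run_agent(
        "complete checkout with a guest account and verify the success modal",
        page,
        scripted_policy,
        lambda obs: obs == "success_modal",
    )
    print(ok, page.log)
```

Note that the caller supplies only the goal and a success check; the sequence of actions is chosen at run time. Logging each action, as `page.log` does here, is what makes an agentic run auditable afterwards.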

// Related terms

Model Context Protocol (MCP)
An open standard introduced by Anthropic in late 2024 that lets AI assistants connect to external tools and data sources through a uniform JSON-RPC interface. An MCP server exposes tools (callable functions), resources (readable data), and prompts (templates) to any MCP-compatible host (Claude Desktop, Claude Code, IDE plugins). Build a server once and any compliant client can use it — the protocol is model-agnostic, which makes integrations portable across AI providers.
Large Language Model (LLM)
A neural network trained on massive text datasets to predict the next token (roughly, the next word or word fragment) in a sequence. Modern LLMs like Claude, GPT-4, and Gemini can answer questions, write code, summarise documents, and follow multi-step instructions, but they don't 'know' anything; they predict plausible continuations from patterns in training data. This is why they sometimes produce confident-sounding falsehoods (hallucinations) and why prompt design matters so much. In QA, LLMs are useful for generating test scaffolding, summarising bug reports, and drafting documentation, but their output always needs human review before it ships.
Hallucination
When an AI model generates output that is fluent, confident, and completely wrong. In QA work this often looks like an LLM inventing a method that doesn't exist on a real API, citing a documentation page that was never written, or producing a test assertion that doesn't actually verify the behaviour described in the prompt. Hallucinations aren't a bug — they're a consequence of how language models work, predicting likely text rather than retrieving facts. The mitigations are: ground the model in real context (paste the actual API spec, not its name), verify generated code by running it, and treat any AI-produced reference (URLs, function names, citations) as untrusted until checked.
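
One of the mitigations above, treating AI-produced function names as untrusted until checked, can be partly automated. The sketch below is one possible approach, not an established tool: `unknown_attributes` is a hypothetical helper that parses generated Python with the standard `ast` module and reports any `module.attr` reference that the real, imported module does not have. It assumes the generated code refers to the module by its own name (no aliases) and only inspects direct attribute access.

```python
import ast
import importlib


def unknown_attributes(source: str, module_name: str) -> list[str]:
    """Return names used as `module_name.<attr>` in `source` that the real module lacks."""
    module = importlib.import_module(module_name)
    missing = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == module_name
            and not hasattr(module, node.attr)
        ):
            missing.add(node.attr)
    return sorted(missing)


if __name__ == "__main__":
    # Example of model output that invents a method: `json` has no `parse`.
    generated = (
        "import json\n"
        "data = json.parse('{}')\n"
        "text = json.dumps(data)\n"
    )
    print(unknown_attributes(generated, "json"))  # prints ['parse']
```

A static check like this only catches invented names. It says nothing about whether an assertion verifies the intended behaviour, so running the generated code and reviewing it by hand remain necessary.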