Agentic testing: what it is, what it isn't
Two activities both get called "AI testing" and they are not the same. An AI agent that drives a real browser session — observing the page, deciding what to click, iterating toward a goal — is doing testing work. An AI coding assistant that generates test code is helping you do testing work. The architectural choices, failure modes, and ROI questions are different for each, and confusing them leads teams to evaluate the wrong thing and buy the wrong tools.
Agentic vs assistive AI in QA
The critical distinction is whether AI is in the execution loop or outside it — that determines the failure modes and the architecture.
Assistive AI in QA sits outside the execution loop. GitHub Copilot suggesting a test case, Claude generating a page object, Cursor writing a fixture — in all of these, a human reviews and accepts the output before it runs. The failure mode is a bad suggestion that a human catches and rejects. The blast radius is low: the developer sees the suggestion, evaluates it, and discards it if it is wrong. The ROI question is productivity: does it help engineers write tests faster and with better coverage?
Agentic AI in QA sits inside the execution loop. The agent inspects the live browser, decides what action to take, executes it, observes the result, and iterates, with no human reviewing each step. The failure mode is a confident agent doing the wrong thing at scale: reporting a bug that does not exist, missing a regression because it navigated the wrong path, or burning through its token budget in an unrecoverable state. The blast radius is wider than with assistive AI, and the failure is often less visible than a flaky scripted test.
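The observe, decide, act, iterate cycle can be sketched in a few lines. This is a minimal illustration, not any product's implementation: `run_agent`, `RunResult`, `FakeApp`, and `scripted_policy` are hypothetical names, the "browser" is a three-page toy state machine, and the policy is a lookup table standing in for a model call. The step budget is the part worth noticing, since it is what bounds the cost of an unrecoverable state.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    outcome: str  # "goal_reached", "stuck", or "budget_exhausted"
    steps: int    # number of actions actually executed
    trace: list = field(default_factory=list)  # (state, action) pairs for later debugging


def run_agent(observe, decide, act, goal_reached, max_steps=10):
    """Run the loop until the goal is met, the policy gives up, or the budget runs out."""
    trace = []
    for executed in range(max_steps):
        state = observe()
        if goal_reached(state):
            return RunResult("goal_reached", executed, trace)
        action = decide(state)
        if action is None:  # the policy has no idea what to do from here
            return RunResult("stuck", executed, trace)
        act(action)
        trace.append((state, action))
    # Budget spent: check once more whether the last action reached the goal.
    outcome = "goal_reached" if goal_reached(observe()) else "budget_exhausted"
    return RunResult(outcome, max_steps, trace)


class FakeApp:
    """Hypothetical stand-in for a live browser session."""
    TRANSITIONS = {
        ("cart", "click checkout"): "payment",
        ("payment", "click pay"): "confirmation",
    }

    def __init__(self):
        self.page = "cart"

    def observe(self):
        return self.page

    def act(self, action):
        # Unknown actions leave the page unchanged, like a click that does nothing.
        self.page = self.TRANSITIONS.get((self.page, action), self.page)


def scripted_policy(state):
    """Stand-in for the model: maps the observed page to the next action."""
    return {"cart": "click checkout", "payment": "click pay"}.get(state)


app = FakeApp()
result = run_agent(app.observe, scripted_policy, app.act,
                   goal_reached=lambda page: page == "confirmation")
print(result.outcome, result.steps)
```

Recording the trace on every run is a deliberate choice in this sketch: when a run ends in `stuck` or `budget_exhausted`, the sequence of observed states and chosen actions is the only evidence of which path the agent took.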
This is not an argument against agentic testing — it is an argument for understanding which you are building. Most teams benefit from both: assistive AI for test authoring and maintenance, agentic AI for specific workloads where script brittleness has become the dominant cost. The tools are different, the evaluation criteria are different, and the organisational readiness requirements are different. Start with clarity about which problem you are solving before evaluating products.
The 200-test readiness floor
Practitioner consensus in 2026 is that agentic testing pays off above roughly 200 stable tests — below that, the integration overhead exceeds the maintenance savings.
The figure of ~200 stable tests as the readiness floor comes from practitioner experience, not vendor marketing. Below that number, the integration cost of setting up agent infrastructure, dealing with non-deterministic agent runs in CI, and building the observability to tell whether agent failures are flakiness or real regressions typically exceeds the maintenance savings from not having to maintain brittle locators. This is not a universal law — teams with unusual locator brittleness may find the break-even lower — but it is a useful prior.
"Stable" does the real work in that sentence. A suite of 200 tests where 40 are permanently skipped or consistently flaky is not a suite of 200 stable tests. Stable means deterministic locators (not XPath that breaks on UI refactors), normalised CI (tests that pass reliably when run on the CI infrastructure, not just locally), and standardised reporting that distinguishes genuine failures from infrastructure noise. If your test suite does not have these properties, adding agentic testing on top adds a second layer of non-determinism on top of the first, which makes debugging significantly harder.
Below the floor, the better investment is usually stabilising and expanding the scripted suite. Deterministic locators, proper fixture management, reduced flakiness — these compound. A 200-test stable suite with clean CI is a better foundation for agentic testing than a 400-test suite where 150 tests are red for reasons no one has investigated. Agentic testing amplifies the quality of the existing suite; it does not substitute for it.
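One way to check a suite against the floor is to count only the tests that actually ran and passed reliably in recent CI history. The sketch below assumes a hypothetical `history` mapping from test name to a list of recent CI outcomes; the thresholds (`min_runs=20`, `min_pass_rate=0.99`) are illustrative assumptions, not figures from practitioner consensus.

```python
def stable_tests(history, min_runs=20, min_pass_rate=0.99):
    """Return the names of tests that ran often enough and passed reliably.

    history: dict mapping test name -> list of "pass" / "fail" / "skip" outcomes.
    Skipped runs do not count as executions, so a permanently skipped test
    can never qualify as stable.
    """
    stable = []
    for name, outcomes in history.items():
        executed = [o for o in outcomes if o != "skip"]
        if len(executed) < min_runs:
            continue  # not enough evidence (or permanently skipped)
        pass_rate = executed.count("pass") / len(executed)
        if pass_rate >= min_pass_rate:
            stable.append(name)
    return sorted(stable)


# Hypothetical CI history for three tests.
history = {
    "test_login": ["pass"] * 20,
    "test_checkout": ["pass"] * 15 + ["fail"] * 5,  # consistently flaky
    "test_export": ["skip"] * 20,                   # permanently skipped
}

stable = stable_tests(history)
print(f"{len(stable)} of {len(history)} tests are stable: {stable}")
```

Comparing `len(stable)` rather than the raw test count against the ~200 floor keeps skipped and flaky tests from inflating the number.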
What agentic testing actually buys you
Resilience to UI change and exploratory coverage that scripted tests cannot reach — at the cost of token spend, non-determinism in runs, and a new class of observability problem.
The primary benefit of agentic testing is resilience to UI change. A scripted test that clicks a button by its CSS class breaks when the class changes. An agent that has been told to "click the checkout button" finds the button by reasoning about the page content, even if the class, position, or surrounding markup has changed. For products with frequent UI iteration, this translates directly into reduced test maintenance overhead — which is the dominant engineering cost in many mature test suites.
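The difference between the two lookup strategies can be shown with a toy page model. This is a simplified sketch: the pages are plain lists of dictionaries rather than a real DOM, and `find_by_intent` uses a keyword match as a stand-in for the model's reasoning over page content. All names here are hypothetical.

```python
def find_by_class(elements, css_class):
    """Scripted-style lookup: depends on an exact CSS class."""
    return next((e for e in elements if css_class in e["classes"]), None)


def find_by_intent(elements, role, keyword):
    """Stand-in for agent reasoning: match on role and visible text, ignore markup."""
    keyword = keyword.lower()
    return next(
        (e for e in elements if e["role"] == role and keyword in e["text"].lower()),
        None,
    )


# The same page before and after a UI refactor (hypothetical markup).
page_before = [
    {"role": "link", "classes": ["nav"], "text": "Home"},
    {"role": "button", "classes": ["btn-checkout"], "text": "Checkout"},
]
page_after = [
    {"role": "button", "classes": ["cta-primary"], "text": "Proceed to checkout"},
    {"role": "link", "classes": ["nav"], "text": "Home"},
]

# The class-based lookup works before the refactor and breaks after it.
print(find_by_class(page_before, "btn-checkout") is not None)
print(find_by_class(page_after, "btn-checkout") is not None)

# The intent-based lookup survives the class, position, and label change.
print(find_by_intent(page_after, "button", "checkout")["text"])
```

A keyword match is far cruder than what an agent does, and it can also pick the wrong element when two controls share similar text, which is one source of the wrong-path failures described earlier.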
Exploratory coverage is the second benefit and the harder one to measure. An agent given a high-level goal will often navigate paths that a script never covers, because scripts are written by humans who imagine the happy path and the known edge cases. An agent discovering the interface fresh may find states and transitions that escaped the script author. Whether those discovered paths surface real bugs depends on whether the agent has meaningful assertions — a goal of "complete checkout" with no assertion on the final state is exploration without testing.
The cost side deserves honesty. Token spend for agent-driven test runs is not trivial: a DOM-driven agent running a moderately complex user journey, with token consumption typical of Playwright MCP, can cost $0.10-0.50 per run depending on the model and the number of steps. At 1,000 CI runs per day across a large suite, that is $100-500 per day, which is a material line item. Non-determinism in agent runs also creates a new class of CI failure: did the agent fail because the product regressed, or because it navigated a different path this run? Distinguishing those two requires the same observability investment you would make for any non-deterministic test layer.
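The arithmetic behind that line item is simple enough to script for your own run volume. The sketch below uses the per-run range quoted above ($0.10-0.50) and works in integer cents to avoid floating-point drift; the function name `spend_range_usd` and the 30-day month are assumptions for illustration.

```python
def spend_range_usd(runs_per_day, low_cents=10, high_cents=50, days=1):
    """Return (low, high) spend in US dollars for the given run volume and period."""
    low = runs_per_day * low_cents * days / 100
    high = runs_per_day * high_cents * days / 100
    return low, high


daily = spend_range_usd(1_000)
monthly = spend_range_usd(1_000, days=30)  # assumes a 30-day month of daily CI

print(f"Daily:   ${daily[0]:,.0f} to ${daily[1]:,.0f}")      # Daily:   $100 to $500
print(f"Monthly: ${monthly[0]:,.0f} to ${monthly[1]:,.0f}")  # Monthly: $3,000 to $15,000
```

The range scales linearly with run count, so running agent-driven tests only on the journeys where locator brittleness dominates, rather than across the whole suite, reduces the bill proportionally.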