Cost, latency, and caching for agent-driven tests

10 min read · Reviewed May 2026 · economics

Agentic tests can run at $50–200 per day for 10,000 actions before you start tuning, and that figure assumes mid-tier models on DOM-driven stacks. The cost differences between architectures are large enough to be a deciding factor, in many cases bigger than the reliability differences. Understand the economics before you commit to a stack.

Token economics by stack

DOM-driven stacks cost roughly 4–8x less per action than vision-driven ones, which is enough to determine stack selection for high-volume workloads.

Playwright MCP is the most token-efficient production stack for standard web UIs. Accessibility-tree snapshots are compact structured data, typically 5–20K tokens per page observation depending on UI complexity. A complete agent task covering five pages with three actions each (15 actions) might run 60–100K tokens end-to-end with a frontier model. At Claude 3.5 Sonnet rates (May 2026), that is roughly $0.18–0.30 per task, which works out to about $3 per million tokens. For 10,000 tasks per day (150,000 actions), that is $1,800–3,000 per day before any optimisation.
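The arithmetic behind those figures can be reproduced in a few lines. This is a minimal sketch that assumes every token is billed at one flat rate of $3 per million, which is the rate implied by the per-task figures above; real pricing distinguishes input from output tokens, and the function names are illustrative.

```python
# Illustrative arithmetic only. The flat rate is an assumption inferred from
# the per-task figures in the text, not a quoted price list.
PRICE_PER_MILLION_TOKENS_USD = 3.00


def task_cost_usd(tokens_per_task: int,
                  price_per_million: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Cost of one task when all tokens are billed at a single flat rate."""
    return tokens_per_task / 1_000_000 * price_per_million


def daily_cost_usd(tokens_per_task: int, tasks_per_day: int) -> float:
    """Daily spend for a fixed number of identical tasks."""
    return task_cost_usd(tokens_per_task) * tasks_per_day


low_tokens, high_tokens = 60_000, 100_000   # token range per task from the text
tasks_per_day = 10_000

print(f"per task: ${task_cost_usd(low_tokens):.2f} to ${task_cost_usd(high_tokens):.2f}")
print(f"per day:  ${daily_cost_usd(low_tokens, tasks_per_day):,.0f} "
      f"to ${daily_cost_usd(high_tokens, tasks_per_day):,.0f}")
```

Swapping in your own token counts and current prices is the quickest sanity check before committing to a stack.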

Anthropic Computer Use operates on screenshots — multimodal inputs that are significantly more expensive to process. Each screenshot observation may run 2,000–5,000 tokens for the image alone, before any reasoning tokens. A task requiring 10 screenshot observations to complete can cost 3–8x more than the same task via DOM-driven approaches. The difference is justified for workloads that DOM cannot reach, but it makes Computer Use unsuitable for high-volume regression testing on standard web UIs.

Stagehand and Browser Use sit between those extremes. Both use accessibility-tree-based approaches similar to Playwright MCP but add overhead for framework abstractions, optional vision fallbacks, and managed infrastructure. In practice, per-task costs run 20–40% higher than raw Playwright MCP for equivalent tasks. Whether that overhead is worth the operational simplicity of managed cloud infrastructure depends on your team's infrastructure capacity and the volume you are running.

Practical estimation methodology: instrument five representative tasks end-to-end with your chosen stack and model, record token counts per task, and extrapolate to daily volume. Do not use vendor-provided estimates — they are typically derived from simple tasks on fast pages, not real production workloads with slow APIs, dynamic content, and login flows.
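The extrapolation step might look like the sketch below. The sample names and token counts are hypothetical placeholders for your own measurements, and the two prices are assumed values to be replaced with your provider's current rates. Reporting the worst sampled task alongside the mean is a design choice made here, since five samples give only a rough spread.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class TaskSample:
    """Token counts recorded from one instrumented end-to-end task run."""
    name: str
    input_tokens: int
    output_tokens: int


def cost_per_task_usd(sample: TaskSample,
                      input_price_per_m: float,
                      output_price_per_m: float) -> float:
    """Price one measured task, billing input and output tokens separately."""
    return (sample.input_tokens * input_price_per_m
            + sample.output_tokens * output_price_per_m) / 1_000_000


def estimate_daily_cost_usd(samples: List[TaskSample],
                            tasks_per_day: int,
                            input_price_per_m: float,
                            output_price_per_m: float) -> Tuple[float, float]:
    """Return (typical, worst-case) daily spend extrapolated from the samples."""
    costs = [cost_per_task_usd(s, input_price_per_m, output_price_per_m)
             for s in samples]
    return mean(costs) * tasks_per_day, max(costs) * tasks_per_day


# Hypothetical measurements: replace with token counts from your own runs.
samples = [
    TaskSample("login", 42_000, 1_200),
    TaskSample("search", 65_000, 1_800),
    TaskSample("checkout", 98_000, 3_100),
    TaskSample("profile-edit", 57_000, 1_500),
    TaskSample("report-export", 81_000, 2_400),
]

typical, worst = estimate_daily_cost_usd(
    samples, tasks_per_day=10_000,
    input_price_per_m=3.0,    # assumed price, USD per 1M input tokens
    output_price_per_m=15.0,  # assumed price, USD per 1M output tokens
)
print(f"typical: ${typical:,.0f}/day, worst case: ${worst:,.0f}/day")
```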

Model selection within a stack

Routing simpler steps to cheaper models can cut per-run cost by 60–80%, but the quality trade-off is real and requires measurement.

Not every step in an agent workflow requires a frontier model. Navigation steps — "go to the checkout page", "click the submit button" — are mechanically simple and can often be handled correctly by cheaper models. Reasoning steps — "determine whether this form error message indicates a product bug or a validation state that is expected" — genuinely benefit from frontier model capability. The insight that drives model routing is that the cost distribution is highly skewed: a small number of hard steps consume most of the reasoning budget.

The practical approach is a planning loop that uses a cheaper model (Claude 3 Haiku, GPT-4.1-mini, or equivalent) for action selection on straightforward steps and escalates to a frontier model when confidence is low or the step requires genuine reasoning. "Confidence is low" can be operationalised as: multiple candidate actions with similar accessibility-tree scores, a step that has failed twice in the current session, or a step type that the planning loop has flagged as historically unreliable.
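The three escalation signals above can be sketched as a small routing function. Everything here is an assumption for illustration: the model names are placeholders, the `StepContext` fields are a hypothetical shape for what a planning loop might track, and the thresholds need tuning against your own workload.

```python
from dataclasses import dataclass
from typing import AbstractSet, List

CHEAP_MODEL = "cheap-model"        # placeholder, not a real model identifier
FRONTIER_MODEL = "frontier-model"  # placeholder, not a real model identifier


@dataclass
class StepContext:
    """What the planning loop knows about the step it is about to run."""
    step_type: str
    candidate_scores: List[float]   # one score per candidate action
    failures_this_session: int = 0


def needs_escalation(ctx: StepContext,
                     unreliable_step_types: AbstractSet[str],
                     score_margin: float = 0.1,
                     max_failures: int = 2) -> bool:
    """Return True when any of the three low-confidence signals fires."""
    ranked = sorted(ctx.candidate_scores, reverse=True)
    # Signal 1: the top two candidates score too closely to separate.
    ambiguous = len(ranked) >= 2 and (ranked[0] - ranked[1]) < score_margin
    # Signal 2: the step has already failed repeatedly in this session.
    repeatedly_failed = ctx.failures_this_session >= max_failures
    # Signal 3: this step type has been flagged as historically unreliable.
    flagged = ctx.step_type in unreliable_step_types
    return ambiguous or repeatedly_failed or flagged


def choose_model(ctx: StepContext, unreliable_step_types: AbstractSet[str]) -> str:
    """Route to the frontier model only when escalation is warranted."""
    if needs_escalation(ctx, unreliable_step_types):
        return FRONTIER_MODEL
    return CHEAP_MODEL


unreliable = {"file-upload"}
print(choose_model(StepContext("navigate", [0.9, 0.3]), unreliable))
print(choose_model(StepContext("navigate", [0.62, 0.60]), unreliable))
```

In this sketch the first call stays on the cheap model because one candidate clearly wins, while the second escalates because the top two scores are within the margin.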

The quality cost is real and varies by workload. For highly structured regression suites on well-understood pages, the cheap-model path works surprisingly well: the variance from model capability is much smaller than the variance from page-state differences. For exploratory testing or debugging sessions where the agent needs to reason about unexpected state, the cheap-model path produces significantly more failures. Measure on your own workload before deploying, and do not extrapolate from benchmarks that were not designed for your application.

Action caching — the Passmark pattern

Caching agent-discovered actions lets regression suites run at near-zero LLM cost after the first pass, in exchange for the same brittleness that scripted tests have.

The Passmark pattern (named for the QA tooling company that documented it publicly) works as follows: on first run, an agent discovers and executes all actions for a test case, recording exactly what it clicked, filled, and submitted. On subsequent runs, the cached action sequence replays deterministically without calling the LLM. The LLM is only re-invoked when the replay fails — indicating a UI change that requires the agent to re-discover the correct action. This gives you near-zero LLM cost for regression suites that are not actively changing.
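The replay-then-rediscover loop can be sketched as follows. This is not Passmark's implementation: the `ActionCache` class, the `Action` shape, and the `discover` and `execute` callables are hypothetical stand-ins for an LLM agent and a deterministic browser driver, and the cache is held in memory rather than persisted.

```python
from typing import Callable, Dict, List

Action = Dict[str, str]  # hypothetical shape, e.g. {"op": "click", "target": "#submit"}


class ActionCache:
    """Replay cached actions; fall back to agent discovery when replay fails."""

    def __init__(self,
                 discover: Callable[[str], List[Action]],
                 execute: Callable[[Action], bool]) -> None:
        self._discover = discover  # expensive: runs the LLM agent, returns its actions
        self._execute = execute    # cheap: runs one action, returns True on success
        self._cache: Dict[str, List[Action]] = {}
        self.llm_runs = 0          # how many times the agent had to be invoked

    def run(self, test_id: str) -> str:
        cached = self._cache.get(test_id)
        # all() stops at the first failing action, so a broken replay ends early.
        if cached is not None and all(self._execute(a) for a in cached):
            return "replayed"
        # Cache miss or failed replay. Assumption: discover() restarts the test
        # from a clean page state before the agent acts.
        self.llm_runs += 1
        self._cache[test_id] = self._discover(test_id)
        return "discovered"


# Toy stand-ins: the "UI" is a dict of selectors that currently exist.
ui = {"submit": "#submit"}


def fake_discover(test_id: str) -> List[Action]:
    return [{"op": "click", "target": ui["submit"]}]


def fake_execute(action: Action) -> bool:
    return action["target"] in ui.values()


runner = ActionCache(fake_discover, fake_execute)
print(runner.run("checkout"))   # discovered (first pass builds the cache)
print(runner.run("checkout"))   # replayed (no LLM call)
ui["submit"] = "#place-order"   # simulate a UI change
print(runner.run("checkout"))   # discovered (replay failed, agent re-invoked)
```

A production version would also need to persist the cache between runs and decide how many replay failures justify overwriting a cached sequence.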

The trade-off is the one that deterministic tests always carry: a cached action sequence is brittle in exactly the same ways a scripted test is brittle, just discovered dynamically on first run rather than hand-authored. When the UI changes and the cached action fails, the agent re-discovers — which is the resilience you paid for. But the re-discovery cost is incurred on every run until the cache is updated, which can be expensive if UI churn is high.

The architecture makes most sense for stable regression suites that you run frequently and whose UIs change infrequently. The first pass — building the cache — is the expensive one. Once cached, the suite runs at the cost of Playwright test execution plus the occasional re-discovery event. For a 200-test stable suite on a product with fortnightly UI changes, the economics are compelling. For a product whose UI changes daily, the re-discovery cost may negate most of the savings.
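A rough way to check which side of that line a suite falls on is to compare steady-state daily LLM spend with and without the cache. The model below is a deliberate simplification and all inputs are hypothetical: it ignores the one-off cost of building the cache, and it assumes each UI change invalidates a fixed fraction of tests, each of which is re-discovered once.

```python
def uncached_daily_llm_cost(tests: int, runs_per_day: float,
                            cost_per_discovery: float) -> float:
    """Every run of every test calls the agent."""
    return tests * runs_per_day * cost_per_discovery


def cached_daily_llm_cost(tests: int, runs_per_day: float,
                          days_between_ui_changes: float,
                          fraction_invalidated: float,
                          cost_per_discovery: float) -> float:
    """Only invalidated tests call the agent, once per UI change."""
    rediscoveries_per_test_per_day = fraction_invalidated / days_between_ui_changes
    # A test cannot be re-discovered more often than it is run.
    rediscoveries_per_test_per_day = min(runs_per_day, rediscoveries_per_test_per_day)
    return tests * rediscoveries_per_test_per_day * cost_per_discovery


# Hypothetical inputs: 200 tests, 10 suite runs per day, $0.24 per agent-driven
# test, and 30% of tests invalidated by each UI change.
baseline = uncached_daily_llm_cost(200, 10, 0.24)
fortnightly = cached_daily_llm_cost(200, 10, 14, 0.30, 0.24)
daily = cached_daily_llm_cost(200, 10, 1, 0.30, 0.24)

print(f"no cache:           ${baseline:,.2f}/day")
print(f"fortnightly churn:  ${fortnightly:,.2f}/day")
print(f"daily churn:        ${daily:,.2f}/day")
```

Under this model the savings shrink as runs become less frequent or as each change invalidates more of the suite; a suite run once a day whose UI change breaks every test saves nothing.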
