AI-generated automation scripts

12 min read · Reviewed May 2026 · Scripting score: partial (vendor surface is volatile; reviewed monthly)

Sixty-three per cent of practitioners in the World Quality Report 2025-26 cite AI-assisted test authoring as their most common GenAI application in QA — the highest-ranked use case in the survey. The premise is solid: AI coding agents writing Playwright and Cypress tests is not experimental. The interesting question is which agent fits which workflow, and where the cliff is — the point where "saves you 30 minutes" tips into "costs you 3 hours of debugging". That boundary is well-documented after six months of daily use in production automation teams, and it is more predictable than most teams expect.

Difficulty: intermediate

You'll learn: Where AI coding agents reliably win at test authoring, where they fail, and what six months of daily use teaches you about the boundary.

The shape of the workflow

Five steps from user story to CI signal — the agent's contribution is the middle three.

AI-generated test authoring is not a single-step operation. It sits inside a workflow where a human provides context at the start and verification at the end. The agent handles the middle: reading the repository, drafting the script, suggesting selectors. The human determines whether the resulting test actually verifies the right behaviour — a decision that cannot be delegated.

The most reliable workflow starts with a user story or ticket, gives the agent access to the repository so it can read existing test patterns and selector conventions, and produces a draft that an engineer reviews before it enters CI. Skipping the human review step is the most common source of tests that pass every run but verify nothing meaningful.

Figure: User story to CI run, the canonical workflow

Vendor landscape, May 2026

Four agents, four distinct positioning decisions — the choice depends on where your team lives, not which model is most capable.

The AI coding agent space has four production-grade options for automation teams in 2026. They differ less in raw model capability than in surface area — where the agent lives, what context it can read, and what surrounding workflow it drops into. Picking the wrong surface is more disruptive than picking a slightly weaker model.

Claude Code (claude.com/product/claude-code) is terminal-first and MCP-native. It reads the full repository on invocation and respects existing project conventions when given example tests as context. For Playwright teams working command-line-first — running against a local browser, operating inside a CI container, or chaining with other MCP servers — this is the natural fit. Selector quality is consistently strong when the agent receives one or two existing tests as examples before the new scenario is described.

Cursor (cursor.com) is a VS Code fork with a fully integrated AI composer. Background cloud agents handle longer multi-file tasks while the Composer panel covers inline edits. For teams who prefer IDE-side authoring over a terminal session, Cursor reduces context-switching with no separate tool to install. The trade-off is that it is an IDE migration rather than an add-on to an existing setup.

GitHub Copilot Coding Agent (GA since September 2025) takes the assigned-issue-to-draft-PR approach: assign a GitHub issue to Copilot, and it opens a pull request with the test authored and linked to the original issue. It reads repository conventions during the run. The Copilot CLI agent (public preview as of May 2026) adds a terminal surface for teams who need to operate outside the GitHub Actions boundary.

Devin (cognition.ai/devin) is the autonomous end of the spectrum: it takes a ticket and produces a complete PR with tests, running Devin 2.0 with code-search and auto-indexed wiki capabilities. It is not a pair-programming assistant but a delegated agent. Pricing starts at $500/month and Cognition raised at a $25B valuation in April 2026, signalling the company is not slowing down. Best fit: well-funded teams willing to delegate end-to-end on well-defined, scoped tickets.

Agent	Surface	Repo-context handling	Selector quality	Best fit
Claude Code	Terminal / MCP-native	Reads full repo on invocation; respects existing patterns when given examples	Strong when primed with existing tests	Terminal-comfortable teams, MCP-pipeline workflows
Cursor	IDE (VS Code fork)	Composer reads open files + repo; background agents for larger tasks	Good; improves quickly with inline corrections	Teams wanting AI pair-programming in a familiar editor
GitHub Copilot Coding Agent	GitHub Actions + CLI preview	Reads repo conventions on issue assignment; matches established patterns	Matches repo patterns; limited async-timing handling	Teams on GitHub Enterprise with established test conventions
Devin	Autonomous agent (SaaS)	Full repo + auto-indexed wiki; Devin 2.0 code search	Variable; requires scoped task framing for best results	Teams delegating end-to-end on well-defined tickets

AI coding agents for test authoring, May 2026

Where they reliably win

Boilerplate-heavy, pattern-consistent authoring tasks — the shape of work where AI is net-positive every time.

After six months of daily use, four categories of authoring task stand out as reliably net-positive regardless of which agent you are using:

Selector suggestions from screenshots or HTML. Give the agent a page screenshot or raw DOM fragment, ask it to generate Playwright locators with a fallback hierarchy, and the result typically arrives faster and covers more fallbacks than hand-written locators would. This is the narrowest, most reliable win in the entire AI test-authoring space.

Boilerplate scaffolding — fixtures, page objects, beforeEach setup. When the repository already has established examples, the agent matches the pattern with high fidelity. A new feature means new fixtures and page-object additions; this is the authoring cost that AI eliminates most consistently.

Framework translation. Converting an existing Cypress test to Playwright (or the reverse) is mechanical enough that agents handle it almost perfectly. The semantic content of the test transfers; only the API mapping changes, and that mapping is well-represented in training data.

Convention-matching across a large suite. Given two or three existing tests as examples plus a description of a new scenario, the agent produces a fourth test that matches naming, structure, and assertion style. This is the "house style" benefit — it takes months for a new team member to absorb, and seconds for the agent.
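The "fallback hierarchy" mentioned under selector suggestions can be sketched as a small ranking function. This is a minimal illustration, not an official ranking: the `ElementInfo` shape, the `locatorCandidates` helper, and the ordering (test id, then role plus accessible name, then text, then raw CSS) are all assumptions made for the example. The emitted strings use Playwright's `getByTestId`, `getByRole`, `getByText`, and `locator` methods.

```typescript
// Hypothetical description of one element, e.g. extracted from a DOM fragment.
interface ElementInfo {
  testId?: string;
  role?: string;
  accessibleName?: string;
  text?: string;
  cssPath: string; // structural CSS path, always available but the most brittle
}

// Returns Playwright locator expressions as strings, most resilient first.
// The ordering is an assumption for illustration; teams may rank differently.
function locatorCandidates(el: ElementInfo): string[] {
  const candidates: string[] = [];
  if (el.testId) {
    candidates.push(`page.getByTestId(${JSON.stringify(el.testId)})`);
  }
  if (el.role && el.accessibleName) {
    candidates.push(
      `page.getByRole(${JSON.stringify(el.role)}, { name: ${JSON.stringify(el.accessibleName)} })`,
    );
  }
  if (el.text) {
    candidates.push(`page.getByText(${JSON.stringify(el.text)})`);
  }
  // Last resort: tied to the current DOM layout, so it breaks on re-structuring.
  candidates.push(`page.locator(${JSON.stringify(el.cssPath)})`);
  return candidates;
}

// Example element: a checkout button with a test id and an accessible name.
const submitButton: ElementInfo = {
  testId: "checkout-submit",
  role: "button",
  accessibleName: "Place order",
  cssPath: "form > div:nth-child(3) > button",
};

console.log(locatorCandidates(submitButton));
```

When reviewing agent output, the useful question is whether the first candidate the agent chose is the most resilient one available, or merely the first one that matched the current snapshot.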

Where they reliably fail

Async timing, business-logic assertions, and shifting DOMs — three failure categories that compound each other.

Async timing logic is the most common failure category. Agents add `page.waitForTimeout(2000)` or similar arbitrary delays where the correct solution is an explicit wait condition: `await expect(locator).toBeVisible()`, `locator.waitFor()`, the older `page.waitForSelector`, or an event-based wait. A test built on a fixed delay passes in fast environments and becomes flaky under load or in CI containers with slower rendering. Debugging that flake often costs more time than writing the test by hand would have taken.
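The difference between a fixed delay and an explicit wait condition can be shown without a browser. The sketch below is framework-free: `waitUntil` and `simulateRender` are hypothetical helpers written for this example, not Playwright APIs. In a real Playwright test the same idea is what a web-first assertion such as `await expect(locator).toBeVisible()` provides.

```typescript
// Hypothetical helper: poll a condition until it holds or the deadline passes.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  options: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<void> {
  const timeoutMs = options.timeoutMs ?? 5000;
  const intervalMs = options.intervalMs ?? 50;
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await condition()) return; // returns as soon as the condition is true
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Simulated element that becomes visible after a delay the test does not control.
function simulateRender(delayMs: number): { isVisible: () => boolean } {
  let visible = false;
  setTimeout(() => {
    visible = true;
  }, delayMs);
  return { isVisible: () => visible };
}

async function demo(): Promise<void> {
  // A fixed sleep would have to guess this number; the condition-based wait does not.
  const element = simulateRender(120);
  await waitUntil(element.isVisible, { timeoutMs: 2000, intervalMs: 10 });
  console.log("visible:", element.isVisible());
}

demo();
```

The condition-based version finishes shortly after the element appears and fails with a clear error when it never does, whereas a fixed delay is either too short on a slow runner or wasted time on a fast one.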

Business-logic assertions are the deeper problem. The agent can produce a test that calls every API endpoint and asserts every response code. It cannot tell you what data the response should contain, because that requires domain knowledge about what your application is supposed to do. A test that verifies the checkout flow makes an API call is not the same as a test that verifies the correct price was charged. Agents produce the former reliably; only humans can specify the latter.
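A small, self-contained sketch of the checkout example makes the gap concrete. Everything here is hypothetical: the `buggyCheckout` function, the cart contents, and the 10% discount rule are invented for illustration. The handler applies the discount with the wrong sign, yet still returns a 200.

```typescript
interface LineItem {
  priceCents: number;
  qty: number;
}

interface CheckoutResponse {
  status: number;
  totalCents: number;
}

// Hypothetical handler with a deliberate bug: the discount is added, not subtracted.
function buggyCheckout(items: LineItem[], discountPercent: number): CheckoutResponse {
  const subtotal = items.reduce((sum, item) => sum + item.priceCents * item.qty, 0);
  const total = Math.round(subtotal * (1 + discountPercent / 100)); // should be (1 - ...)
  return { status: 200, totalCents: total };
}

const cart: LineItem[] = [
  { priceCents: 1999, qty: 2 },
  { priceCents: 500, qty: 1 },
];
const response = buggyCheckout(cart, 10);

// The kind of assertion an agent produces reliably: the call succeeded.
const statusCheckPasses = response.status === 200;

// The assertion only a person with domain knowledge can specify:
// subtotal 4498 cents, 10% off, rounded to 4048 cents.
const expectedTotalCents = 4048;
const priceCheckPasses = response.totalCents === expectedTotalCents;

console.log({ statusCheckPasses, priceCheckPasses });
```

The status check stays green against the buggy handler; only the price check, whose expected value comes from the business rule rather than from the code under test, catches the overcharge.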

Complex selectors on shifting DOMs are the third failure category. Single-page application components that re-render on state change break agent-generated selectors faster than they break human-written ones, because agents optimise for the current DOM snapshot rather than for selector resilience over time. A `data-testid` attribute added to the element is often the one-line fix, but the agent has to know that the attribute exists.

AI scripting stops being net positive the moment the test starts flaking. Time saved authoring is time spent later debugging — and the debugging cost compounds.

The honest workflow

One discipline that separates net-positive adoption from a growing debt pile: verify every assertion manually before the test joins the suite.

The failure mode that costs teams the most is not a failing test — it is a test that always passes because it does not actually verify anything. An AI agent can produce `expect(result).toBe(result)`, a tautology that is syntactically valid and permanently green. The most important review step on every AI-generated test is not "does this run?" but "can I break the code and make this test fail?"

If you cannot write a simple code change that causes the generated test to fail, the test is not testing. This applies to assertions, selectors, and fixture setup. The five minutes spent validating that a generated test is genuinely sensitive to the thing it claims to cover is the highest-leverage human contribution in the AI-assisted authoring workflow.
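The "break the code and see if the test fails" check can be sketched in a few lines. This is a toy illustration under stated assumptions: `add`, `brokenAdd`, and the `isSensitive` helper are hypothetical names, and a test is modelled as a function that returns `true` when it passes.

```typescript
type Impl = (a: number, b: number) => number;
type TestFn = (impl: Impl) => boolean; // true means the test passes

const add: Impl = (a, b) => a + b;
const brokenAdd: Impl = (a, b) => a - b; // deliberate one-character mutation

// Tautology: compares the output with itself, so it can never fail.
const tautologicalTest: TestFn = (impl) => impl(2, 3) === impl(2, 3);

// Real assertion: compares the output with an independently known value.
const realTest: TestFn = (impl) => impl(2, 3) === 5;

// A test is sensitive if it passes on the correct code and fails on the mutant.
function isSensitive(test: TestFn, good: Impl, mutant: Impl): boolean {
  return test(good) && !test(mutant);
}

console.log("tautology sensitive:", isSensitive(tautologicalTest, add, brokenAdd)); // false
console.log("real test sensitive:", isSensitive(realTest, add, brokenAdd)); // true
```

In practice the "mutant" is a temporary local edit to the application code rather than a second function, but the review question is the same: does the generated test go red when the behaviour it claims to cover is broken?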

Warning

Generated tests can pass while testing nothing. `assertEquals(produce(x), produce(x))` always passes. The first review pass on every AI-generated test is "does this assertion actually verify behaviour, or just confirm the happy path runs?" If you cannot break the code and make the test fail, the test is not testing.