AI-driven refactoring of test suites

9 min read · Reviewed May 2026 · refactoring

The question most teams reach after the first wave of AI-assisted authoring is: can we use the same tools to improve what is already there? The answer is genuinely mixed. AI coding agents handle a specific set of test-suite refactors cleanly — selector hygiene, helper extraction, naming normalisation — and a different specific set badly: page-object extraction, fixture restructuring, lifecycle-order changes. Knowing which is which before you start is the difference between a one-day win and a three-day debugging session.

DIFFICULTY: intermediate

YOU'LL LEARN: Which test-suite refactors AI handles cleanly, which it breaks subtly, and the review discipline that keeps the process net positive.

The refactor flow

Audit, plan, convert, review — in that order, and never mixed into a single PR.

A sustainable AI-assisted refactor follows four steps: the agent audits the suite for refactor candidates, produces a scoped conversion plan, performs the conversion, and the engineer reviews each change file-by-file before merge. The flow breaks when steps are combined — when the agent converts without auditing first, or when mixed-purpose changes land in a single PR.

The critical constraint is that each PR should address one type of refactor. Selector-hygiene changes in one PR. Helper-function extraction in a second. Naming normalisation in a third. Mixed-purpose PRs are harder to review thoroughly, easier to approve without scrutiny, and harder to revert cleanly if something breaks.

Audit → plan → convert → review, in that order
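The plan step can be sketched as a small grouping function. This is a minimal illustration, not a prescribed tool: the `Candidate` and `PlannedPR` types, the `planPRs` function, and the file names are all hypothetical, and the three refactor kinds simply mirror the categories named in this article.

```typescript
// Hypothetical types: the kinds mirror the three "clean" refactor categories.
type RefactorKind = 'selector-hygiene' | 'helper-extraction' | 'naming';

interface Candidate {
  file: string;       // test file the audit flagged
  kind: RefactorKind; // the type of refactor it needs
}

interface PlannedPR {
  kind: RefactorKind;
  files: string[];
}

// Turn audit output into a plan with exactly one refactor kind per PR.
function planPRs(candidates: Candidate[]): PlannedPR[] {
  const order: RefactorKind[] = ['selector-hygiene', 'helper-extraction', 'naming'];
  const planned: PlannedPR[] = [];
  for (const kind of order) {
    const files = candidates.filter((c) => c.kind === kind).map((c) => c.file);
    // De-duplicate and sort so the plan is stable between audit runs.
    const unique = files.filter((f, i) => files.indexOf(f) === i).sort();
    if (unique.length > 0) {
      planned.push({ kind, files: unique });
    }
  }
  return planned;
}

// Example audit output (made-up file names).
const audit: Candidate[] = [
  { file: 'checkout.spec.ts', kind: 'selector-hygiene' },
  { file: 'checkout.spec.ts', kind: 'naming' },
  { file: 'login.spec.ts', kind: 'selector-hygiene' },
];

const plan = planPRs(audit);
console.log(plan);
```

A file that needs two kinds of change, like `checkout.spec.ts` here, appears in two separate PRs rather than one mixed PR, which keeps each review and each revert scoped to a single purpose.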

What lands cleanly

Local, mechanical changes with clear inputs and outputs — the refactors where AI is genuinely net-positive.

Selector-hygiene refactors are the most reliable AI win. The pattern is well-defined: identify all `page.locator()` calls using text or CSS-path selectors and replace them with `getByRole`, `getByTestId`, or `getByLabel` equivalents. The agent reads the selector, infers the likely element from context, and produces a more resilient alternative. Ambiguous cases can be flagged for human review rather than converted speculatively.
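The convert-or-flag decision described above might look like the following sketch. The `reviewSelector` function, the `roleHint` parameter (standing in for whatever the agent infers from context), and the heuristics themselves are assumptions for illustration. It only produces suggestion strings and does not call Playwright.

```typescript
type SelectorVerdict =
  | { kind: 'convert'; suggestion: string }
  | { kind: 'flag'; reason: string }
  | { kind: 'keep' };

// Classify the string passed to page.locator(...). Heuristics only.
function reviewSelector(selector: string, roleHint?: string): SelectorVerdict {
  const text = /^text=(.+)$/.exec(selector);
  if (text) {
    // A text selector converts cleanly only when the element's role is known.
    return roleHint
      ? { kind: 'convert', suggestion: `getByRole('${roleHint}', { name: '${text[1]}' })` }
      : { kind: 'flag', reason: 'text selector with no clear role' };
  }
  const testId = /^\[data-testid="([^"]+)"\]$/.exec(selector);
  if (testId) {
    // An explicit data-testid attribute selector maps directly to getByTestId.
    return { kind: 'convert', suggestion: `getByTestId('${testId[1]}')` };
  }
  if (/^[.#]|[ >+~]|:nth-/.test(selector)) {
    // Class, id, and path selectors carry no semantic information to convert from.
    return { kind: 'flag', reason: 'CSS-path selector: a human should pick the role, label, or test id' };
  }
  return { kind: 'keep' };
}

console.log(reviewSelector('text=Add to basket', 'button'));
console.log(reviewSelector('text=Add to basket'));
console.log(reviewSelector('.cart-item-count'));
```

The design choice worth noting is the default: anything the heuristics cannot map with confidence is flagged for review instead of converted speculatively.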

Helper-function extraction, which identifies duplicated setup code across test files and consolidates it into shared fixtures, is similarly mechanical and reliable as long as the extracted code keeps the scope and ordering it had before. The agent finds repeated `beforeEach` blocks, extracts them to a fixture file, and updates imports in every affected test. The change is local, the inputs are clear, and the output is verifiable by a single test run.
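Finding the repeated setup is the mechanical part. A rough sketch, assuming the `beforeEach` bodies have already been pulled out of each file as strings; the `SetupHook` type, the `findSharedSetup` function, and the sample hook bodies are all hypothetical:

```typescript
interface SetupHook {
  file: string; // spec file containing the hook
  body: string; // source text of the beforeEach callback body
}

// Collapse whitespace so formatting differences do not hide duplicates.
function normalise(body: string): string {
  return body.replace(/\s+/g, ' ').trim();
}

// Return hook bodies that appear in two or more files, with the files that share them.
function findSharedSetup(hooks: SetupHook[]): { body: string; files: string[] }[] {
  const groups: { body: string; files: string[] }[] = [];
  for (const hook of hooks) {
    const body = normalise(hook.body);
    let group = groups.find((g) => g.body === body);
    if (!group) {
      group = { body, files: [] };
      groups.push(group);
    }
    if (group.files.indexOf(hook.file) === -1) {
      group.files.push(hook.file);
    }
  }
  return groups.filter((g) => g.files.length >= 2);
}

// Made-up hook bodies: the first two differ only in whitespace.
const hooks: SetupHook[] = [
  { file: 'cart.spec.ts', body: "await page.goto('/shop');\n  await login(page);" },
  { file: 'checkout.spec.ts', body: "await page.goto('/shop'); await login(page);" },
  { file: 'profile.spec.ts', body: "await page.goto('/profile');" },
];

const shared = findSharedSetup(hooks);
console.log(shared);
```

Exact-match grouping is deliberately conservative. Setup blocks that are merely similar stay where they are, so only extraction candidates that are safe to consolidate reach the plan.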

Naming-consistency refactors are the lowest-risk type. The agent makes character-level substitutions to bring test descriptions, variable names, and function names in line with a stated convention. Tests continue to run unchanged; the only risk is a mismatch between the stated naming convention and what the rest of the codebase expects.
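As a sketch of how mechanical this is, the following assumes a camelCase convention for identifiers; both the convention and the `planRenames` helper are illustrative choices, not something the article prescribes. It also surfaces the one thing worth checking before applying renames: a target name that is already taken.

```typescript
// Convert a snake_case identifier to camelCase (the assumed convention).
function toCamelCase(name: string): string {
  return name.replace(/_+([a-zA-Z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

// Build a rename list and report targets that would collide with an existing name.
function planRenames(names: string[]): { renames: [string, string][]; collisions: string[] } {
  const renames: [string, string][] = [];
  const collisions: string[] = [];
  const taken = names.slice();
  for (const name of names) {
    const target = toCamelCase(name);
    if (target === name) continue; // already follows the convention
    if (taken.indexOf(target) !== -1) {
      collisions.push(target); // leave for a human to resolve
    } else {
      renames.push([name, target]);
      taken.push(target);
    }
  }
  return { renames, collisions };
}

const renamePlan = planRenames(['cart_item_count', 'cartItemCount', 'login_page']);
console.log(renamePlan);
```

Here `login_page` is renamed, while `cart_item_count` is reported as a collision because `cartItemCount` already exists in the same scope.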

// ❌ Before — fragile text and CSS selectors
await page.locator('text=Add to basket').click();
await expect(page.locator('.cart-item-count')).toHaveText('1');

// ✓ After: resilient role-based and test-id selectors
// (getByTestId only matches if the element carries data-testid="cart-item-count")
await page.getByRole('button', { name: 'Add to basket' }).click();
await expect(page.getByTestId('cart-item-count')).toHaveText('1');

Selector hygiene — before and after AI refactor

What breaks subtly

Page-object extraction and fixture restructuring look successful but introduce silent failures — the agent doesn't track dependencies.

Page-object extraction is the refactor most teams attempt first and regret most often. The agent tends to group test code by file location: every selector and interaction in `checkout.spec.ts` goes into a single `CheckoutPage` object. The correct grouping is by domain cohesion: a page object should hold the interactions that belong to one domain, regardless of which test files use them. Grouping by file locality frequently produces one large page object where three focused ones would serve better.

Fixture restructuring is more dangerous because the breakage is non-obvious. The agent moves a setup hook from `beforeEach` scope to a describe-level fixture, and the tests still compile and pass when run individually. Then three tests fail in CI because they depended on the hook running in a specific order relative to other setup code, a dependency the agent cannot infer from the code structure alone.

The failure mode in both cases is identical: the agent makes a syntactically valid change that satisfies the stated refactoring goal while breaking a semantic dependency that was implicit in the original code. The original code worked because of ordering, scope, or shared state that was never explicitly documented.

// ❌ Agent refactor: replaces a beforeEach setup hook with describe-level test.use()
// Three tests that depend on auth state being initialised in a specific order now break
import { test } from '@playwright/test';

test.describe('checkout flow', () => {
  // agent-introduced: applied when each test's browser context is created,
  // not at the point in the setup sequence where the old hook ran
  test.use({ storageState: 'auth.json' });

  test('adds item to cart', async ({ page }) => { /* ... */ });
  test('applies discount code', async ({ page }) => { /* fails: expected auth state not available */ });
});

Fixture restructuring — agent output compiles but breaks test ordering
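The ordering dependency is easier to see in a toy model that leaves Playwright out entirely. In this sketch the `State` type and the `resetApp` and `logIn` hooks are hypothetical stand-ins for real setup code. The only point is that the same two hooks, run in a different order, leave the test with different preconditions.

```typescript
// Toy model of a test run: hooks mutate shared state in order, then the test checks it.
interface State {
  loggedIn: boolean;
  cart: string[];
}

type Hook = (state: State) => void;

// Hypothetical hooks standing in for real setup code.
const resetApp: Hook = (s) => { s.loggedIn = false; s.cart = []; };
const logIn: Hook = (s) => { s.loggedIn = true; };

// Run the hooks in the given order and report whether the test's precondition holds.
function discountTestPasses(hooks: Hook[]): boolean {
  const state: State = { loggedIn: false, cart: [] };
  for (const hook of hooks) {
    hook(state);
  }
  return state.loggedIn; // the discount test needs an authenticated session
}

const original = discountTestPasses([resetApp, logIn]);   // order as originally written
const refactored = discountTestPasses([logIn, resetApp]); // login hoisted ahead of the reset

console.log({ original, refactored });
```

Nothing in either hook's source says that `logIn` must run after `resetApp`. That constraint lives only in the order the hooks happened to be registered, which is why a scope-changing refactor can break it without any visible code smell.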

Whole-suite refactors rarely land first try

Vertical slicing, one file at a time with a CI run between each, is the pattern that consistently works.

The pattern that produces working results is vertical rather than horizontal: refactor one test file, run CI, review the changes, merge. Move to the next file. This is slower than asking the agent to convert the entire suite in a single session — and it is the approach that does not produce multi-day debugging sessions.

The temptation is to request a comprehensive sweep: "convert all 200 tests to the new page-object pattern." The agent will produce 200 syntactically valid tests. Thirty to fifty of them will have semantic problems — wrong assertions, implicit state assumptions, timing issues that only manifest under certain conditions — that take hours to diagnose and trace back to the refactor.
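The vertical pattern can be sketched as a loop that stops at the first red CI run. The `refactorFile` and `runCi` callbacks are hypothetical placeholders for however a team invokes its agent and its CI system; the stubbed demo below them exists only to show the control flow.

```typescript
interface SliceResult {
  merged: string[];         // files refactored, reviewed, and merged
  stoppedAt: string | null; // first file whose CI run failed, if any
}

// refactorFile and runCi are hypothetical callbacks supplied by the caller.
async function refactorInSlices(
  files: string[],
  refactorFile: (file: string) => Promise<void>,
  runCi: (file: string) => Promise<boolean>
): Promise<SliceResult> {
  const merged: string[] = [];
  for (const file of files) {
    await refactorFile(file);
    const green = await runCi(file);
    if (!green) {
      // Stop at the first red run so the failure maps to exactly one file.
      return { merged, stoppedAt: file };
    }
    merged.push(file); // human review and merge would happen here in a real flow
  }
  return { merged, stoppedAt: null };
}

// Stubbed demo: pretend the second file fails CI.
const demo = refactorInSlices(
  ['cart.spec.ts', 'checkout.spec.ts', 'profile.spec.ts'],
  async () => {},
  async (file) => file !== 'checkout.spec.ts'
);
demo.then((r) => console.log(r));
```

Because the loop halts on the first failure, any breakage is attributable to a single file's changes instead of being spread across a whole-suite diff.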

// WARNING

Whole-suite migrations rarely land first try. An AI agent that confidently restructures a 200-test suite will produce 200 syntactically valid tests, of which 30–50 will fail in ways that take hours to diagnose. Refactor in vertical slices — one test file at a time, CI run between each — not horizontal sweeps across the whole suite in one PR.