Generating Playwright Tests from Natural Language

This is the workflow most teams adopt Playwright MCP for: you describe a test in plain English, the assistant drives the browser to verify that the flow works, and it then emits Playwright TypeScript code you can paste into your suite. The key detail is that Claude runs the flow first, so the generated code references locators that actually exist on the page rather than ones guessed from training data. This lesson covers the prompts that produce mergeable code, the iterative refinement loop, and the review discipline that keeps generated tests from rotting your suite.

The output quality scales with the precision of your prompt. "Write a login test" yields a generic snippet. The structured prompt below yields a test that compiles, runs, and follows your team's conventions. The difference is mostly in what you say up front.

The four-step loop

  1. You describe the test you want — URL, steps, success oracle, conventions.
  2. Claude drives the browser through the flow, verifying it works in your real app.
  3. Claude outputs the equivalent Playwright TypeScript code, using the locators it actually used.
  4. You paste the code into your test suite, run it, review the diff, and harden anything that's brittle.

The fact that the assistant verifies the flow before writing code is what separates this from generic LLM code generation. The locators came from a real snapshot. The waits came from real timing. The assertions came from text the assistant actually saw. That grounding is the whole reason this works.

A prompt that produces mergeable code

Generate a Playwright test for this scenario:
 
- Visit https://demo.myshop.com/login
- Log in with user "demo@test.com" and password "demo123"
- Verify we land on /dashboard
- Verify the welcome message contains "Demo User"
 
Conventions:
- Playwright Test framework, TypeScript
- Page Object Model with a LoginPage class
- Prefer getByRole, getByLabel, getByText (in that order). No CSS selectors unless nothing else exists.
- No waitForTimeout. Use web-first assertions for synchronisation.
- Read credentials from process.env (E2E_USER, E2E_PASSWORD).

The assistant runs the flow, then emits something like:

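import { test, expect, Page } from '@playwright/test';

// Page object for the login screen: navigation and actions only, no assertions.
class LoginPage {
  constructor(private readonly page: Page) {}

  async navigate() {
    // Relative path: assumes baseURL points at https://demo.myshop.com in playwright.config.ts.
    await this.page.goto('/login');
  }

  async login(email: string, password: string) {
    await this.page.getByLabel('Email').fill(email);
    await this.page.getByLabel('Password').fill(password);
    await this.page.getByRole('button', { name: 'Sign in' }).click();
  }
}

test('user can log in successfully', async ({ page }) => {
  const loginPage = new LoginPage(page);
  await loginPage.navigate();
  // Env vars take precedence. The fallbacks are the public demo credentials
  // from the prompt; remove them for any account that is not public.
  await loginPage.login(
    process.env.E2E_USER ?? 'demo@test.com',
    process.env.E2E_PASSWORD ?? 'demo123',
  );

  // Web-first assertions retry until they pass or time out, so no fixed waits are needed.
  await expect(page).toHaveURL(/\/dashboard/);
  await expect(page.getByText(/Demo User/)).toBeVisible();
});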

Notice what the prompt got you: a real LoginPage class, role- and label-based locators, web-first assertions, and process.env for credentials. None of it appeared by accident: each feature is the direct consequence of a constraint you stated.

The dials that move quality the most

  • Locator preference. "Use getByRole, getByLabel, and getByText, no CSS selectors" eliminates the brittle [class*="btn-primary"] patterns the model otherwise reaches for. You're trading a sentence in the prompt for a much cleaner test long-term.
  • Synchronisation strategy. "No waitForTimeout. Use web-first assertions." removes the single biggest source of flake from generated code in one stroke. Be explicit about what to wait for, never how long.
  • Project conventions. "Page Object Model" / "functional helpers" / "flat tests, no abstractions yet" anchors the structure to whatever your team has already standardised on. See the Playwright with TypeScript course's POM lesson for the pattern this course assumes.
  • Test data. "Read credentials from process.env" / "use the test fixtures in tests/fixtures/" prevents the model from hard-coding production-looking creds. Specify the source of truth and it follows.
  • One example file. Pasting in a single existing test from your repo as "match this style" anchors imports, fixture usage, and naming far better than any list of rules.
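
One way to keep these dials consistent across a team is to assemble the prompt from a small template rather than retyping it. The sketch below is only an illustration: TestPromptSpec, defaultConventions, and buildTestPrompt are hypothetical names, not part of Playwright or Playwright MCP, and the convention strings are copied from the prompt earlier in this lesson.

```typescript
// Hypothetical shape for a reusable test-generation prompt; field names are illustrative.
interface TestPromptSpec {
  url: string;
  steps: string[];        // actions, in order
  oracle: string[];       // conditions that must hold for the test to pass
  conventions: string[];  // locator, synchronisation, structure, and data rules
  exampleFile?: string;   // optional existing test pasted in as a style anchor
}

// Conventions taken from the prompt earlier in this lesson.
const defaultConventions: string[] = [
  'Playwright Test framework, TypeScript',
  'Prefer getByRole, getByLabel, getByText (in that order). No CSS selectors unless nothing else exists.',
  'No waitForTimeout. Use web-first assertions for synchronisation.',
  'Read credentials from process.env (E2E_USER, E2E_PASSWORD).',
];

// Renders the spec as plain text in the same layout as the hand-written prompt.
function buildTestPrompt(spec: TestPromptSpec): string {
  const bullets = (items: string[]) => items.map((item) => `- ${item}`).join('\n');
  const sections = [
    'Generate a Playwright test for this scenario:',
    bullets([`Visit ${spec.url}`, ...spec.steps, ...spec.oracle.map((o) => `Verify ${o}`)]),
    'Conventions:',
    bullets(spec.conventions),
  ];
  if (spec.exampleFile) {
    sections.push('Match the style of this existing test:', spec.exampleFile);
  }
  return sections.join('\n\n');
}

const loginPrompt = buildTestPrompt({
  url: 'https://demo.myshop.com/login',
  steps: ['Log in with the E2E_USER / E2E_PASSWORD credentials'],
  oracle: ['we land on /dashboard', 'the welcome message contains "Demo User"'],
  conventions: [...defaultConventions, 'Page Object Model with a LoginPage class'],
});
```

The template adds nothing the hand-written prompt lacks; its only purpose is to make it harder to forget a dial when the next person writes a prompt.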

Iterative refinement is the norm

The first draft is almost never the merged version. A good follow-up:

Refactor the test:
- Extract the dashboard assertions into a verifyDashboard() method on a new DashboardPage class.
- Add an afterEach hook that calls page.context().clearCookies().
- Move the credential constants into a fixtures/users.ts file and import from there.

The assistant restructures, the diff is small, you review and merge. Treat each pass like a code review with a fast collaborator: state what you want changed, expect a tight response, repeat until clean.
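
As a rough idea of where the third request might land, here is a minimal sketch of a fixtures/users.ts. It assumes the credentials come from the E2E_USER and E2E_PASSWORD variables named in the original prompt rather than from hard-coded constants; TestUser and loadUser are hypothetical names, and the file the assistant actually produces may look different.

```typescript
// fixtures/users.ts (sketch). The names below are illustrative assumptions.
export interface TestUser {
  email: string;
  password: string;
}

// Minimal structural type so the function can accept process.env or a plain object.
type Env = Record<string, string | undefined>;

// Reads E2E_USER / E2E_PASSWORD from the given environment and fails fast
// when either is missing, so no credential needs to be hard-coded in a test.
export function loadUser(env: Env): TestUser {
  const email = env['E2E_USER'];
  const password = env['E2E_PASSWORD'];
  if (!email || !password) {
    throw new Error('Set E2E_USER and E2E_PASSWORD before running the suite');
  }
  return { email, password };
}
```

A spec file would then call loadUser(process.env) once and pass the result to the page object. Failing fast on a missing variable gives a clearer error than a login that silently submits an empty form.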

What to look for in review

Generated tests are drafts, not finished work. Three things to check on every paste:

  • Does the assertion really test what you wanted? "Welcome" is visible — is that the success state, or just a string that happens to render on a half-broken page? Tighten the oracle until a real failure would actually fail the test.
  • Are there any waitForTimeout calls hiding in here? Replace them with await expect(...).toBeVisible() or await page.waitForResponse(...). A fixed wait is either longer than needed, which slows the suite, or shorter than needed, which makes the test flaky.
  • Are the locators stable across data changes? A getByText("Order #12345") works on the run that created order 12345 and breaks on every later run. Parameterise the value or pivot to a stable test id via getByTestId.

A two-minute review converts an AI draft into a real test. Skipping it converts your suite into technical debt.
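
Part of this checklist can be approximated mechanically before the human review. The sketch below is a hypothetical helper, not a Playwright feature: reviewGeneratedTest and its rules are made-up names, and the regular expressions are rough heuristics that will miss some cases and flag some harmless ones. The first check in the list (whether the assertion tests what you meant) still needs a person.

```typescript
// Hypothetical review helper: flags checklist patterns in generated test source.
interface ReviewFinding {
  line: number;  // 1-based line number in the source
  rule: string;
  text: string;  // the offending line, trimmed
}

// Each rule is a plain regular expression; these are heuristics, not a parser.
const reviewRules: { rule: string; pattern: RegExp }[] = [
  { rule: 'fixed wait', pattern: /\bwaitForTimeout\s*\(/ },
  { rule: 'raw selector via locator()', pattern: /\.locator\(\s*['"`]/ },
  { rule: 'data-dependent text locator', pattern: /getByText\(\s*['"`][^'"`]*\d{3,}/ },
];

// Scans the source line by line and records every rule that matches.
function reviewGeneratedTest(source: string): ReviewFinding[] {
  const findings: ReviewFinding[] = [];
  source.split('\n').forEach((text, index) => {
    for (const { rule, pattern } of reviewRules) {
      if (pattern.test(text)) {
        findings.push({ line: index + 1, rule, text: text.trim() });
      }
    }
  });
  return findings;
}
```

Running something like this over a pasted draft gives a short list of lines to look at first; an empty list does not mean the test is good, only that the obvious patterns are absent.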

How this builds on your existing course knowledge

Everything you learned in the Playwright with TypeScript course still applies — the locator strategy, the fixtures, the POM patterns. AI-generated tests are Playwright tests. The course remains the source of truth for what good looks like; this lesson just speeds up authoring the first draft.

⚠️ Common mistakes

  • Pasting AI-generated tests without running them locally. Sometimes the assistant misremembers an import or assumes a fixture you don't have. A 30-second local run catches this immediately; a CI failure on the next push catches it late, after the broken commit is already on the branch. Run before you commit, every time.
  • Letting the assistant invent test data. "Use realistic test data" in a prompt produces emails, names, and prices the model fabricated. They look right but they don't exist in your seeded test database. Always specify a known fixture user, or have the test read process.env values that your CI secrets populate at run time.
  • Generating one mega-test that covers ten unrelated assertions. AI tends to over-pack a single test() block when you describe a whole feature. Ask explicitly for one test per behaviour, with shared setup in a fixture. Small tests fail with sharp signal; big tests fail vaguely.

🎯 Practice task

Generate a real test, harden it, and merge it. 30 minutes.

  1. Pick a flow on your staging app that doesn't already have a Playwright test — something boring is fine: log in, change a setting, log out.
  2. Write a precise prompt using the structure above (URL, steps, oracle, conventions). Include "prefer getByRole/getByLabel/getByText, no waitForTimeout, POM with one Page class per page touched".
  3. Run the prompt. Read the tool calls Claude makes — verify it actually drove the flow before generating code. Save the output.
  4. Paste into tests/ai-generated/ in your repo. Run npx playwright test --headed. Diff the result against your team's existing tests — does it match the style? If not, send a follow-up prompt to refactor toward your conventions.
  5. Stretch: parameterise the test with a for loop (or forEach) over a fixture array. Playwright Test has no Jest-style test.each, so a plain loop around test() is the usual approach. Ask the assistant: "Refactor to run the same flow for three different user roles (admin, editor, viewer). Read the role list from tests/fixtures/users.ts." Confirm the refactor still passes for all three.
  6. Once stable across three local runs, open a PR. In the description, note that the test was AI-generated and reviewed — useful signal for reviewers and for your future-self when looking at suite history.
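
For the stretch step, the role refactor might be shaped roughly as follows. This is a sketch under assumptions: the RoleCase fields, the environment variable names, and the landing paths are all invented for illustration and would come from your own tests/fixtures/users.ts. The Playwright loop itself is shown in comments so the data part can be read on its own.

```typescript
// Hypothetical contents of tests/fixtures/users.ts; env var names and paths are made up.
type Role = 'admin' | 'editor' | 'viewer';

interface RoleCase {
  role: Role;
  userEnvVar: string;   // env variable assumed to hold this role's login email
  landingPath: string;  // where this role is assumed to land after login
}

const roleCases: RoleCase[] = [
  { role: 'admin', userEnvVar: 'E2E_ADMIN_USER', landingPath: '/admin' },
  { role: 'editor', userEnvVar: 'E2E_EDITOR_USER', landingPath: '/drafts' },
  { role: 'viewer', userEnvVar: 'E2E_VIEWER_USER', landingPath: '/dashboard' },
];

// Test titles need to be unique within a file, so the role goes into the title.
function titleFor(c: RoleCase): string {
  return `${c.role} can log in and lands on ${c.landingPath}`;
}

// In the spec file, a plain loop registers one test per case:
//
//   for (const c of roleCases) {
//     test(titleFor(c), async ({ page }) => {
//       // log in with process.env[c.userEnvVar], then:
//       await expect(page).toHaveURL(new RegExp(c.landingPath));
//     });
//   }
```

Keeping the cases in a typed array means adding a fourth role is a one-line change, and each role shows up as its own named test in the report.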

The next lesson handles the inverse direction: instead of describing the flow, you demonstrate it in the browser and the assistant transcribes it.