Flaky Test Debugging

Diagnose and fix a set of provided flaky Playwright tests: identify root causes (timing, shared state, selector fragility), apply fixes, and document your reasoning.

Role

Automation QA engineer

Difficulty

Intermediate

Time limit

60–90 min

Scenario

You have inherited a Playwright + TypeScript test suite for a web application. The CI pipeline has been repeatedly marking builds as unstable, not because of real application defects, but because several tests pass on some runs and fail on others without any code change. The previous team tagged these tests with @flaky and disabled retries to make the instability visible rather than hide it. Your task is to read each failing test, identify the specific root cause, apply the correct fix, and write a clear diagnosis for each one. You will also produce a short prevention guide addressed to the engineering team so the patterns that caused this flakiness do not recur.

Requirements

1. For each of the five provided flaky tests, state the specific root cause in one sentence: name the failure mode (e.g. 'race condition between a debounced DOM update and the assertion', 'shared mutable storageState fixture across sibling tests') rather than a vague 'timing issue'
2. Apply the correct fix for each test. The fix must address the stated root cause, not mask it: use explicit waits with a specific condition, isolate test state per test, or make assertions deterministic rather than position-dependent
3. Do not introduce blanket retry logic (--retries=N on the command line, retries in playwright.config.ts, or test.describe.configure({ retries: N })) as the primary fix for any individual test; if retries are mentioned, they must be framed as a CI safety net only, not a substitute for root-cause resolution
4. Cover at least three distinct flakiness categories across your five fixes: timing/wait issues, shared state or test ordering, and selector or data non-determinism
5. For each fix, write two to three sentences explaining why the original code was flaky and why your change makes the test deterministic
6. Verify that every test in the suite creates its own data and does not rely on state left by a preceding test, and document any additional isolation changes needed beyond the five targeted fixes
7. Write a one-page flaky-test prevention guide that names the specific anti-patterns found in this suite and the concrete practices the team should adopt; the guide should be written as if addressed to the full engineering team, not just QA
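
Requirement 6 asks that each test create its own data. One possible way to do that, sketched here under assumptions, is to suffix every value a test writes with a random token so that no test asserts on data another test or parallel worker may have written. The helper name `uniqueName` and the `profile-test` prefix are hypothetical and not part of the provided suite.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper: builds a value that is unique per call, so a test never
// asserts on data that another test (or a parallel worker) may have written.
function uniqueName(prefix: string): string {
  // Eight hex characters keep the value readable while making collisions unlikely.
  const suffix = randomUUID().replace(/-/g, "").slice(0, 8);
  return `${prefix}-${suffix}`;
}

// Hypothetical usage inside a test body:
//   const displayName = uniqueName("profile-test");
//   ...fill the display-name field with displayName, save, then assert on displayName
const first = uniqueName("profile-test");
const second = uniqueName("profile-test");
console.log(first, second);
```

Unique values make a stale or foreign value easy to spot in a failure message, but they do not replace per-test accounts or contexts when two tests mutate the same record.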

Starter data

›auth.spec.ts — testUserCanViewOrders: fails intermittently on CI with 'Element not found: [data-testid="order-list"]'; passes consistently on a local developer machine. The test logs in via the UI and immediately asserts that the order list is visible.
›checkout.spec.ts — testTotalUpdatesOnQtyChange: fails roughly 1 in 3 CI runs with a stale value assertion — the computed total shown on screen does not match the expected value. The test clicks the quantity increment button and reads the total text in the very next line.
›product.spec.ts — testSortByPriceAscending: fails intermittently when the first product's price does not match the expected value '£4.99'. The test clicks 'Sort: Price Low–High' and then reads products[0].innerText.
›admin.spec.ts — testBulkDeleteClearsTable: occasionally leaves 1–2 rows in the table after the delete action. The test selects all row checkboxes, clicks the Delete button, and immediately asserts that the table row count is 0.
›user.spec.ts — testProfileUpdatePersists: always passes in isolation but fails when run as part of the full suite. The test updates the logged-in user's display name and asserts the new name is visible. The file shares a loggedInUser storageState fixture with a sibling test that also mutates user profile data.

Expected deliverables

✓A per-test diagnosis document covering all five flaky tests — each entry must include: the root cause (one sentence), the code change applied, and a two-to-three sentence explanation of why the fix makes the test deterministic
✓The corrected test code as a diff or updated file, with inline comments where the original flaky pattern was replaced
✓Evidence of at least three consecutive passing runs for each fixed test — a CI screenshot, a terminal output excerpt, or a local test run log is acceptable
✓A written flaky-test prevention guide (one page or equivalent) addressed to the engineering team, naming the specific anti-patterns found in this suite and the practices that prevent them

Evaluation rubric

Dimension	What reviewers look for
Root-cause identification	Does the candidate name the specific failure mode for each test — race condition, shared mutable fixture, non-deterministic ordering, missing network wait — rather than a blanket 'timing issue'? A diagnosis that says 'it was slow' without naming the mechanism scores poorly.
Correctness of fix	Does each fix target the stated root cause? Are Playwright's built-in waiting mechanisms (expect().toBeVisible(), waitForResponse, waitForURL) used in preference to fixed timeouts? Does the fix for test 3 (sort) make the assertion deterministic rather than just waiting longer?
No masking	Does the candidate avoid using waitForTimeout or blanket --retries as the primary resolution? If retries are mentioned at all, are they framed as a temporary CI signal rather than a fix? Submitting a solution where every test gains a 2-second sleep scores a failing mark on this dimension regardless of how well the rest of the report reads.
Test isolation	Does the fix for test 5 (shared storageState) create independent browser context or fixture state per test rather than reordering tests? Does the candidate check that the other four tests are also free of cross-test data dependencies, or explicitly note that they are already isolated?
Prevention strategy	Does the prevention guide name the specific patterns found in this suite (immediate DOM assertions after async triggers, shared mutable fixtures, positional selectors over deterministic ones) and map each to a concrete team practice? Generic advice ('write stable tests') does not satisfy this dimension.
Clarity of communication	Is each diagnosis entry clear enough that a developer who did not write the original test can understand the failure mode and the fix without follow-up questions? Is the prevention guide written for a mixed audience (not assuming deep Playwright knowledge)?

Sample solution outline

›auth.spec.ts. Root cause: after login, the page loads the orders asynchronously, and on slower CI runners the request is still in flight when the assertion runs, so the list has not rendered yet. Fix: replace the immediate check with `await expect(page.getByTestId('order-list')).toBeVisible()`. This web-first assertion retries until the element is visible or the expect timeout expires, which makes it resilient to variable response times.
›checkout.spec.ts. Root cause: the cart total is recalculated after a debounced JavaScript event; reading the DOM immediately after clicking the increment button captures the pre-update value. Fix: wait for the network request that recalculates the total before asserting, `await Promise.all([page.waitForResponse(r => r.url().includes('/cart/calculate')), incrementButton.click()])`, then check the total with an auto-retrying assertion such as `toHaveText` rather than reading the text once, because the DOM may update slightly after the response arrives.
›product.spec.ts. Root cause: the sort is non-deterministic for products with equal prices; two products share the price '£4.99' and their relative order after sorting is not guaranteed, so the text of products[0] varies between runs. Fix: instead of asserting that products[0] matches a hardcoded value, parse the displayed prices to numbers and assert that every price in the list is greater than or equal to the one before it, `prices.every((p, i) => i === 0 || p >= prices[i - 1])`, making the test correct regardless of tie-breaking order. Comparing the raw strings would be wrong, since '£10.00' sorts before '£4.99' as text.
›admin.spec.ts. Root cause: the Delete button triggers an asynchronous server request; the row-count assertion runs before the server confirms deletion and the DOM re-renders. Fix: wait for the network response before asserting, `await Promise.all([page.waitForResponse(r => r.url().includes('/bulk-delete') && r.status() === 200), deleteButton.click()])`, then assert the empty table with an auto-retrying assertion such as `toHaveCount(0)` on the row locator.
›user.spec.ts. Root cause: the loggedInUser storageState fixture is shared across the tests in the file; a sibling test changes the same user's profile data, leaving the profile in an unexpected state for the next test. Fix: create a fresh BrowserContext per test in a beforeEach hook, `const context = await browser.newContext({ storageState: 'auth.json' })`, so each test starts from a clean, unmodified session. If the display name is persisted server-side for that account, a fresh context alone does not undo the sibling's change, so also give each test its own user or reset the profile during setup.
›Prevention guide key points: (1) always await a specific network response or DOM condition before asserting a value that depends on async work — never read the DOM immediately after a user action that triggers a server call; (2) never share a mutable fixture (storageState, database row, local storage) across sibling tests — each test must own its state; (3) prefer assertions about properties (is the list sorted?) over hardcoded positional values (is item[0] exactly £4.99?); (4) retries are a quarantine mechanism — any test retried more than once in 10 runs should be filed as a defect and fixed within one sprint.
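
The sorted-order check from the product.spec.ts entry above can be sketched as two small pure functions. This is an illustration under assumptions rather than the suite's actual code: the helper names `parsePrice` and `isSortedAscending` are hypothetical, and prices are assumed to be displayed with a currency symbol and "." as the decimal separator.

```typescript
// Parses a displayed price such as "£4.99" or "£1,204.50" into a number.
// Assumes a currency-symbol prefix and "." as the decimal separator.
function parsePrice(text: string): number {
  const cleaned = text.replace(/[^0-9.]/g, "");
  const value = Number(cleaned);
  if (cleaned === "" || Number.isNaN(value)) {
    throw new Error(`Cannot parse price from "${text}"`);
  }
  return value;
}

// True when every price is greater than or equal to the one before it.
// Equal neighbours are allowed, so tie-breaking order cannot fail the check.
function isSortedAscending(prices: number[]): boolean {
  return prices.every((p, i) => i === 0 || p >= prices[i - 1]);
}

// Hypothetical texts as they might be read from the sorted product list.
const displayed = ["£4.99", "£4.99", "£7.50", "£12.00"];
console.log(isSortedAscending(displayed.map(parsePrice)));
```

In a Playwright test the texts could come from `locator.allTextContents()` on the price elements, read only after the sorted list has rendered.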

Common mistakes

Setting `retries: 2` in playwright.config.ts (or passing `--retries=2` on the command line) and closing the ticket. Retries suppress the failure signal on the CI dashboard but do nothing to fix the race condition; the test still consumes up to 3× the CI time on each flaky run and the root cause remains
Inserting `await page.waitForTimeout(2000)` before every assertion. A fixed sleep that passes on a fast developer machine will fail on a slow CI runner when response times spike; the correct replacement is waiting for a specific, observable condition such as a network response or a DOM state
Diagnosing test 3 (sort by price) as a timing issue and adding a wait for the sort animation to finish — the root cause is non-deterministic ordering among equal-price items, not timing; a longer wait does not make the positional assertion deterministic and the test will still fail for the same data reason
Fixing test 5 (shared storageState) by reordering the tests so the mutating test runs last — test order is brittle and can change when files are added or parallel workers re-sequence execution; the only robust fix is making each test independent through per-test context isolation
Writing a prevention guide that lists generic rules without connecting them to the specific failures found — 'use explicit waits' is not actionable unless it names the async patterns (debounced events, server-side recalculation, SPA navigation) that demand them in this particular codebase
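
As a contrast to the fixed sleep described above, the following sketch waits for a specific network response tied to the click. It is typed against minimal structural interfaces (`PageLike`, `ClickableLike`, `ResponseLike`, all hypothetical) so that it compiles without Playwright installed; a real Playwright `Page` and `Locator` should satisfy them, since `page.waitForResponse` accepts a predicate and `locator.click()` returns a promise. The helper name `clickAndAwaitResponse` is also an assumption.

```typescript
// Minimal structural types so the sketch compiles without Playwright installed.
interface ResponseLike {
  url(): string;
  status(): number;
}
interface PageLike {
  waitForResponse(predicate: (response: ResponseLike) => boolean): Promise<ResponseLike>;
}
interface ClickableLike {
  click(): Promise<void>;
}

// Hypothetical helper: performs the click and resolves once a matching
// 200 response has arrived, instead of sleeping for a fixed time.
async function clickAndAwaitResponse(
  page: PageLike,
  target: ClickableLike,
  urlFragment: string,
): Promise<ResponseLike> {
  // Register the listener BEFORE clicking so a fast response cannot be missed.
  const responsePromise = page.waitForResponse(
    (response) => response.url().includes(urlFragment) && response.status() === 200,
  );
  await target.click();
  return responsePromise;
}
```

A test might call `await clickAndAwaitResponse(page, deleteButton, '/bulk-delete')` and then use an auto-retrying assertion on the table, because the DOM can still re-render after the response arrives.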

Submission checklist

A diagnosis entry for each of the five flaky tests: root cause stated in one sentence, fix applied, and a two-to-three sentence explanation
Fixed test code submitted as a diff or updated file with inline comments marking the original flaky pattern
Evidence of at least three consecutive passing runs for each fixed test (CI log, terminal output, or screenshot)
No waitForTimeout or blanket retry count used as the primary fix for any of the five tests
At least three distinct flakiness categories addressed across the five fixes: timing/wait, shared state, and non-deterministic data or ordering
A written prevention guide addressed to the engineering team that names the specific anti-patterns found in this suite

Extension ideas

+Set up a flaky-test quarantine GitHub Actions workflow: if a test fails more than twice in the last five runs on the main branch, automatically add a @flaky tag to the test and open a GitHub issue with the failure logs and a link to the CI run — making the flakiness visible without blocking the build
+Add a custom Playwright reporter that writes per-test pass/fail results to a JSON artefact after each CI run and flags any test whose 30-run pass rate drops below 95% in a weekly summary — giving the team a data-driven signal before a flaky test degrades into a persistent blocker
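
The two thresholds in these ideas reduce to simple arithmetic over a test's recent outcomes. The sketch below shows only that decision logic, assuming a hypothetical `Outcome` history ordered oldest to newest; how the history is collected (for example from a reporter's JSON artefact) is left open, and none of the function names come from Playwright.

```typescript
type Outcome = "passed" | "failed";

// Fraction of passing runs within the most recent `window` outcomes.
// `history` is ordered oldest to newest; an empty history counts as fully passing.
function passRate(history: Outcome[], window = 30): number {
  const recent = history.slice(-window);
  if (recent.length === 0) return 1;
  const passed = recent.filter((o) => o === "passed").length;
  return passed / recent.length;
}

// Quarantine rule from the first idea: more than two failures in the last five runs.
function shouldQuarantine(history: Outcome[]): boolean {
  const failures = history.slice(-5).filter((o) => o === "failed").length;
  return failures > 2;
}

// Weekly-summary rule from the second idea: 30-run pass rate below 95%.
function isBelowThreshold(history: Outcome[], threshold = 0.95): boolean {
  return passRate(history, 30) < threshold;
}
```

With 30 recorded runs, two failures give a pass rate of about 93% and trip the weekly flag, while a single failure (about 97%) does not.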