Fixing Flaky Tests — Claude Code for QA

Flaky tests — passing sometimes, failing sometimes, no obvious cause — erode trust in a test suite faster than almost anything else. Engineers start ignoring failures. CI becomes noise. The suite that was supposed to catch regressions stops being taken seriously. Claude Code can investigate flake patterns and propose targeted fixes, but the investigation has to be systematic.

Gather evidence before asking for a fix

The cause of a flake is almost never obvious from a single failure log. You need the pattern across multiple runs. Before opening Claude Code, collect:

Multiple failure logs: three or four separate failures, including ones with different error messages if the flake presents inconsistently
Pass/fail rate: how often it fails (15%? 60%?)
Context: does it fail more in CI than locally? At specific times? After specific tests?
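If your CI history can be exported, a small script can turn it into the fail rate and error pattern listed above. The sketch below is only an illustration: the CiRun shape, the summarizeRuns function, and the "first line of the error" signature heuristic are all assumptions, not part of Playwright or any CI tool.

```typescript
// Hypothetical record for one CI run of a single test.
interface CiRun {
  passed: boolean;
  errorMessage?: string; // full error text, possibly with a stack trace
}

interface FlakeSummary {
  totalRuns: number;
  failRate: number; // 0..1
  signatures: Record<string, number>; // error signature -> occurrence count
}

// Compute the fail rate and group failures by the first line of their error.
function summarizeRuns(runs: CiRun[]): FlakeSummary {
  const signatures: Record<string, number> = {};
  let failures = 0;

  for (const run of runs) {
    if (run.passed) continue;
    failures += 1;
    // The first line is a rough signature; stack traces below it vary per run.
    const [firstLine = ''] = (run.errorMessage ?? 'unknown error').split('\n');
    const signature = firstLine.trim();
    signatures[signature] = (signatures[signature] ?? 0) + 1;
  }

  return {
    totalRuns: runs.length,
    failRate: runs.length === 0 ? 0 : failures / runs.length,
    signatures,
  };
}
```

The resulting summary (for example "fails 3 of 15 runs, two distinct signatures") is the kind of context worth pasting into the prompt alongside the raw logs.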

Then structure the investigation:

tests/cart/discount.spec.ts > applies promotional discount fails ~20% of CI runs.
 
Here are three failure logs from this week:
[Log 1 — "expected 'Total: $99' but got 'Total: $0'"]
[Log 2 — "expected 'Total: $99' but got 'Total: $0'"]
[Log 3 — "TimeoutError waiting for locator getByText('PROMO20 applied')"]
 
Read the test file. What patterns do you see across these failures?
Do not propose a fix yet — identify the root cause first.

Common flake root causes Claude spots well

Missing waits before assertions. The most common cause. The test asserts a total immediately after clicking "Apply discount", but the cart total only updates after an async API call resolves. The fix is a retrying assertion such as await expect(page.getByText('Total: $99')).toBeVisible(), which keeps checking until the text appears or the timeout expires, rather than a one-shot check of whatever the page shows at that instant.
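To see why a retrying assertion removes this kind of race, here is a simplified model of the idea. It is not Playwright's implementation; expectEventually and its default timings are hypothetical names and values chosen for the sketch.

```typescript
// Re-run `check` until it stops throwing or `timeoutMs` elapses.
// A one-shot assertion is the same thing with zero retries.
async function expectEventually(
  check: () => void,
  timeoutMs = 5000,
  intervalMs = 50,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      check();
      return; // condition holds, stop waiting immediately
    } catch (err) {
      // Out of time: surface the last assertion error.
      if (Date.now() >= deadline) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  }
}
```

The test finishes as soon as the condition holds, so a fast machine pays no penalty and a slow one gets the full timeout. A fixed sleep gives neither property.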

Test order dependence. A test passes in isolation but fails when run after a specific other test. Claude reads the beforeEach/afterEach blocks, spots that a prior test modifies shared state (a cookie, a localStorage entry, a database record), and identifies the missing cleanup.
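The mechanism can be reproduced without a browser. In the sketch below, a Map stands in for shared state such as localStorage, and the two test cases, the runSuite helper, and the optional cleanup hook are all hypothetical; real suites would do the cleanup in an afterEach block.

```typescript
// Stand-in for state shared between tests (localStorage, a cookie, a DB row).
const sharedStorage = new Map<string, string>();

type TestCase = { name: string; run: () => void };

const appliesPromo: TestCase = {
  name: 'applies promo',
  run: () => {
    sharedStorage.set('promo', 'PROMO20'); // modifies shared state, never cleans up
  },
};

const showsFullPrice: TestCase = {
  name: 'shows full price without promo',
  run: () => {
    if (sharedStorage.has('promo')) throw new Error('leftover promo in storage');
  },
};

// Run tests in the given order; return the names of the ones that failed.
function runSuite(tests: TestCase[], afterEach?: () => void): string[] {
  sharedStorage.clear(); // each suite run starts from a clean slate
  const failures: string[] = [];
  for (const t of tests) {
    try {
      t.run();
    } catch (err) {
      failures.push(t.name);
    } finally {
      afterEach?.(); // the cleanup hook, if one is provided
    }
  }
  return failures;
}
```

Running [showsFullPrice, appliesPromo] passes, the reverse order fails, and the reverse order with a cleanup hook that clears the storage passes again. That "passes alone, fails after a specific test" signature is what to look for in the logs.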

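Animation and transition timing. A button is technically in the DOM but not yet interactable because a CSS transition is still running. Claude spots page.waitForTimeout(500) anti-patterns in your suite and replaces them with condition-based waits, such as locator.waitFor({ state: 'visible' }) or an assertion like toBeEnabled(), so the test proceeds as soon as the element is ready instead of after a fixed delay.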

Short timeouts in slow environments. Your test has a 5-second timeout for an element that takes 3 seconds on a fast machine but 6 seconds on a CI runner. Claude identifies the timing-sensitive assertion and recommends either a longer explicit timeout or a wait condition that is not time-based.

Fixing a confirmed flake

Once the root cause is agreed:

The discount total flake is a race condition — we're asserting before the 
price-update API response lands.
 
In tests/cart/discount.spec.ts:
- Start page.waitForResponse() for the /api/cart/discount endpoint before clicking "Apply", so a fast response cannot be missed
- After the click, await that response
- Then assert the updated total
 
Make the change and explain what you did.

Claude updates the specific assertion, explains the change, and you verify it before committing.
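When you review the change, check the ordering: the wait for the response should be registered before the click that triggers the request, otherwise a fast response can arrive before anyone is listening. The sketch below shows that ordering. ResponseLike, ResponseWaiter, and actAndAwaitResponse are hypothetical names; the two interfaces mirror only the Playwright methods used here (page.waitForResponse with a predicate, response.url(), response.ok()) so the helper can be exercised without a browser.

```typescript
// Minimal shapes mirroring the parts of Playwright's Response and Page used below.
interface ResponseLike {
  url(): string;
  ok(): boolean;
}

interface ResponseWaiter {
  waitForResponse(predicate: (res: ResponseLike) => boolean): Promise<ResponseLike>;
}

// Register the response wait first, then trigger the action, then await both.
async function actAndAwaitResponse(
  page: ResponseWaiter,
  urlFragment: string,
  action: () => Promise<void>,
): Promise<ResponseLike> {
  const [response] = await Promise.all([
    // Evaluated first: the listener exists before the action fires.
    page.waitForResponse((res) => res.url().includes(urlFragment) && res.ok()),
    action(),
  ]);
  return response;
}

// Possible usage inside a Playwright test (assumes `page` and `expect`
// come from @playwright/test and the button is labelled "Apply"):
//   await actAndAwaitResponse(page, '/api/cart/discount', () =>
//     page.getByRole('button', { name: 'Apply' }).click());
//   await expect(page.getByText('Total: $99')).toBeVisible();
```

A real Playwright page should satisfy the ResponseWaiter shape, but treat that as an assumption to confirm against your Playwright version.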

The quarantine workflow

For tests that keep flaking despite attempted fixes, quarantine rather than delete:

We have 8 tests tagged @flaky in our suite that we haven't fixed yet.
Review them and for each one:
1. Classify the likely root cause (race condition / state pollution / timeout / unknown)
2. Estimate difficulty to fix (quick / moderate / needs investigation)
3. Rank by priority: frequency of failure × business importance of the scenario

Claude gives you a prioritised triage list. Fix the high-priority quick wins first. Schedule the harder ones. Delete tests that are low-priority and have never been stable — flaky coverage is worse than no coverage for those scenarios.
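The ranking rule in the prompt (frequency of failure × business importance) is simple enough to sanity-check yourself. The sketch below assumes a 1 to 5 importance scale and breaks ties in favour of easier fixes; the FlakyTest shape, the scale, and the tie-break are illustrative choices, not something Claude or Playwright prescribes.

```typescript
type Difficulty = 'quick' | 'moderate' | 'needs investigation';

interface FlakyTest {
  name: string;
  failRate: number;   // fraction of runs that fail, 0..1
  importance: number; // assumed scale: 1 (minor) to 5 (business critical)
  difficulty: Difficulty;
}

// Sort by failRate × importance, highest first; on ties, easier fixes come first.
function rankFlakyTests(tests: FlakyTest[]): FlakyTest[] {
  const effort: Record<Difficulty, number> = {
    quick: 0,
    moderate: 1,
    'needs investigation': 2,
  };
  const score = (t: FlakyTest): number => t.failRate * t.importance;

  // Copy before sorting so the caller's array is left untouched.
  return [...tests].sort(
    (a, b) => score(b) - score(a) || effort[a.difficulty] - effort[b.difficulty],
  );
}
```

Comparing this ordering with Claude's triage list is a quick way to spot a test whose priority was judged on something other than the two factors you asked for.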

The retry trap

Adding automatic retries to a flaky test (retries: 3 in Playwright config) hides the problem without solving it. A test that passes on the third retry is still a flaky test — it is just failing silently. Retries are acceptable as a short-term quarantine measure while you investigate root cause, never as a permanent fix.
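If you do use retries as a temporary measure, it helps to make them leave evidence. A possible playwright.config.ts fragment, assuming your CI sets the CI environment variable and that two retries is an acceptable stopgap for your suite:

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Temporary: retry only in CI while the root cause is being investigated.
  retries: process.env.CI ? 2 : 0,
  use: {
    // Record a trace when a test is retried, so each flake leaves a log to analyse.
    trace: 'on-first-retry',
  },
});
```

Playwright reports a test that fails and then passes on retry as "flaky" rather than "passed", so the report still shows which tests need attention.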

⚠️ Common Mistakes

Adding retries instead of fixing. Retries mask flakiness; they do not fix it. Use them only as a temporary quarantine measure, never as a permanent solution.
Fixing flakes without verifying across multiple runs. A test that was failing 20% of the time needs at least 10 clean runs to give you reasonable confidence the fix held, and even then an unfixed test would pass ten in a row about 11% of the time. One passing run proves nothing.
Providing only one failure log. Single-log analysis is limited. Three logs with the same error message suggest one consistent failure point, often a race condition. Three logs with different errors can indicate state pollution, though a single race can also surface in more than one way, as in the discount example above. Treat the pattern as the starting hypothesis, then confirm it against the code.
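How many clean runs are enough depends on how often the test used to fail. The sketch below works through the arithmetic under two assumptions: runs are independent, and an ineffective fix would leave the old failure rate unchanged. The function names are hypothetical.

```typescript
// Probability that a still-broken test passes `cleanRuns` times in a row.
function chanceFlakeStaysHidden(failRate: number, cleanRuns: number): number {
  if (!(failRate > 0 && failRate < 1)) {
    throw new RangeError('failRate must be strictly between 0 and 1');
  }
  return Math.pow(1 - failRate, cleanRuns);
}

// Smallest number of consecutive clean runs that pushes that probability
// down to `acceptableRisk` or below.
function cleanRunsNeeded(failRate: number, acceptableRisk: number): number {
  if (!(failRate > 0 && failRate < 1)) {
    throw new RangeError('failRate must be strictly between 0 and 1');
  }
  if (!(acceptableRisk > 0 && acceptableRisk < 1)) {
    throw new RangeError('acceptableRisk must be strictly between 0 and 1');
  }
  return Math.ceil(Math.log(acceptableRisk) / Math.log(1 - failRate));
}
```

For a test that failed 20% of the time, chanceFlakeStaysHidden(0.2, 10) is about 0.107, and cleanRunsNeeded(0.2, 0.05) is 14. Rarer flakes need far more runs to reach the same confidence.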

🎯 Practice Task

Investigate a flaky test in your suite. 20 minutes.

Identify a test that has failed intermittently in CI in the past month — check your CI run history.
Collect at least three distinct failure logs for it.
Ask Claude Code to read the test file and analyse the failure logs. Request a root cause, not a fix.
Evaluate the hypothesis. If it sounds right, ask Claude to apply the targeted fix.
Note: what evidence was most useful to Claude? What would have been harder to spot without reading the code?

The next lesson covers a closely related problem: tests breaking not because of a bug in the test, but because the UI they were testing changed.