Flaky Test Detection and Analysis


A flaky test is one that passes sometimes and fails sometimes with no code change between runs. In a mature suite, flakes are among the biggest productivity killers: they erode trust ("just re-run it"), waste CI minutes, and mask real bugs by making everyone numb to red builds. AI doesn't make flakes go away, but it can dramatically shorten the loop from "this test is flaky" to "here's the actual fix." The traditional approach, in which an engineer spends an afternoon debugging each one, is no longer the only option.

What flake detection actually means

Detection is the easy part: if a test passes and fails on the same commit, it's flaky. CI systems and dedicated tools track this automatically. The hard part is the next step — figuring out why it's flaky and what to change.
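The detection rule above is simple enough to sketch directly. The following is a minimal illustration, assuming CI results are available as plain records; the `TestRunRecord` shape and the `findFlakyTests` name are hypothetical and not taken from any particular CI tool.

```typescript
// Hypothetical shape of one CI test result; field names are assumptions.
interface TestRunRecord {
  testId: string;
  commitSha: string;
  passed: boolean;
}

interface FlakeSummary {
  testId: string;
  commitSha: string;
  passes: number;
  failures: number;
}

// Flags a test when the same commit has at least one pass and one failure.
function findFlakyTests(records: TestRunRecord[]): FlakeSummary[] {
  const byTestAndCommit = new Map<string, FlakeSummary>();
  for (const record of records) {
    const key = `${record.testId}@${record.commitSha}`;
    let summary = byTestAndCommit.get(key);
    if (!summary) {
      summary = {
        testId: record.testId,
        commitSha: record.commitSha,
        passes: 0,
        failures: 0,
      };
      byTestAndCommit.set(key, summary);
    }
    if (record.passed) {
      summary.passes += 1;
    } else {
      summary.failures += 1;
    }
  }
  return Array.from(byTestAndCommit.values()).filter(
    (summary) => summary.passes > 0 && summary.failures > 0,
  );
}
```

A test that fails on one commit and passes on a later one is not flagged, because a code change sits between the two runs.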

Tools for detection

  • Datadog CI Visibility. Detects flakes across runs, surfaces failure patterns, and ties test runs to deploys.
  • Buildkite Test Analytics. Flake tracking with built-in quarantine workflows.
  • Spinnaker Flake Detective. Categorises flakes by likely cause.
  • GitHub Actions, GitLab CI. Check what your CI platform already reports (test history, retry outcomes) before adopting a dedicated tool; built-in flake support varies by platform and version.

These tools are pattern-matchers. They tell you which tests are flaky and how often, and they correlate failures with metadata such as time of day, build runner, and recent deploys. Useful information, not a fix.
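To make the metadata correlation concrete, here is a rough sketch of the kind of grouping such tools perform. The `RunWithMetadata` shape and the `failureRateBy` helper are assumptions made for illustration; real tools expose this through their own dashboards and APIs.

```typescript
// Hypothetical record: one run of a single test plus whatever metadata CI attached.
interface RunWithMetadata {
  passed: boolean;
  metadata: Record<string, string | undefined>; // e.g. { runner: 'linux-2' }
}

interface FailureRateRow {
  value: string;
  runs: number;
  failures: number;
  failureRate: number;
}

// Groups runs by one metadata key and returns rows sorted by failure rate, highest first.
function failureRateBy(runs: RunWithMetadata[], key: string): FailureRateRow[] {
  const rows = new Map<string, FailureRateRow>();
  for (const run of runs) {
    const value = run.metadata[key] ?? 'unknown';
    const row = rows.get(value) ?? { value, runs: 0, failures: 0, failureRate: 0 };
    row.runs += 1;
    if (!run.passed) {
      row.failures += 1;
    }
    row.failureRate = row.failures / row.runs;
    rows.set(value, row);
  }
  return Array.from(rows.values()).sort((a, b) => b.failureRate - a.failureRate);
}
```

If one runner or one hour of the day dominates the top of the list, that points toward an infrastructure cause rather than a problem in the test code.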

AI-driven root-cause analysis

This is where the productivity step-change happens. Once a flake is identified, paste the relevant artefacts into Claude or ChatGPT and ask for a ranked list of likely causes. An example prompt:

Test: tests/checkout/payment.spec.ts
Flake rate: 15% over the past 100 runs
 
Failed runs share these characteristics:
- Mostly fail in the "select payment method" step
- When they fail, the failure happens 12-18 seconds in
- Browser console shows: "Stripe iframe not loaded"
 
Test code: [paste]
Page object: [paste]
 
What's likely going wrong, and how should I fix it? Rank causes by likelihood.

A typical AI response: "The Stripe iframe loads asynchronously after the parent page is interactive. Your test isn't waiting for the iframe to be ready before clicking. The 12-18 second window suggests a race between your click and the iframe's network load. Add await page.waitForSelector('iframe[name=stripe]') before interacting, or switch to frameLocator() which waits implicitly. As a fallback, retry once on iframe-related errors."

Hours of manual debugging, compressed into a paragraph. The model is wrong sometimes — but even when it's wrong, it narrows the search space.
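As a sketch of what the frameLocator() suggestion in that response might look like in code: the interfaces below are minimal stand-ins that mirror the parts of Playwright's Page API used here, so the example compiles without the dependency (a real Playwright `Page` should satisfy them). The `selectCardPayment` name and the `[data-testid="card-option"]` selector are hypothetical; only `iframe[name=stripe]` comes from the response quoted above.

```typescript
// Minimal structural stand-ins for the Playwright types used in this sketch.
interface ClickTarget {
  click(): Promise<void>;
}
interface FrameScope {
  locator(selector: string): ClickTarget;
}
interface PageLike {
  frameLocator(selector: string): FrameScope;
}

// Resolves the element through the iframe instead of clicking on the parent page.
// With Playwright, the click is only attempted once the iframe and the element
// are available, which removes the race between the click and the iframe load.
async function selectCardPayment(page: PageLike): Promise<void> {
  const stripeFrame = page.frameLocator('iframe[name=stripe]');
  // Hypothetical selector for the payment option inside the iframe.
  await stripeFrame.locator('[data-testid="card-option"]').click();
}
```

Whether this is the right fix still depends on the 50-run check; the sketch only shows the shape of the change.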

A practical flake-fix workflow

Step 1 of 5: Detect

CI flags a test as flaky based on pass/fail history on the same commit. Note frequency, timing, and any failure-mode pattern.

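The verification step, re-running the fixed test many times (50 runs is the figure used in this lesson) to confirm the flake is gone, is the part most often skipped, and the part that distinguishes "I think I fixed it" from "it's actually fixed."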
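A short sketch of why the number of verification runs matters, assuming runs are independent of each other. If the test still failed at the 15% rate from the earlier example, the chance of 50 clean passes in a row would be 0.85^50, roughly 0.0003, so a clean streak is strong evidence. For a test that fails only 5% of the time, 0.95^50 is roughly 0.08, so 50 runs is noticeably weaker evidence. The `verifyFix` helper and its `runTest` callback are hypothetical names for illustration.

```typescript
// Probability that a test which still fails at `flakeRate` passes `runs` times
// in a row, assuming each run is independent.
function chanceOfCleanStreak(flakeRate: number, runs: number): number {
  return Math.pow(1 - flakeRate, runs);
}

interface VerificationResult {
  runs: number;
  failures: number;
}

// Runs the supplied test callback `runs` times, one after another.
// A rejected promise (thrown error) counts as one failure.
async function verifyFix(
  runTest: () => Promise<void>,
  runs: number,
): Promise<VerificationResult> {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    try {
      await runTest();
    } catch {
      failures += 1;
    }
  }
  return { runs, failures };
}
```

Any failure during verification means the hypothesis was wrong or incomplete, and the new failure logs become evidence for the next prompt.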

Common flake causes the AI catches well

  • Async/timing races. Missing waits, wrong waits, waits on the wrong element.
  • iframe and shadow-DOM access. Especially on Stripe, reCAPTCHA, embedded payment widgets.
  • Test ordering dependencies. Tests that share state and only fail when run in a specific order.
  • Test data pollution. Tests that depend on data that another test creates or deletes.
  • Animation timing. Clicks landing before a transition finishes.
  • Network flakiness. Real network calls in tests that should be mocked.

What AI handles less well

  • Browser-specific bugs. "This test fails 5% of the time on Firefox 119 specifically" — the model doesn't have detailed knowledge of every browser quirk.
  • Backend race conditions. When the flake is a real backend bug, AI may suggest test-side fixes that mask the real issue.
  • Infra-level flakes. CI runner resource contention, DNS hiccups, third-party service blips. AI's suggestions tend to be local; these are systemic.

The quarantine pattern

While you're fixing flakes, don't let them block CI. A quarantine workflow:

  • Move flaky tests to a @flaky-tagged or separate suite.
  • Failures in the quarantine suite don't block PRs — the team sees them, but the build stays green.
  • A weekly cadence reviews the quarantine: fix the test, delete it, or escalate.
  • Tests stay in quarantine for at most 2-3 weeks. Older than that, they get deleted — a flaky test that nobody fixes is providing negative value.

The discipline matters. Quarantine without a fix-or-delete cadence becomes a graveyard of ignored failures.
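One way to support the weekly review is a small script that lists quarantined tests past the age limit. This is only a sketch: the `QuarantineEntry` shape and `overdueQuarantine` helper are hypothetical, and the default of 21 days is the upper end of the 2-3 week window above.

```typescript
// Hypothetical record of a quarantined test and the date it entered quarantine.
interface QuarantineEntry {
  testId: string;
  quarantinedOn: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns entries that have been in quarantine for more than `maxAgeDays`.
function overdueQuarantine(
  entries: QuarantineEntry[],
  today: Date,
  maxAgeDays: number = 21,
): QuarantineEntry[] {
  return entries.filter((entry) => {
    const ageDays = (today.getTime() - entry.quarantinedOn.getTime()) / MS_PER_DAY;
    return ageDays > maxAgeDays;
  });
}
```

Everything this returns is a candidate for the fix, delete, or escalate decision in the weekly review.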

A note on retries

Most CI runners support automatic retries on failure. Used carefully, this absorbs transient infra flakes; used carelessly, it hides real flakes that you should be fixing. A rule of thumb: at most one retry, only on the failing step (not the whole test), and always log when a retry succeeded so you can tell flake-on-retry from clean pass.
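The rule of thumb can be sketched as a small wrapper around a single step. This is an illustrative helper, not a feature of any CI runner or test framework; `runStepWithOneRetry` and its parameters are hypothetical names.

```typescript
interface StepOutcome<T> {
  value: T;
  retried: boolean; // true when the step only passed on the second attempt
}

// Runs one step, retrying it at most once. A pass on retry is logged so it can
// be told apart from a clean pass. A second failure is not caught and propagates.
async function runStepWithOneRetry<T>(
  stepName: string,
  step: () => Promise<T>,
  log: (message: string) => void = console.log,
): Promise<StepOutcome<T>> {
  try {
    return { value: await step(), retried: false };
  } catch (firstError) {
    const value = await step();
    log(`flake-on-retry: step "${stepName}" passed on retry after: ${String(firstError)}`);
    return { value, retried: true };
  }
}
```

Counting the `flake-on-retry` log lines over time gives a flake rate that the green build status alone would hide.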

For deeper context on flaky tests in specific frameworks, see the relevant lessons in the Cypress and Playwright courses.

⚠️ Common Mistakes

  • Re-running until green. Re-running a flake doesn't fix anything; it just delays the moment a real bug slips through. Treat re-run-to-green as technical debt with interest.
  • Not capturing artefacts before re-run. Once the re-run passes, the logs from the failed attempt are often overwritten or discarded. Configure CI to retain HAR files, screenshots, and console output for failed runs.
  • Trusting the AI's first hypothesis without verification. The 50-run verification step exists precisely because plausible-sounding fixes often don't actually fix the underlying issue.
  • Letting quarantine grow. A quarantine list that exceeds 20 tests is a sign the team has stopped fixing — drop tests rather than letting them rot.

🎯 Practice Task

60 minutes. Use a real flaky test if you have one; otherwise simulate one (e.g., add a setTimeout race to a Playwright test).

  1. Identify a flaky test or create one.
  2. Capture artefacts — test code, page object, three or four failure logs.
  3. Paste into Claude or ChatGPT and ask for a ranked list of likely causes with reasoning.
  4. Apply the top hypothesis. Re-run 50 times.
  5. If it's still flaky, re-prompt with the new evidence. Note how the second-round answer differs from the first.
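Steps 2 and 3 can be scripted if you repeat them often. The sketch below assembles a prompt in the same layout as the example earlier in this lesson; the `FlakeReport` shape and `buildFlakePrompt` name are hypothetical, and how you collect the fields depends on your CI setup.

```typescript
// Hypothetical bundle of the artefacts captured for one flaky test.
interface FlakeReport {
  testPath: string;
  flakeRatePercent: number;
  runsObserved: number;
  observations: string[]; // patterns seen across the failed runs
  testCode: string;
  pageObjectCode: string;
}

// Builds a plain-text prompt asking for a ranked list of likely causes.
function buildFlakePrompt(report: FlakeReport): string {
  const lines: string[] = [
    `Test: ${report.testPath}`,
    `Flake rate: ${report.flakeRatePercent}% over the past ${report.runsObserved} runs`,
    '',
    'Failed runs share these characteristics:',
  ];
  for (const observation of report.observations) {
    lines.push(`- ${observation}`);
  }
  lines.push(
    '',
    'Test code:',
    report.testCode,
    'Page object:',
    report.pageObjectCode,
    '',
    "What's likely going wrong, and how should I fix it? Rank causes by likelihood.",
  );
  return lines.join('\n');
}
```

For the second-round prompt in step 5, add the new failure logs to `observations` along with a note on which fix was already tried.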

Next lesson: extending the same approach to bug triage and root-cause analysis at scale.
