The week our flaky-test rate dropped from 18% to 2%

qa.codes · 28 March 2026 · 8 min read

Intermediate

flaky-tests · cypress · ci-cd

Our CI was failing 18% of runs to flakes we'd stopped looking at. One week, four changes, no new tests. Here's what we actually did.

The starting state

Two hundred and twelve Cypress tests, a React front-end, a Node API, PostgreSQL in Docker on CI. We'd accumulated the test suite over about eighteen months, with different people writing different sections at different times. Nobody had ever sat down and specifically addressed flakiness.

The CI config had a retries: 2 setting at the top level — "retry flaky tests automatically, up to twice." At first glance this looks like a reasonable safety net. In practice it was a confidence-destroying trap. Tests that retried twice and passed on the third attempt didn't show as failures in the dashboard. They showed as passes, with a small orange dot indicating a retry. The orange dots were everywhere. Nobody was reading them.

The actual flake rate we were dealing with: 18.3% of CI runs had at least one test that required a retry to pass. That's roughly 1 in 5 runs. Developer trust in the test suite was near zero. People merged with red CI because "it's probably just a flake." The suite had become noise.

Before fixing anything, I removed the automatic retries. Not because retries are always wrong — they're sometimes appropriate — but because I needed to see what was actually failing. Hidden flakes aren't fixable. You can only fix what you can see.
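A minimal sketch of what that diagnostic configuration might look like. The original config file isn't shown, so the reporter settings and the base URL below are assumptions: Cypress's built-in junit reporter writing into cypress/results, and a placeholder local URL.

```javascript
// cypress.config.js: diagnostic configuration (a sketch, not the original file)
const config = {
  // Zero retries in both modes, so every flaky failure shows up as a failure.
  retries: { runMode: 0, openMode: 0 },

  // Assumption: JUnit XML output, one file per spec via the [hash] placeholder.
  reporter: 'junit',
  reporterOptions: {
    mochaFile: 'cypress/results/results-[hash].xml',
  },

  e2e: {
    // Assumption: placeholder URL for the app under test.
    baseUrl: 'http://localhost:3000',
  },
};

module.exports = config;
```

Writing retries as an object rather than a bare number makes it explicit that the setting applies to CI runs (runMode) as well as local interactive runs (openMode).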

Day 1: the retry audit

The first day was purely diagnostic. I ran the full suite ten times on CI (via GitHub Actions, costing about forty minutes of machine time) with retries disabled and failures surfaced. I wrote a short script to parse the JUnit XML output and count how many times each test ID had a non-passing result.

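# Count failures per test name across all JUnit XML reports.
# Assumes the reporter writes <failure>/<error> on the line right after
# the opening <testcase ...> tag; adjust -B if your reporter differs.
grep -h -B1 -E '<(failure|error)[ >/]' cypress/results/*.xml \
  | grep '<testcase' \
  | sed 's/.* name="\([^"]*\)".*/\1/' \
  | sort | uniq -c | sort -rn \
  | head -20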

The top twenty failures accounted for 80% of all flakes. This is almost always how it works — flakiness is not evenly distributed. It clusters. Most of the bottom hundred tests were rock solid.
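For anyone who would rather do the same tally in Node, here is a rough equivalent that also computes how much of the total the top N tests account for. It is a sketch under assumptions: the function names are hypothetical, the XML is matched with regular expressions rather than a real parser, and test names are assumed to be XML-escaped (no raw ">" inside attribute values).

```javascript
// Tally failing test names across JUnit XML documents (already read as strings).
function tallyFailures(xmlDocs) {
  const counts = new Map();
  // Matches both self-closing and open/close <testcase> elements.
  const testcase = /<testcase\b([^>]*?)(?:\/>|>([\s\S]*?)<\/testcase>)/g;
  for (const xml of xmlDocs) {
    for (const [, attrs, body = ''] of xml.matchAll(testcase)) {
      // Only count test cases that contain a <failure> or <error> child.
      if (!/<(failure|error)\b/.test(body)) continue;
      // The leading \s keeps this from matching the classname attribute.
      const name = (/\sname="([^"]*)"/.exec(attrs) || [])[1] || '(unnamed)';
      counts.set(name, (counts.get(name) || 0) + 1);
    }
  }
  return counts;
}

// Fraction of all recorded failures that come from the n most frequent tests.
function topShare(counts, n) {
  const sorted = [...counts.values()].sort((a, b) => b - a);
  const total = sorted.reduce((sum, c) => sum + c, 0);
  const top = sorted.slice(0, n).reduce((sum, c) => sum + c, 0);
  return total === 0 ? 0 : top / total;
}
```

Feeding it the contents of every report file from the ten runs and calling topShare(counts, 20) gives the kind of concentration figure quoted above.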

The worst offenders broke into three categories:

Network timing: tests asserting on data that came from an API call, where the assertion ran before the response arrived. Twelve tests.
Authentication overhead: tests that logged in via the UI on every run. The login form had a 300ms animation that occasionally pushed the email input outside the click target. Four tests.
Dead assertions: tests written against UI that had since changed, whose assertions were never updated to match. They were testing elements that no longer existed. Eight tests.

Day 1 deliverable: a spreadsheet with every failing test, its category, and a proposed fix. Four hours of work.

Day 2: cy.intercept for the 12 worst offenders

The network timing group was the most impactful and the most fixable. All twelve were variants of the same problem: test clicks a button, expects some data to appear, data comes from an API call, the API call takes between 80ms and 600ms depending on CI load.

The wrong fix is cy.wait(1000) — a hard sleep. That just moves the threshold; it doesn't eliminate the race condition. When the API takes 700ms the flake is back.

The right fix is to intercept the request and wait for the response explicitly:

// Before: race condition
cy.get('[data-testid="load-users-button"]').click();
cy.get('[data-testid="user-row"]').should('have.length.greaterThan', 0);
 
// After: deterministic
cy.intercept('GET', '/api/users').as('getUsers');
cy.get('[data-testid="load-users-button"]').click();
cy.wait('@getUsers');
cy.get('[data-testid="user-row"]').should('have.length.greaterThan', 0);

cy.wait('@getUsers') blocks until the intercepted request completes. No race condition. The test takes the same amount of time (it still waits for the API), but it waits correctly instead of accidentally passing when the response is fast and failing when it's slow.
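One variation that may be worth considering: asserting on the response itself, so that an API error shows up as an API error instead of a confusing missing-row failure. The helper name below is hypothetical; the route and selectors are the ones from the example above, and cy is the global Cypress provides inside a spec.

```javascript
// Hypothetical helper for a spec or support file.
function loadUsersAndWait() {
  // Register the intercept before the action that triggers the request.
  cy.intercept('GET', '/api/users').as('getUsers');
  cy.get('[data-testid="load-users-button"]').click();

  // Wait for the real response and fail early if the API itself errored.
  cy.wait('@getUsers').its('response.statusCode').should('eq', 200);

  // The DOM assertion now runs only after the data has arrived.
  cy.get('[data-testid="user-row"]').should('have.length.greaterThan', 0);
}
```

The ordering matters: an intercept registered after the click can miss a fast request, which would reintroduce the race in a different place.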

Twelve tests fixed, roughly three hours of work. Each fix was about four lines of change.

Day 3: cy.session everywhere we were logging in fresh

Four tests were logging in via the UI at the start of each test. This is the canonical flakiness source in Cypress: authentication flows involve animations, form submissions, redirects — the most timing-sensitive parts of any application.

The UI login was the original implementation, written before the team knew about cy.session. The fix was straightforward:

// Before: UI login, runs in ~1.5s, occasionally flakes
beforeEach(() => {
  cy.visit('/login');
  cy.get('[data-testid="email"]').type(Cypress.env('TEST_EMAIL'));
  cy.get('[data-testid="password"]').type(Cypress.env('TEST_PASSWORD'));
  cy.get('[data-testid="submit"]').click();
  cy.url().should('include', '/dashboard');
});
 
// After: session-cached login, runs in ~30ms after first run
beforeEach(() => {
  cy.session('test-user', () => {
    cy.request('POST', '/api/auth/login', {
      email: Cypress.env('TEST_EMAIL'),
      password: Cypress.env('TEST_PASSWORD'),
    }).then(({ body }) => {
      window.localStorage.setItem('token', body.token);
    });
  });
  cy.visit('/dashboard');
});

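After this change, login-dependent tests went from ~1.5 seconds per test (with occasional flakes) to ~30ms per test (zero flakes). Cypress caches the session the first time cy.session('test-user', ...) runs and restores it in every later test in the same spec file, so only that first call makes the actual login request. The cache is not persisted to disk between CI runs: each spec file logs in once, unless the cacheAcrossSpecs option is enabled.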
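If the login should also be shared between spec files, cy.session accepts a cacheAcrossSpecs option. A sketch of a reusable helper, assuming the same login endpoint and token storage as above; the helper name and the session id format are hypothetical.

```javascript
// Hypothetical helper: one cached session per user, shared across spec files.
function loginByApi(email, password) {
  cy.session(
    // The id includes the email so different users get different sessions.
    ['api-login', email],
    () => {
      cy.request('POST', '/api/auth/login', { email, password }).then(({ body }) => {
        window.localStorage.setItem('token', body.token);
      });
    },
    // Without this option the cache is cleared between spec files.
    { cacheAcrossSpecs: true }
  );
}
```

A beforeEach would then call loginByApi(Cypress.env('TEST_EMAIL'), Cypress.env('TEST_PASSWORD')) followed by cy.visit('/dashboard'), exactly as in the "after" version above.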

Day 4: deleting the 8 tests that were testing nothing

The eight "dead assertion" tests were the most uncomfortable to deal with — not technically, but politically. These were tests someone had written, which had once passed, and which still passed some of the time. Deleting them felt like reducing coverage.

But examining them closely, they were asserting on elements that had been removed from the UI six months ago. The element selectors returned empty sets, cy.get() retried for four seconds, and the tests timed out occasionally when CI was slow. When CI was fast, they passed because the assertion ran against the empty set before the timeout — a vacuous pass.

Here's the pattern that produced the false passes:

// This test passes vacuously when .old-banner doesn't exist at all
cy.get('.old-banner').should('not.exist');

If .old-banner was never in the DOM, the assertion should('not.exist') passes immediately. No retry needed. The test is testing nothing.

I deleted all eight. No replacement tests, just deletion. If the features they were originally testing need coverage, that coverage should be written fresh with current selectors and clear assertions.
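For what a replacement could look like, here is one way to keep a negative assertion from passing vacuously: anchor it to a positive assertion first. The banner and its selectors are hypothetical, made up for illustration.

```javascript
// Hypothetical example: a banner that the test dismisses.
function dismissBannerAndVerify() {
  // Positive anchor: fails loudly if the banner was renamed or removed.
  cy.get('[data-testid="promo-banner"]').should('be.visible');
  cy.get('[data-testid="promo-banner-dismiss"]').click();
  // Only now does "not.exist" say something about the app's behaviour.
  cy.get('[data-testid="promo-banner"]').should('not.exist');
}
```

If the element is ever removed from the UI, the first line fails and someone has to decide what the test is for, instead of the test quietly turning into a no-op.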

The result and what we measured

After four days:

Flake rate: 18.3% → 1.9% (measured over 20 CI runs with retries disabled)
Average suite duration: 14m 22s → 11m 48s (faster because eight tests were deleted and four were now 30ms instead of 1.5s)
CI confidence: anecdotally, significant. In the week after the changes, there were zero "it's probably a flake, merge anyway" comments in pull requests.

At the end of week one I re-enabled retries, but with retries: 1 (not 2) and with an alert on any test that uses a retry. The alert goes to a Slack channel. Any test that retries is now a tracked item, not background noise.

The 1.9% residual flake rate covers three tests we know about: one involving a third-party OAuth redirect we can't control, one involving file upload timing in Safari (our CI now runs Chrome only), and one that nobody's been able to reproduce reliably enough to diagnose. Those three are marked with // known flaky — see issue #312 and will get fixed when someone has the time to properly investigate.

Total time spent: roughly 20 hours across four days for one engineer. The productivity return — engineers not second-guessing CI, faster PR merges, fewer "it passed locally" arguments — paid that back within two weeks.
