How one flaky test blocked trust in the whole suite
It started with a single test that failed maybe one run in ten. Within a month the team was re-running CI on every red, ignoring failures, and merging on faith. One flaky test had quietly trained everyone to distrust all of them.
This is a case study, details blurred, about how a single intermittent test poisoned an entire suite — not by being wrong often, but by teaching a team that red doesn't mean broken. The technical bug was small. The cultural damage was the real incident, and it's the part worth studying.
Context
A healthy end-to-end suite, a few hundred tests, green most of the time, trusted by the team as the gate before merge. Then a new test was added for a feature with an animation and an async data load — and it began to fail intermittently, passing on a re-run with no code change.
Symptoms
At first, an occasional red build that "went away when you re-ran it." Annoying, not alarming. But the behaviour it trained spread fast:
- People started re-running CI reflexively on any red, before even reading the failure.
- Then they started merging through red builds — "it's probably just the flaky one."
- Within weeks, every failure was assumed to be flake. Two genuine regressions were waved through that way before anyone noticed the suite had stopped being a gate at all.
That's the real cost: not the one bad test, but the habit it created of paying the flaky-test tax until the whole signal was worthless.
Investigation
When we finally sat down with the flaky test, the cause was ordinary. It asserted on an element immediately after triggering an action, with a fixed sleep-style wait tuned to a fast local machine. In CI — slower, shared, variably loaded — the data sometimes hadn't arrived when the assertion ran. Fast box: pass. Busy CI box: fail. Pure timing, no real bug in the product. The classic shape of flake: a test racing the application instead of waiting for it.
Root cause
Two layers. The technical root cause was a hard-coded wait instead of waiting on the actual condition (the element being present and populated) — a race the test lost whenever CI was slow. The organisational root cause was worse and the real subject of this study: the team had no policy for flake, so a known-unreliable test was allowed to stay in the merge-blocking suite. Once one red could be dismissed as "just the flaky one," every red could be — and the suite's entire value, that red means stop, evaporated.
What the tests missed
Nothing was missed at the assertion level; the test was checking the right thing. What was missing was trust hygiene: no quarantine for unreliable tests, no tracking of flake rate, no rule that a flaky test gets fixed or removed rather than tolerated. The suite had no immune system, so one infection spread to the whole organism's credibility. By the time we looked, the genuine regressions that slipped through were the proof that the gate had already failed.
The reusable lesson
We fixed the test (wait on the condition, not the clock), but the durable fix was a flake policy: a flaky test is quarantined immediately — pulled out of the merge gate into a non-blocking lane — then fixed or deleted within a set window, never left to rot in the blocking suite. A test you don't trust is worse than no test, because it trains people to ignore failures, and that training generalises to the tests that are telling the truth. Protect the signal: the moment "it's probably just flake" becomes a normal sentence, your suite has already stopped working.
Flaky-test case lessons
- One tolerated flaky test trains the team to dismiss all red builds — the damage generalises
- Quarantine flaky tests immediately (non-blocking lane), then fix or delete within a set window
- Never leave a known-unreliable test in the merge-blocking suite
- Fix flake by waiting on the actual condition, not a fixed timer tuned to a fast machine
- Track flake rate; "re-run until green" is a symptom, not a fix
- A test you don't trust is worse than no test — it erodes trust in the ones that work
// RELATED QA.CODES RESOURCES
Tool
Common Bug
// related
The bug that only happened after daylight saving time changed
A case study: a scheduling bug that stayed invisible until the clocks changed — and the test scenarios that would have caught it.
The checkout bug that passed every happy-path test
Every checkout test was green, but combining two discounts and a gift card drove the total negative — and issued credit. A case study in testing invariants, not just features.