AI for flaky test detection
Every team has tests that fail without the code being wrong. The naive response is to auto-retry: run the failing test three times and call it a pass if it passes once. That works until a genuine regression hides behind the re-run policy and reaches production wearing a flaky disguise. The middle path between trusting every failure and retrying every failure is classification: identify each failure as genuinely flaky or genuinely broken, and handle each accordingly. That classification problem is where AI earns its place in your CI pipeline.
What 'flaky' actually means
Same commit, multiple runs, inconsistent results — the definition is simple, the causes are not.
A test is flaky if, on the same commit with no code changes, it produces different results on different runs: passing on run 1, failing on run 3, passing on run 5, without anyone changing anything.
Three root categories cover most flakiness: timing and race conditions (the test depends on an operation completing within a fixed window, and sometimes the window is too short), infrastructure and environment instability (the container running the test behaves differently under load, or a shared external service is slow), and test-state pollution (a previous test leaves state in the database or in-process memory that this test does not expect).
Heuristic re-run policies (retry-on-failure) are a tax on the codebase. They increase CI time, mask the signal that a test is sending you, and, critically, do not distinguish between flaky failures and real ones. A three-retry policy that auto-passes a genuinely broken test, because the defect only triggers intermittently and one of the attempts happened to succeed, is not unusual. Auto-retry is not flaky-test management; it is flaky-test concealment.
The legitimate use of re-run is as a data-collection mechanism: run the failing test N times, observe whether it exhibits inconsistency, and use that data to classify it. That is different from re-run as a pass policy.
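The data-collection use of re-runs can be sketched in a few lines. This is a minimal illustration, not a production runner: `classify_by_rerun`, `RerunVerdict`, and the default of ten re-runs are hypothetical names and values chosen for the example, and `run_test` stands for whatever callable re-executes the failing test on the same commit and returns `True` on a pass.

```python
from typing import Callable, NamedTuple


class RerunVerdict(NamedTuple):
    label: str   # "flaky" or "consistent_failure"
    passes: int  # how many re-runs passed
    runs: int    # how many re-runs were executed


def classify_by_rerun(run_test: Callable[[], bool], n: int = 10) -> RerunVerdict:
    """Re-run an already-failed test n times on the same commit and label it.

    The re-runs are evidence, not a pass policy: the original failure still
    stands, and the verdict only says how to route it.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    outcomes = [bool(run_test()) for _ in range(n)]
    passes = sum(outcomes)
    # Any pass contradicts the original failure on unchanged code: inconsistent.
    label = "flaky" if passes > 0 else "consistent_failure"
    return RerunVerdict(label, passes, n)
```

Under this sketch a `flaky` verdict would route the test to quarantine and a ticket, while a `consistent_failure` verdict keeps the build red. Neither verdict turns the original failure into a pass.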
How AI-driven flaky detection works
Classification replaces retry logic — the model learns which failure patterns correlate with flakiness vs genuine breakage.
AI-driven flaky detection works by learning the signatures of flaky failures and using them to classify incoming failures before a human has to look at them. The inputs are failure metadata: the error message, the stack trace, the test name, the timing, the environmental context, and — crucially — the historical pattern of that test across previous runs.
A test that has passed 200 times and failed once, with a stack trace involving a timeout, is almost certainly flaky. A test that has passed 200 times and failed once with a NullPointerException at a specific line that was changed in the current PR is almost certainly a genuine regression. A trained model can separate these cases because it sees both the test's history and the diff under test; a naive retry policy sees neither.
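To make the two cases concrete, here is a hand-written rule that stands in for the trained model. It is only a sketch: the `Failure` fields, the `classify` function, the label strings, and the 0.95 pass-rate threshold are all assumptions made for illustration, and a real classifier would learn these boundaries from labelled data instead of hard-coding them.

```python
from dataclasses import dataclass, field


@dataclass
class Failure:
    error_message: str
    stack_frames: list = field(default_factory=list)  # (file_path, line_number) pairs
    prior_passes: int = 0    # history before this failure
    prior_failures: int = 0


def classify(failure: Failure, changed_lines: dict) -> str:
    """Label a failure using its history and the PR diff.

    changed_lines maps a file path to the set of line numbers the PR touched.
    """
    # A stack frame on a line the PR changed points at the change itself.
    hits_diff = any(
        line in changed_lines.get(path, set())
        for path, line in failure.stack_frames
    )
    if hits_diff:
        return "likely_regression"

    total = failure.prior_passes + failure.prior_failures
    pass_rate = failure.prior_passes / total if total else 0.0
    # Hypothetical threshold: a long green history plus a timeout suggests a flake.
    if "timeout" in failure.error_message.lower() and pass_rate >= 0.95:
        return "likely_flaky"

    return "needs_review"
```

The third label matters as much as the first two: anything the rule cannot place goes to a human, not to an automatic pass.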
Pattern matching on error messages is the simplest approach and works for well-known flakiness signatures: socket timeouts, "element not visible", "connection refused". More sophisticated classifiers use embedding-based similarity to match new failures against a labelled history of flaky vs genuine failures — Google's internal flake classification research (documented in their engineering blog series on test infrastructure) uses this approach at scale.
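Both approaches can be sketched briefly. The signature list below uses only the three examples named above. The similarity half uses bag-of-words cosine similarity as a deliberately simple stand-in for real embeddings, so it illustrates the nearest-labelled-neighbour idea, not any particular vendor's or Google's implementation. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Optional

# Well-known flakiness signatures, matched case-insensitively.
FLAKY_SIGNATURES = [
    re.compile(r"socket timeout", re.IGNORECASE),
    re.compile(r"element not visible", re.IGNORECASE),
    re.compile(r"connection refused", re.IGNORECASE),
]


def matches_known_signature(message: str) -> bool:
    """Return True if the error message contains a known flaky signature."""
    return any(pattern.search(message) for pattern in FLAKY_SIGNATURES)


def _vector(text: str) -> Counter:
    # Stand-in for an embedding: counts of lowercase alphabetic tokens.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_label(message: str, labelled_history: list) -> Optional[str]:
    """Return the label of the most similar past failure.

    labelled_history is a list of (error_message, label) pairs.
    """
    if not labelled_history:
        return None
    query = _vector(message)
    _, label = max(labelled_history, key=lambda item: _cosine(query, _vector(item[0])))
    return label
```

Swapping `_vector` and `_cosine` for a real embedding model and a vector index changes the quality of the match, not the shape of the lookup.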
Vendor landscape, May 2026
Purpose-built flaky-test tools, CI observability bundles, and the build-your-own Bayesian path.
Trunk Flaky Tests (trunk.io/products/flaky-tests) is the purpose-built option. Test results are uploaded after each CI run; the platform identifies tests showing inconsistent patterns across runs and automatically quarantines them — creating a ticket in Jira or Linear and marking the test as skipped in subsequent runs. Trunk offers a free tier for open-source projects. The focus is narrow: flaky test detection and quarantine, not broader CI observability.
Datadog Test Optimization has a built-in flaky-test management layer within its CI Visibility product (docs.datadoghq.com/tests/flaky_tests). Tests are tagged across multiple dimensions: is_flaky (any historical flakiness), is_new_flaky (first time flaking), and is_known_flaky (existing pattern). Its Early Flake Detection feature retries newly-added tests up to ten times to determine their stability before they enter the main suite. Datadog's advantage here is correlation with infrastructure metrics — if a test starts flaking at the same time as CPU pressure increases on your CI runner, that co-occurrence is visible in the same dashboard.
TestDino is a 2026 entrant positioning as a dedicated flaky-test stability tracker with ML-driven scoring. The stability score for each test reflects its pass rate, run count, and volatility across recent runs. Best suited to teams who want flaky-test tracking as a dedicated product rather than an add-on to a broader observability platform.
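As a rough illustration of how pass rate, run count, and volatility could combine into one number, consider the sketch below. It is not TestDino's formula, which is not given here. The function name, the 0 to 100 scale, the flip-based volatility measure, and the `min_runs` confidence cut-off are all assumptions for the example.

```python
def stability_score(outcomes: list, min_runs: int = 20) -> float:
    """Score a test from 0 (unstable) to 100 (stable).

    outcomes is the recent run history in order, True for pass.
    """
    runs = len(outcomes)
    if runs == 0:
        return 0.0
    pass_rate = sum(1 for passed in outcomes if passed) / runs
    # Volatility: fraction of adjacent run pairs where the result flipped.
    flips = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    volatility = flips / (runs - 1) if runs > 1 else 0.0
    # Discount tests with too little history to judge.
    confidence = min(1.0, runs / min_runs)
    return round(100 * pass_rate * (1 - volatility) * confidence, 1)
```

The volatility term is what separates a test that failed in one block (perhaps a real, since-fixed breakage) from one that alternates between pass and fail, even when both have the same pass rate.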
Custom Bayesian classifiers are what large internal platform teams build when the off-the-shelf options do not fit their scale or privacy constraints. The practical challenge is that you are building a model that needs continuous retraining as the test suite evolves; this is a meaningful ML infrastructure commitment.
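The core of the Bayesian path is a single application of Bayes' rule. The sketch below compares two hypotheses for a failing test, "flaky" and "genuinely broken", given the outcome of some re-runs. The function name and every probability in it are hypothetical placeholders; in a real system they would be estimated from labelled history and re-estimated as the suite evolves, which is where the retraining commitment comes from.

```python
def posterior_flaky(
    failures: int,
    runs: int,
    prior_flaky: float = 0.5,        # assumed prior share of failures that are flakes
    p_fail_if_flaky: float = 0.3,    # assumed per-run failure rate of a flaky test
    p_fail_if_broken: float = 0.98,  # assumed per-run failure rate of a broken test
) -> float:
    """Return P(flaky | failures out of runs re-runs) via Bayes' rule."""
    if not 0 <= failures <= runs:
        raise ValueError("failures must be between 0 and runs")
    passes = runs - failures
    # Likelihood of the observed re-runs under each hypothesis.
    # The binomial coefficient is the same for both and cancels out.
    like_flaky = p_fail_if_flaky ** failures * (1 - p_fail_if_flaky) ** passes
    like_broken = p_fail_if_broken ** failures * (1 - p_fail_if_broken) ** passes
    evidence = prior_flaky * like_flaky + (1 - prior_flaky) * like_broken
    if evidence == 0:
        return prior_flaky  # no information either way
    return prior_flaky * like_flaky / evidence
```

With these placeholder numbers, five failures out of five re-runs push the posterior for "flaky" close to zero, while two failures out of five push it close to one.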
| Tool | Detection approach | Quarantine behaviour | Pricing model | Best-fit |
|---|---|---|---|---|
| Trunk Flaky Tests | Result pattern matching | Auto-quarantine + ticket | Free for open source | Mid-size teams, purpose-built |
| Datadog Test Optimization | is_flaky tags + Early Flake Detection | Tagged + skipped | Usage-based | Teams already on Datadog |
| TestDino | ML stability scoring | Dashboard-driven | SaaS subscription | Dedicated tracker, no full observability |
| Custom Bayesian classifier | Build-your-own model | Custom logic | Engineering time | Large platform teams with ML capacity |
Flaky-test vendor landscape, May 2026
The auto-quarantine debt
Quarantined tests skip silently. Without a review process, the quarantine list becomes a permanent feature of the codebase.
Auto-quarantine is the right short-term response to a confirmed flaky test: skip it, do not block CI on it, create a tracking ticket. But quarantine is not resolution. A quarantined test represents a part of the test suite that is no longer doing its job — it has been removed from the safety net.
Teams that quarantine aggressively and review rarely accumulate dark debt: a growing list of skipped tests, each one a gap in coverage that nobody is responsible for. A monthly review of the quarantine list is not glamorous, but it is necessary.
A useful framing: treat the quarantine list as a bug backlog. Every item on it has a root cause, an owner, and a target resolution date. When the list grows faster than it shrinks, something is systemically wrong with either test quality or the infrastructure — and no amount of AI classification will fix it.
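The backlog framing translates directly into a small review script. The sketch below assumes a hypothetical `QuarantineEntry` record and a 30-day review window. The field names and the report keys are illustrative and do not correspond to any tool's export format.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class QuarantineEntry:
    test_name: str
    owner: Optional[str]
    root_cause: Optional[str]
    quarantined_on: date
    target_date: Optional[date]
    resolved_on: Optional[date] = None


def review(entries: list, today: date, window_days: int = 30) -> dict:
    """Summarise the quarantine list the way a bug backlog would be reviewed."""
    since = today - timedelta(days=window_days)
    open_entries = [e for e in entries if e.resolved_on is None]
    added = sum(1 for e in entries if e.quarantined_on >= since)
    resolved = sum(1 for e in entries if e.resolved_on and e.resolved_on >= since)
    return {
        # Open items with no one responsible for them.
        "unowned": [e.test_name for e in open_entries if not e.owner],
        # Open items past their target date, or with no target date at all.
        "overdue": [
            e.test_name
            for e in open_entries
            if e.target_date is None or e.target_date < today
        ],
        "added": added,
        "resolved": resolved,
        # True when the list grew faster than it shrank in the window.
        "growing": added > resolved,
    }
```

A `growing` flag that stays true across several reviews is the systemic signal described above, and it points at test quality or infrastructure, not at the classifier.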