Q29 of 38 · CI/CD & DevOps
How do you detect flaky tests in CI at scale and manage a quarantine process?
Short answer
Track each test's pass/fail history across runs. Mark a test flaky when it both fails and passes on the same commit, with no code change in between. Quarantine it by excluding it from the PR gate while investigation proceeds, but enforce a maximum quarantine period, after which the test is deleted.
Detail
Flakiness detection requires per-test history, not just per-run pass/fail. Tools like Buildkite Test Analytics, Currents.dev, or a self-hosted JUnit XML database give you a failure rate per test over the last N runs.
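As a rough illustration of the self-hosted route, the sketch below computes a per-test failure rate over the last N runs from JUnit XML reports. The function names `parse_junit` and `failure_rates`, the `classname::name` test ID, and the default window of 50 runs are all assumptions made for this example, not part of any of the tools named above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, deque


def parse_junit(xml_text):
    """Return {test_id: passed} for one JUnit XML report (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    results = {}
    # iter() finds <testcase> whether the root is <testsuites> or <testsuite>.
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue  # skipped tests carry no pass/fail signal
        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        failed = case.find("failure") is not None or case.find("error") is not None
        results[test_id] = not failed
    return results


def failure_rates(reports, last_n=50):
    """reports: JUnit XML strings ordered oldest to newest.

    Returns {test_id: fraction of failures over that test's last_n results}.
    """
    history = defaultdict(lambda: deque(maxlen=last_n))
    for xml_text in reports:
        for test_id, passed in parse_junit(xml_text).items():
            history[test_id].append(passed)
    return {t: sum(1 for p in h if not p) / len(h) for t, h in history.items()}
```

A bounded `deque` per test keeps only the most recent results, so an old burst of failures ages out of the rate on its own.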
A practical quarantine workflow: when a test flips result across two runs on the same commit, open an auto-generated issue and tag the test @quarantine. The nightly pipeline still runs quarantined tests (to catch ones that have started failing consistently), but the PR gate excludes them so they do not block merges.
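The flip rule and the split between the PR gate and the nightly pipeline can be sketched as follows. This is a minimal sketch under stated assumptions: run results are available as `(commit_sha, test_id, passed)` tuples, and `find_flaky`, `select_tests`, and the pipeline names `"pr"` and `"nightly"` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def find_flaky(runs):
    """runs: iterable of (commit_sha, test_id, passed) tuples (assumed shape).

    A test is flagged when one commit has both a pass and a fail recorded for it.
    """
    outcomes = defaultdict(set)
    for commit, test_id, passed in runs:
        outcomes[(commit, test_id)].add(passed)
    return sorted(
        {test_id for (_, test_id), seen in outcomes.items() if seen == {True, False}}
    )


def select_tests(all_tests, quarantined, pipeline):
    """The PR gate skips quarantined tests; the nightly pipeline runs everything."""
    if pipeline == "nightly":
        return list(all_tests)
    return [t for t in all_tests if t not in quarantined]
```

Keying on the commit matters: a test that fails on one commit and passes on the next may simply have been fixed, so only mixed results on the same commit count as a flip.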
Quarantine must have an SLA: if a test is quarantined for more than two weeks without a fix, it gets deleted. A quarantine that fills indefinitely becomes a graveyard that erodes confidence in the suite. Active triage beats passive accumulation.
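One way to enforce the two-week limit is a scheduled job that lists the tests past their deadline. The sketch below assumes the quarantine list is stored as a mapping from test ID to the date it was quarantined; `QUARANTINE_SLA` and `expired` are hypothetical names for this example.

```python
from datetime import date, timedelta

QUARANTINE_SLA = timedelta(days=14)  # the two-week limit from the text


def expired(quarantine, today):
    """quarantine: {test_id: date quarantined} (assumed storage shape).

    Returns the tests that have been quarantined for longer than the SLA.
    """
    return sorted(
        test_id
        for test_id, since in quarantine.items()
        if today - since > QUARANTINE_SLA
    )
```

The job's output could feed an automated deletion PR or a triage report, depending on how much a team trusts unattended deletion.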
// WHAT INTERVIEWERS LOOK FOR