Intelligent test selection

10 min read · Reviewed May 2026 · selection

A CI run that takes 40 minutes does not need to run every test for every pull request. Most changes affect a predictable slice of the codebase, and your CI history records it: months of data on which tests fail when which files change. Predictive test selection trains a model on that history and tells the runner which tests to execute. Run the right 20% in 8 minutes, catch 90% of failures, and save the full suite for the merge.

Difficulty: intermediate

You'll learn: how predictive test selection works, when it pays off, and which vendor model fits your team.

How predictive test selection works

Three inputs go in, a ranked test subset comes out — the model learns which tests are sensitive to which file changes.

Predictive test selection takes three inputs: the code diff for the current change, the historical pass/fail record for every test, and a mapping of which files each test touches or depends on. From these, the model produces a ranked list of tests ordered by their predicted probability of failing for this specific diff.
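As a rough illustration of that interface, the three inputs and the ranked output can be sketched in a few lines of Python. Everything here is hypothetical: `SelectionInput`, `rank_tests`, and the scoring rule (file overlap weighted by past failure rate) are stand-ins for whatever a real vendor model computes, not a description of any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionInput:
    changed_files: frozenset[str]            # input 1: the diff, reduced to file paths
    failure_rate: dict[str, float]           # input 2: historical failure rate per test
    test_files: dict[str, frozenset[str]]    # input 3: files each test touches


def rank_tests(inp: SelectionInput) -> list[tuple[str, float]]:
    """Return (test, score) pairs, highest score first."""
    scored = []
    for test, files in inp.test_files.items():
        # Fraction of the test's files that this diff touches.
        overlap = len(files & inp.changed_files) / max(len(files), 1)
        # Toy score, not a real model: overlap weighted by past failure rate.
        scored.append((test, overlap * inp.failure_rate.get(test, 0.0)))
    # Sort by descending score; break ties by name so the order is stable.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

A production model would replace the toy score with a learned probability, but the shape stays the same: diff, history, and file mapping in, ordered list out.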

The training phase is offline. The model learns co-occurrence patterns: when file A was changed in the past, test T failed with probability p. When file A and file B both changed, tests T and U both failed. These patterns compress months of CI history into a model that can rank tests in milliseconds at runtime.
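The simplest version of that co-occurrence idea is plain counting: for each file, the share of past runs in which a given test failed when that file changed. The sketch below assumes a hypothetical history format of (changed files, failed tests) pairs, one per CI run. Real systems use richer features and learned models, so treat this as the intuition only.

```python
from collections import defaultdict


def train(history):
    """Estimate P(test fails | file changed) from past CI runs.

    history: iterable of (changed_files, failed_tests) pairs, one per run.
    Returns a dict mapping (file, test) to an empirical probability.
    """
    changes = defaultdict(int)    # file -> number of runs in which it changed
    failures = defaultdict(int)   # (file, test) -> runs where file changed and test failed
    for changed_files, failed_tests in history:
        for f in changed_files:
            changes[f] += 1
            for t in failed_tests:
                failures[(f, t)] += 1
    return {(f, t): n / changes[f] for (f, t), n in failures.items()}
```

With a history in which `a.py` changed three times and test `T` failed in two of those runs, the estimate for `("a.py", "T")` comes out at 2/3.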

The inference phase runs at PR open. Your CI script calls the selection API — typically a hosted endpoint or a local model artefact — with the diff, and receives back a filtered test list. Vendors typically let you set a confidence target: "run enough tests to cover 90% of predicted failures", which translates to a specific subset size depending on the change.
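One way to read the confidence target is as a cut-off on cumulative predicted failure probability: walk the ranked list until the chosen share of the total is covered. The function below is a minimal sketch under that assumption. `select_subset` is a hypothetical name, and vendors may define the target differently.

```python
def select_subset(ranked, target=0.90):
    """Pick the shortest prefix of ranked tests that reaches the target.

    ranked: (test, predicted_failure_probability) pairs, highest first.
    target: share of total predicted failure probability to cover.
    """
    total = sum(p for _, p in ranked)
    if total == 0:
        # No signal at all: fall back to running everything.
        return [test for test, _ in ranked]
    chosen, covered = [], 0.0
    for test, p in ranked:
        if covered / total >= target:
            break
        chosen.append(test)
        covered += p
    return chosen
```

The same ranking yields a different subset size for every change, which is why the target is expressed as coverage rather than as a fixed number of tests.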

The key output is not just which tests to run, but which tests not to run. Skipping slow, low-relevance tests on every PR compounds into significant savings. Meta's internal system (documented in Machalica et al., ICSE-SEIP 2019, "Predictive Test Selection") reported roughly halving testing infrastructure cost with minimal impact on the regression rate.

[Figure: predictive test selection at runtime]

What the vendors actually do

A hosted ML model, a bundled CI observability layer, and an open-source static-analysis path — three distinct architectures for the same goal.

The vendor space splits cleanly into three categories: hosted ML models that you send your test history to, CI observability platforms that bundle test selection alongside broader pipeline analytics, and static-analysis tools that derive coverage from your build graph rather than training on history.

CloudBees Smart Tests (formerly Launchable) sits in the first category. You upload test results after each CI run; the hosted model trains on your history and returns a subset on each new run. Multi-language support covers Python, Ruby, Java, JavaScript, Go, and C/C++. The model is CI-server agnostic: it works with GitHub Actions, Jenkins, CircleCI, or any system that can call an HTTP endpoint. Launchable published case studies showing 50–80% CI time reductions. Because the model is hosted, you are sharing test metadata with a third party, which matters in regulated industries.

Datadog Test Optimization bundles Test Impact Analysis alongside CI Visibility — the same platform that gives you pipeline dashboards also analyses which tests to run. Language coverage is growing but has historically been strongest for JavaScript and Python. If your team already uses Datadog for infrastructure monitoring, adding test selection requires no additional vendor relationship.

Bazel's test target dependency analysis takes the third approach: instead of learning from historical failure patterns, it uses the build graph to determine which test targets are downstream of the changed files. This is deterministic and requires no training data, but it requires you to be on Bazel, which is a significant migration if you are not already. Meta's predictive selection research (Machalica et al., ICSE-SEIP 2019) built on the same kind of static foundation, layering a learned model that accounts for test flakiness on top of build-dependency selection.
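The build-graph approach amounts to a reverse-dependency walk: start from the changed targets and collect every test target that transitively depends on them. The sketch below shows the idea on a hand-written graph. It is not Bazel's implementation, and the `deps` dictionary and `is_test` predicate are assumptions made for illustration. In Bazel itself, the `rdeps` query function serves a similar purpose.

```python
from collections import deque


def affected_tests(deps, changed, is_test):
    """Return test targets that transitively depend on any changed target.

    deps: dict mapping each target to the set of targets it depends on directly.
    changed: iterable of changed targets.
    is_test: predicate that says whether a target is a test.
    """
    # Invert the graph: dependency -> targets that depend on it.
    rdeps = {}
    for target, direct in deps.items():
        for dep in direct:
            rdeps.setdefault(dep, set()).add(target)

    # Breadth-first walk up the reverse edges from the changed targets.
    seen, queue = set(changed), deque(changed)
    while queue:
        node = queue.popleft()
        for parent in rdeps.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(t for t in seen if is_test(t))
```

Because the result depends only on the graph, the same change always selects the same tests. That is the determinism this approach offers in exchange for ignoring failure history.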

Meta TAO and Google's internal selection systems are documented in research papers (Machalica et al., ICSE-SEIP 2019; Memon et al., ICSE-SEIP 2017) and are not available as products. They are useful reference architectures showing what is achievable at scale (Meta's TAO paper describes selecting 7% of tests while retaining 96% of failure detection), but they are not a buying option.

Vendor / system	Approach	Hosted / self-hosted	Language support	Best-fit workload
CloudBees Smart Tests	Hosted ML model	Hosted	Python, Ruby, Java, JS, Go, C/C++	Mid-to-large teams with 6+ months CI history
Datadog Test Optimization	CI Visibility bundle	Hosted	JS, Python (growing)	Teams already on Datadog
Bazel dependency analysis	Static build-graph	Self-hosted	Bazel-supported	Monorepos already on Bazel
Meta TAO / Google internal	Internal research system	Internal only	N/A (not a product)	Reference architecture only

Predictive test selection landscape, May 2026

The cost and coverage trade-off

90% coverage in 20% of runtime is the headline — the footnote is that the 10% you missed can still reach production.

The economics of predictive test selection are attractive on paper: run 20% of your tests, catch 90% of failures. The headline number from Launchable's published case studies sits around this range for teams with established CI history. The asymmetry is real: you are trading the cost of running every test (high and certain) against the cost of a missed failure reaching production (potentially much higher, but rare).

The practical answer most mature teams land on is a two-tier strategy. On every PR, run the predicted subset. On merge to the main branch, run the full suite as a defensive backstop. In the worst case, a failure the subset missed surfaces at merge time instead of on the PR, a delay of one merge cycle. That is acceptable for most teams, not for all.

The subset size is configurable. Setting the confidence target at 90% predicted coverage is not the same as setting it at 99%. Higher confidence means more tests included and less time saved. Most teams find the 85–95% range strikes the right balance; teams with high-stakes codebases (payments, healthcare) set it closer to 99% or rely exclusively on the full suite for anything merging to main.

Running 20% of your tests to catch 90% of failures is a great deal — until the 10% you missed lands in production on a Friday afternoon.

When predictive selection pays off

The break-even point is roughly a 10-minute CI suite and 3–6 months of historical run data.

Predictive test selection has a setup cost: the model needs training data. Launchable recommends at least three months of CI history before the model produces reliable rankings; six months is better. Teams with shorter histories, or teams that have recently migrated CI infrastructure and have a gap in their run history, will see degraded accuracy until the model has sufficient signal.

The break-even on CI time is usually around 10 minutes of suite runtime. Below that, the overhead of calling the selection API, formatting the diff, and waiting for a response starts to approach the time saved. A 3-minute test suite will not meaningfully benefit. A 40-minute test suite will see immediate returns.
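The break-even arithmetic can be made concrete with a back-of-the-envelope function. It assumes runtime scales linearly with the fraction of tests run, and it takes the selection overhead as an input you measure yourself. The one-minute overhead in the usage note is an illustrative assumption, not a vendor figure.

```python
def net_minutes_saved(full_suite_min, subset_fraction, overhead_min):
    """Minutes saved per run after paying the selection overhead.

    full_suite_min: runtime of the full suite, in minutes.
    subset_fraction: share of the suite that selection still runs (0 to 1).
    overhead_min: time spent building the diff and calling the selection API.
    Assumes runtime is proportional to the fraction of tests run.
    """
    return full_suite_min * (1 - subset_fraction) - overhead_min
```

With an assumed one-minute overhead and a 20% subset, a 40-minute suite saves about 31 minutes per run, while a 3-minute suite saves about 1.4 minutes, which is hard to justify against the integration work.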

Teams running fewer than 50 tests rarely see net benefit — the selection overhead is proportionately too high. Teams with 500+ tests and a CI runtime exceeding 15 minutes are the sweet spot.

False negatives — failures the subset misses — do occur. The model's confidence target is a probability, not a guarantee. Before deploying predictive selection to your main branch workflow, audit it against your last three months of CI history and calculate how many real failures would have been missed at your chosen confidence threshold. That number should inform your policy.
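That audit can be run as a replay: for every historical run, ask the selector which tests it would have chosen and count the real failures that fall outside the subset. The sketch below assumes a hypothetical `select` callable that maps changed files to a set of chosen tests, and the same (changed files, failed tests) history format used for training.

```python
def audit(history, select):
    """Replay past runs and count failures the selector would have skipped.

    history: list of (changed_files, failed_tests) pairs, one per CI run.
    select: callable taking changed_files and returning the set of tests
            that would have been run for that change.
    Returns (missed_failures, total_failures).
    """
    total = missed = 0
    for changed_files, failed_tests in history:
        chosen = select(changed_files)
        for test in failed_tests:
            total += 1
            if test not in chosen:
                missed += 1
    return missed, total
```

If possible, replay against runs the model was not trained on. Auditing on the training window tends to flatter the miss rate.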

NOTE

Predictive selection pays off when test suite runtime exceeds developer feedback tolerance — usually 10+ minutes. Below that, the engineering cost of integration outweighs the time saved. The model also needs 3–6 months of historical run data to produce reliable rankings.