Intelligent test selection
A CI run that takes 40 minutes does not need to run every test for every pull request. Most changes affect a predictable slice of the codebase, and your CI history records it: months of data on which tests fail when which files change. Predictive test selection is a model that reads that history and tells the runner which tests to execute. Run the right 20% in 8 minutes, catch 90% of failures, and save the full suite for the merge.
How predictive test selection works
Three inputs go in, a ranked test subset comes out — the model learns which tests are sensitive to which file changes.
Predictive test selection takes three inputs: the code diff for the current change, the historical pass/fail record for every test, and a mapping of which files each test touches or depends on. From these, the model produces a ranked list of tests ordered by their predicted probability of failing for this specific diff.
The training phase is offline. The model learns co-occurrence patterns: when file A was changed in the past, test T failed with probability p. When file A and file B both changed, tests T and U both failed. These patterns compress months of CI history into a model that can rank tests in milliseconds at runtime.
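The co-occurrence idea can be sketched in a few lines. This is a deliberately naive illustration, not any vendor's model: the function names `train` and `rank`, the shape of the history records, and the choice to score a test by its highest per-file failure rate are all assumptions made for the example.

```python
from collections import defaultdict


def train(history):
    """Estimate P(test fails | file changed) from past CI runs.

    history: iterable of (changed_files, failed_tests) pairs, one per run.
    Returns a dict mapping (file, test) -> estimated conditional failure rate.
    """
    runs_changing = defaultdict(int)   # file -> number of runs that changed it
    co_failures = defaultdict(int)     # (file, test) -> runs where both happened
    for changed_files, failed_tests in history:
        for f in set(changed_files):
            runs_changing[f] += 1
            for t in set(failed_tests):
                co_failures[(f, t)] += 1
    return {(f, t): n / runs_changing[f] for (f, t), n in co_failures.items()}


def rank(model, diff_files, all_tests):
    """Rank tests for a diff, highest predicted failure rate first.

    Assumption: a test's score is its highest conditional failure rate
    across the files in the diff. Ties are broken by test name.
    """
    scored = []
    for t in all_tests:
        score = max((model.get((f, t), 0.0) for f in diff_files), default=0.0)
        scored.append((t, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical history: four past runs.
history = [
    (["a.py"], ["test_a"]),
    (["a.py"], []),
    (["a.py", "b.py"], ["test_a", "test_b"]),
    (["b.py"], ["test_b"]),
]
model = train(history)
ranking = rank(model, ["a.py"], ["test_a", "test_b", "test_c"])
```

A production model would use richer features (file distance, author, test age, flakiness) and a proper learner, but the input and output shapes stay the same: history in, ranked tests out.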
The inference phase runs at PR open. Your CI script calls the selection API — typically a hosted endpoint or a local model artefact — with the diff, and receives back a filtered test list. Vendors typically let you set a confidence target: "run enough tests to cover 90% of predicted failures", which translates to a specific subset size depending on the change.
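One plausible reading of a confidence target is "take tests from the top of the ranking until the selected tests account for the target share of total predicted failure probability". The sketch below implements that reading; the function name `select_subset` and this definition of coverage are assumptions, and a real vendor may define the target differently.

```python
def select_subset(ranked, target=0.90):
    """Pick the shortest prefix of a ranking that reaches the coverage target.

    ranked: list of (test_name, predicted_failure_probability), highest first.
    target: share of total predicted failure probability to cover (0..1).
    """
    total = sum(p for _, p in ranked)
    if total == 0:
        # Nothing is predicted to fail; the caller decides on a fallback.
        return []
    selected, covered = [], 0.0
    for test, p in ranked:
        if covered >= target * total:
            break
        selected.append(test)
        covered += p
    return selected


# Hypothetical ranking for one diff.
ranked = [("t1", 0.5), ("t2", 0.3), ("t3", 0.15), ("t4", 0.05)]
subset_90 = select_subset(ranked, target=0.90)
subset_99 = select_subset(ranked, target=0.99)
```

The same ranking yields a different subset size at each target, which is why the time saved varies from change to change.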
The key output is not just which tests to run, but which tests not to run. Skipping slow, low-relevance tests on every PR compounds into significant savings: Meta's internal system (documented in Machalica et al., ICSE-SEIP 2019, "Predictive Test Selection") reported cutting test infrastructure cost roughly in half while still reporting more than 95% of individual test failures.
What the vendors actually do
A hosted ML model, a bundled CI observability layer, and an open-source static-analysis path — three distinct architectures for the same goal.
The vendor space splits cleanly into three categories: hosted ML models that you send your test history to, CI observability platforms that bundle test selection alongside broader pipeline analytics, and static-analysis tools that derive coverage from your build graph rather than training on history.
CloudBees Smart Tests (formerly Launchable) sits in the first category. You upload test results after each CI run; the hosted model trains on your history and returns a subset on each new run. Multi-language support covers Python, Ruby, Java, JavaScript, Go, and C/C++. The model is CI-server agnostic: it works with GitHub Actions, Jenkins, CircleCI, or any system that can call an HTTP endpoint. Launchable published case studies showing 50–80% CI time reductions. Because the model is hosted, you are sharing test metadata with a third party, which matters in regulated industries.
Datadog Test Optimization bundles Test Impact Analysis alongside CI Visibility — the same platform that gives you pipeline dashboards also analyses which tests to run. Language coverage is growing but has historically been strongest for JavaScript and Python. If your team already uses Datadog for infrastructure monitoring, adding test selection requires no additional vendor relationship.
Bazel's test target dependency analysis takes the third approach: instead of learning from historical failure patterns, it uses the build graph to determine which test targets are downstream of the changed files. This is deterministic and requires no training data, but it requires you to be on Bazel, which is a significant migration if you are not already. Meta's predictive selection research (Machalica et al., ICSE-SEIP 2019) starts from a comparable static foundation, build-dependency-based selection, and layers a learned model on top that also has to account for test flakiness.
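The build-graph approach amounts to a reverse-dependency walk: start from the changed files and follow "is depended on by" edges until no new targets appear. The sketch below shows that walk over a toy graph held in a dict; the function name and target labels are hypothetical, and Bazel itself answers the same question through its query language (the `rdeps` function) rather than through code like this.

```python
from collections import defaultdict, deque


def affected_tests(deps, changed_files, test_targets):
    """Return test targets that transitively depend on any changed file.

    deps: dict mapping each target to the files/targets it depends on directly.
    changed_files: files touched by the diff.
    test_targets: the set of targets that are tests.
    """
    # Invert the graph: dependency -> targets that depend on it.
    reverse = defaultdict(set)
    for target, inputs in deps.items():
        for dep in inputs:
            reverse[dep].add(target)

    # Breadth-first walk from the changed files through reverse edges.
    seen = set(changed_files)
    queue = deque(changed_files)
    while queue:
        node = queue.popleft()
        for dependent in reverse.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(t for t in test_targets if t in seen)


# Hypothetical build graph.
deps = {
    "//lib:core": ["lib/core.py"],
    "//lib:api": ["lib/api.py", "//lib:core"],
    "//lib:core_test": ["//lib:core"],
    "//lib:api_test": ["//lib:api"],
    "//docs:docs_test": ["docs/index.md"],
}
tests = {"//lib:core_test", "//lib:api_test", "//docs:docs_test"}
selected = affected_tests(deps, ["lib/core.py"], tests)
```

Unlike the learned ranking, this selection never skips a test that can reach the changed code, so it trades smaller savings for no false negatives with respect to declared dependencies.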
Meta's and Google's internal selection systems are documented in research papers (Machalica et al., ICSE-SEIP 2019; Memon et al., ICSE-SEIP 2017) and are not available as products. They are useful reference architectures showing what is achievable at scale (Meta's paper describes selecting 7% of tests while retaining 96% of failure detection), but they are not a buying option.
| Tool | Approach | Hosted / self-hosted | Language support | Best-fit workload |
|---|---|---|---|---|
| CloudBees Smart Tests | Hosted ML model | Hosted | Python, Ruby, Java, JS, Go, C/C++ | Mid-to-large teams with 6+ months CI history |
| Datadog Test Optimization | CI Visibility bundle | Hosted | JS, Python (growing) | Teams already on Datadog |
| Bazel dependency analysis | Static build-graph | Self-hosted | Bazel-supported | Monorepos already on Bazel |
| Meta / Google internal systems | Internal research system | Internal only | N/A (not a product) | Reference architecture only |
Predictive test selection landscape, May 2026
The cost and coverage trade-off
90% coverage in 20% of runtime is the headline — the footnote is that the 10% you missed can still reach production.
The economics of predictive test selection are attractive on paper: run 20% of your tests, catch 90% of failures. The headline number from Launchable's published case studies sits around this range for teams with established CI history. The asymmetry is real: you are trading the cost of running every test (high and certain) against the cost of a missed failure reaching production (potentially much higher, but rare).
The practical answer most mature teams land on is a two-tier strategy. On every PR, run the predicted subset. On merge to the main branch, run the full suite as a defensive backstop. A failure the subset misses then surfaces at merge time instead of at PR time: a delay of one merge cycle, which is acceptable for most teams but not for all.
The subset size is configurable. Setting the confidence target at 90% predicted coverage is not the same as setting it at 99%. Higher confidence means more tests included and less time saved. Most teams find the 85–95% range strikes the right balance; teams with high-stakes codebases (payments, healthcare) set it closer to 99% or rely exclusively on the full suite for anything merging to main.
Running 20% of your tests to catch 90% of failures is a great deal — until the 10% you missed lands in production on a Friday afternoon.
When predictive selection pays off
The break-even point is roughly a 10-minute CI suite and 3–6 months of historical run data.
Predictive test selection has a setup cost: the model needs training data. Launchable recommends at least three months of CI history before the model produces reliable rankings; six months is better. Teams with shorter histories, or teams that have recently migrated CI infrastructure and have a gap in their run history, will see degraded accuracy until the model has sufficient signal.
The break-even on CI time is usually around 10 minutes of suite runtime. Below that, the overhead of calling the selection API, formatting the diff, and waiting for a response starts to approach the time saved. A 3-minute test suite will not meaningfully benefit. A 40-minute test suite will see immediate returns.
Teams running fewer than 50 tests rarely see net benefit — the selection overhead is proportionately too high. Teams with 500+ tests and a CI runtime exceeding 15 minutes are the sweet spot.
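The break-even reasoning is simple arithmetic. Under the simplifying assumption that runtime scales with the share of the suite selected, the net saving per run is the skipped runtime minus the selection overhead. The one-minute overhead used below is a made-up figure for illustration, not a measured one.

```python
def net_saving_minutes(suite_minutes, subset_fraction, overhead_minutes):
    """Minutes saved per CI run by running only a subset of the suite.

    suite_minutes: runtime of the full suite.
    subset_fraction: share of suite runtime the selected subset takes (0..1).
    overhead_minutes: time spent building the diff and calling the selector.
    """
    return suite_minutes * (1 - subset_fraction) - overhead_minutes


# Hypothetical numbers: a 20% subset and one minute of selection overhead.
large_suite = net_saving_minutes(40, 0.20, 1.0)   # about 31 minutes saved
small_suite = net_saving_minutes(3, 0.20, 1.0)    # about 1.4 minutes saved
```

With these assumed numbers the 40-minute suite saves about half an hour per run, while the 3-minute suite saves little more than the overhead it pays, which is the gap the break-even rule of thumb describes.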
False negatives — failures the subset misses — do occur. The model's confidence target is a probability, not a guarantee. Before deploying predictive selection to your main branch workflow, audit it against your last three months of CI history and calculate how many real failures would have been missed at your chosen confidence threshold. That number should inform your policy.
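The audit can be sketched as a replay: for each historical run, ask the selector which tests it would have chosen and count the real failures that fall outside that choice. The function name `backtest`, the record shape, and the `select` callback are assumptions for the example. For an honest result, the selector should only be trained on runs that precede the one being replayed.

```python
def backtest(history, select):
    """Replay past CI runs and count failures a selector would have missed.

    history: list of (changed_files, failed_tests) pairs, one per past run.
    select: callable taking changed_files and returning the tests to run.
    """
    total = missed = runs_with_miss = 0
    for changed_files, failed_tests in history:
        chosen = set(select(changed_files))
        failures = set(failed_tests)
        escaped = failures - chosen        # real failures outside the subset
        total += len(failures)
        missed += len(escaped)
        runs_with_miss += 1 if escaped else 0
    return {
        "failures": total,
        "missed": missed,
        "runs_with_miss": runs_with_miss,
        "miss_rate": missed / total if total else 0.0,
    }


# Hypothetical history and a fixed selector, for illustration only.
history = [
    (["a.py"], ["test_a"]),
    (["b.py"], ["test_b", "test_c"]),
    (["c.py"], []),
]
report = backtest(history, lambda files: ["test_a", "test_b"])
```

Running the replay at several confidence targets gives a concrete miss count for each, which is a firmer basis for policy than the target percentage alone.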