Scaling From 50 to 5000 Tests

A suite of 50 tests with a 10-minute runtime is a convenience. A suite of 5000 tests with a 45-minute runtime is a blocker. Engineers stop running tests before committing. PRs sit waiting for a slow CI pipeline. Feedback loops stretch from minutes to hours. The framework patterns that work perfectly at 50 tests — sequential execution, a single test runner, one CI job — become bottlenecks at 500 and unsustainable at 5000. Scaling is not an accident. It requires deliberate decisions about how tests are categorised, when they run, how they're distributed across machines, and which tests are kept versus deleted. This lesson covers each of those decisions.

How requirements change at each scale

The scaling levers — applied in sequence

No single lever solves a 45-minute runtime. The approach is layered: apply each lever, measure, then apply the next.

Lever 1: Parallelise within one machine

The cheapest scaling move — more threads, same hardware. thread-count="4" in TestNG XML costs nothing except ensuring tests are isolation-correct:

<suite name="Regression" parallel="methods" thread-count="4" verbose="1">
    <test name="All tests">
        <packages>
            <package name="com.mycompany.tests"/>
        </packages>
    </test>
</suite>

Expected throughput gain: roughly linear up to the CPU/memory ceiling. A 40-minute sequential suite typically reaches 12–15 minutes with 4 properly isolated threads.
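
One way to sanity-check the 12–15 minute figure is Amdahl's law, a general model of parallel speed-up that is brought in here for illustration. The sketch below assumes part of the run (suite start-up, shared fixtures, report generation) cannot be parallelised; the serial fractions are illustrative guesses, not measurements.

```python
def estimated_runtime(sequential_minutes: float, threads: int, serial_fraction: float) -> float:
    """Amdahl's-law estimate of wall-clock time.

    serial_fraction is the share of the run that cannot be parallelised
    (an assumed value; measure your own suite to replace it).
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    parallel_fraction = 1.0 - serial_fraction
    return sequential_minutes * (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # 40-minute suite on 4 threads with different amounts of serial work
    for serial in (0.0, 0.07, 0.17):
        minutes = estimated_runtime(40, threads=4, serial_fraction=serial)
        print(f"serial={serial:.0%}: {minutes:.1f} min")
```

Under this model, a perfectly parallel suite would finish in 10 minutes; with roughly 7–17% serial work, four threads land at about 12–15 minutes.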

In Playwright, parallelism is controlled by the workers setting:

// playwright.config.ts
workers: process.env.CI ? 4 : undefined,  // 4 workers in CI; locally, the default (half the logical CPU cores)

pytest-xdist adds distributed workers to pytest:

pip install pytest-xdist
pytest -n 4   # 4 parallel workers

Lever 2: Categorise and run subsets

Not every test should run on every trigger. A push to a feature branch shouldn't run 5000 tests — it should run 100 smoke tests in 5 minutes.

TestNG groups:

@Test(groups = {"smoke", "login"})
public void validLoginRedirectsToDashboard() { ... }
 
@Test(groups = {"regression", "slow", "checkout"})
public void fullCheckoutWithPromoCode() { ... }

<!-- testng.xml: smoke suite for PR triggers -->
<groups>
    <run>
        <include name="smoke"/>
    </run>
</groups>

pytest marks:

import pytest

@pytest.mark.smoke
@pytest.mark.login
def test_valid_login_redirects():
    ...

# Run only smoke tests (shell)
pytest -m smoke
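
Custom marks that are not registered trigger an unknown-mark warning in pytest, and a typo such as @pytest.mark.smok silently drops a test from the smoke run. A minimal sketch of registering the marks in conftest.py, assuming the marker names used in this lesson (the descriptions are illustrative):

```python
# conftest.py
# Marker names come from this lesson; the descriptions are illustrative.
MARKERS = {
    "smoke": "critical-path tests that gate every PR",
    "regression": "full-coverage tests for nightly and pre-release runs",
    "slow": "DB-heavy or multi-step flows, nightly only",
    "login": "feature tag: login area",
    "checkout": "feature tag: checkout area",
}


def pytest_configure(config):
    """pytest hook: register each custom marker so -m filters are typo-safe."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the marks registered, running pytest --strict-markers -m smoke turns any misspelled mark into a collection error instead of a silently skipped test.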

Playwright tags:

test("login with valid credentials @smoke @login", async ({ loginPage }) => {
    ...
});

npx playwright test --grep "@smoke"

The standard tagging taxonomy:

Tag          | Size           | When it runs              | Purpose
smoke        | 50–100 tests   | Every PR, every merge     | Critical path: can we deploy?
regression   | All tests      | Nightly, before releases  | Full coverage
slow         | 5–10% of suite | Nightly only              | DB-heavy, multi-step flows
Feature tags | By area        | When feature area changes | Targeted regression
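
The taxonomy only pays off when CI applies it automatically. The sketch below shows one possible mapping from a CI trigger to pytest arguments. The trigger names, the FEATURE_TAGS set, and the function itself are hypothetical; only the -m expression syntax is pytest's own.

```python
from typing import Iterable, List

FEATURE_TAGS = {"login", "checkout"}  # hypothetical feature areas


def pytest_args_for(trigger: str, changed_areas: Iterable[str] = ()) -> List[str]:
    """Return pytest CLI arguments for a CI trigger.

    trigger: "pull_request", "merge" or "nightly" (assumed names).
    changed_areas: feature areas touched by the change, if known.
    """
    if trigger == "nightly":
        return []  # no -m filter: run everything, including slow tests
    if trigger in ("pull_request", "merge"):
        areas = sorted(set(changed_areas) & FEATURE_TAGS)
        if not areas:
            return ["-m", "smoke"]
        # Smoke always runs; touched feature areas add targeted regression,
        # but slow tests stay in the nightly run.
        feature_expr = " or ".join(areas)
        return ["-m", f"smoke or (({feature_expr}) and not slow)"]
    raise ValueError(f"unknown trigger: {trigger!r}")
```

For example, pytest_args_for("pull_request", ["checkout"]) selects the smoke tests plus the fast checkout tests, while the nightly trigger applies no filter at all.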

Lever 3: Distribute across machines

When one machine with 8 threads isn't enough, distribute the suite across multiple machines:

Selenium Grid 4:

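# docker-compose.yml: Grid with 4 Chrome nodes
services:
  hub:
    image: selenium/hub:4.20.0
    ports: ["4442:4442", "4443:4443", "4444:4444"]
  chrome:
    image: selenium/node-chrome:4.20.0
    shm_size: 2gb          # Chrome needs more shared memory than Docker's default
    depends_on: [hub]
    deploy:
      replicas: 4
    environment:
      SE_EVENT_BUS_HOST: hub
      SE_EVENT_BUS_PUBLISH_PORT: 4442
      SE_EVENT_BUS_SUBSCRIBE_PORT: 4443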

GitHub Actions matrix — shard across parallel jobs:

strategy:
  matrix:
    shard: [1, 2, 3, 4]
 
steps:
  - name: Run tests (shard ${{ matrix.shard }}/4)
    run: npx playwright test --shard=${{ matrix.shard }}/4

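Playwright's --shard=N/M splits the test files into M groups and runs group N. Four parallel jobs each run roughly a quarter of the suite, so wall-clock time drops toward 25% of a single-job run. Expect the real figure to be somewhat higher: every job repeats checkout, dependency install, and browser setup, and the slowest shard sets the finish time.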
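
To make the effect of uneven shards concrete, here is a simplified round-robin split. It is an illustration of the idea, not Playwright's actual algorithm, and the file names and durations are hypothetical.

```python
from typing import Dict, List


def shard_files(files: List[str], shard: int, total: int) -> List[str]:
    """Return the files for 1-based shard `shard` out of `total`.

    Files are sorted first so every CI job computes the same split.
    """
    if total < 1 or not 1 <= shard <= total:
        raise ValueError("shard must satisfy 1 <= shard <= total")
    return sorted(files)[shard - 1::total]


def wall_clock_minutes(durations: Dict[str, float], total: int) -> float:
    """Wall-clock time is set by the slowest shard, not the average."""
    return max(
        sum(durations[f] for f in shard_files(list(durations), s, total))
        for s in range(1, total + 1)
    )


if __name__ == "__main__":
    durations = {"a.spec.ts": 10, "b.spec.ts": 2, "c.spec.ts": 2, "d.spec.ts": 2}
    print(shard_files(list(durations), 1, 2))  # files assigned to shard 1 of 2
    print(wall_clock_minutes(durations, 2))    # minutes taken by the slowest shard
```

In this example one 10-minute file dominates its shard, so two shards finish in 12 minutes rather than the ideal 8. Splitting very long test files is often the cheapest way to rebalance.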

Lever 4: Replace slow UI setup with API setup

The biggest individual test performance wins come from replacing UI-based test data setup with API calls. Logging in through the UI for 50 tests that need an authenticated user takes 2–3 seconds per test (100–150 seconds total). Injecting a session cookie or calling the auth API takes 100–200ms:

// Playwright: every test starts from the saved auth state
test.use({ storageState: "playwright/.auth/user.json" });
 
// In global setup (runs once per test run):
await page.goto("/login");
await page.fill("#email", config.userEmail);
await page.fill("#password", config.userPassword);
await page.click("#submit");
await page.waitForURL("**/dashboard");  // assumed post-login URL; wait so the session exists before saving
await page.context().storageState({ path: "playwright/.auth/user.json" });

The 50 tests that previously logged in through the UI each now start authenticated — saving 100+ seconds of browser interaction from the suite.
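
The same log-in-once idea carries over to a Python suite. The sketch below is one possible shape, assuming a hypothetical login callable that performs the slow login (through the UI or an auth API) and returns the session data as a dictionary; the helper name and the one-hour freshness window are illustrative.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict


def load_or_create_auth_state(
    path: Path,
    login: Callable[[], Dict],
    max_age_seconds: float = 3600,
) -> Dict:
    """Return saved auth state, calling `login` only when the file is missing or stale."""
    if path.exists() and time.time() - path.stat().st_mtime < max_age_seconds:
        return json.loads(path.read_text())
    state = login()  # the one slow login for the whole run
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state))
    return state
```

A session-scoped fixture could call this helper once and hand the resulting cookies to every test, so only the first call pays the login cost.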

Lever 5: Retire obsolete tests

Every 6 months, run a coverage analysis. Tests that:

  • Duplicate coverage of another test exactly
  • Test functionality that was removed from the application
  • Have been skipped for more than 3 months
  • Retry 50%+ of the time and have never been fixed

...should be deleted. A suite of 4500 well-maintained tests is faster, more reliable, and easier to understand than a suite of 5000 tests where 500 are dead weight.
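
Three of these criteria can be checked mechanically if you keep run history; detecting exact duplicate coverage needs coverage data and is left out here. In this sketch the HistoryRecord fields are hypothetical stand-ins for whatever your reporting tool exports, and "3 months" is approximated as 90 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class HistoryRecord:
    name: str
    skipped_since: Optional[date]  # None if the test is not skipped
    runs: int                      # executions in the review window
    retried_runs: int              # executions that needed at least one retry
    covers_removed_feature: bool = False


def retirement_reasons(record: HistoryRecord, today: date) -> List[str]:
    """Return the reasons, if any, that a test is a deletion candidate."""
    reasons = []
    if record.covers_removed_feature:
        reasons.append("tests removed functionality")
    if record.skipped_since and today - record.skipped_since > timedelta(days=90):
        reasons.append("skipped for more than 3 months")
    if record.runs > 0 and record.retried_runs / record.runs >= 0.5:
        reasons.append("retries on 50%+ of runs")
    return reasons
```

A test with an empty list stays; anything else goes on the review list for the six-monthly cleanup, where a human confirms the deletion.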

The 1-hour rule

A guiding constraint: full regression should complete in under 1 hour. This is the maximum feedback cycle that allows a nightly run to be reviewed and acted on before the next work day starts. When the suite exceeds 1 hour:

  1. Profile for the slowest 10% of tests — optimise or remove the worst offenders first.
  2. Verify parallel thread count hasn't been limited unnecessarily.
  3. Add a shard to the CI matrix.
  4. Check whether tests tagged slow can be moved to a separate, less frequent pipeline.
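
Step 1 can be automated from a JUnit-style XML report, a format that pytest (--junitxml), TestNG, and Playwright's junit reporter can all produce. A minimal sketch, assuming each testcase element carries classname, name, and time attributes:

```python
import math
import xml.etree.ElementTree as ET
from typing import List, Tuple


def slowest_tests(junit_xml: str, fraction: float = 0.10) -> List[Tuple[str, float]]:
    """Return the slowest `fraction` of test cases as (test id, seconds), slowest first."""
    root = ET.fromstring(junit_xml)
    timings = [
        (f"{case.get('classname', '')}.{case.get('name', '')}", float(case.get("time", 0)))
        for case in root.iter("testcase")
    ]
    timings.sort(key=lambda item: item[1], reverse=True)
    # Always report at least one test when the report is not empty.
    keep = max(1, math.ceil(len(timings) * fraction)) if timings else 0
    return timings[:keep]
```

Feed it the report text, for example slowest_tests(Path("report.xml").read_text()) with pathlib's Path imported, and start optimising from the top of the list.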

⚠️ Common mistakes

  • Parallelising before ensuring test isolation. Enabling thread-count="4" on a suite with shared static state produces race conditions and flaky failures that are far harder to diagnose than the original slow suite. Validate isolation first; parallelise after.
  • Running all 5000 tests on every PR. This is both slow and expensive, and engineers respond by bypassing the CI check or merging before it finishes. Define the PR gate as the smoke suite, so that "the PR is green" means "the smoke suite is green", and run full regression nightly.
  • Never deleting tests. A test that covers removed functionality still runs, still takes time, and still occasionally breaks on unrelated infrastructure changes. Treat obsolete tests as technical debt — they have a maintenance cost with zero coverage return.
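
The first mistake is easy to reproduce in miniature. This sketch is a Python analogue of shared static state, with hypothetical names: four threads each write a "current user" into one shared slot and into a per-thread slot, then check what they read back.

```python
import threading
from typing import Dict, List

THREADS = 4
shared: Dict[str, str] = {}    # one slot for all threads, like a static field
isolated = threading.local()   # one slot per thread


def worker(test_id: int, barrier: threading.Barrier, results: List[dict]) -> None:
    user = f"user-{test_id}"
    shared["user"] = user
    isolated.user = user
    barrier.wait(timeout=5)    # let every thread write before anyone reads
    results[test_id] = {
        "shared_ok": shared["user"] == user,
        "isolated_ok": isolated.user == user,
    }


def run_demo() -> List[dict]:
    barrier = threading.Barrier(THREADS)
    results: List[dict] = [{} for _ in range(THREADS)]
    threads = [
        threading.Thread(target=worker, args=(i, barrier, results))
        for i in range(THREADS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every thread reads back its own value from the per-thread slot, but only the last writer finds its value in the shared one. In a real suite the shared slot is typically a static driver, session, or test-data object, and the symptom is a test that fails only when run in parallel.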

🎯 Practice task

Implement scaling strategies for your suite — 40 minutes.

  1. Baseline measurement. Time your full suite with a single thread. Record: total time, tests per minute, slowest 5 tests by duration (add timing to your reporter or use TestNG's built-in execution summary).
  2. Implement smoke tags. Add a smoke group or mark to the 10 most critical tests in your suite. Create a separate TestNG XML or pytest marker that runs only smoke. Verify these 10 tests run in under 3 minutes.
  3. Enable parallelism. Set thread-count="3" in your TestNG XML (or workers: 3 in your Playwright config). Run the full suite. Compare wall-clock time to the baseline. Note any new failures: these are most likely isolation violations to fix.
  4. Profile the slow tail. Identify the 5 slowest tests from your timing data. For each: is the slowness from UI setup that could be API setup? Is there an unnecessary full-page navigation? Fix at least one.
  5. Stretch — GitHub Actions matrix. If your project is on GitHub: add a 2-shard matrix to your CI workflow file. Verify that half the tests run in each shard and the total wall time drops by roughly half versus a single-job run. Record the before and after times.

Next lesson: framework documentation and onboarding — how to make your framework an asset that survives team turnover rather than a mystery that only its creator understands.
