Maintaining and Scaling Python Test Suites

A 30-test suite is easy to keep healthy: flakes are rare, runs finish in 90 seconds, and the whole thing fits in one engineer's head. A 300-test suite is a different beast. If one test in 100 is flaky, that is three flaky tests, any of which can turn a clean CI run red, and "the suite is slow today" turns into a steady drag on team velocity. The skills for getting to a working test framework are not the same as the skills for keeping one healthy as it grows. This last lesson covers the operational discipline: flake identification and quarantine, retry-vs-fix decisions, the four metrics that actually matter, the smoke/regression/full tier model, and the maintenance habits that keep a 300-test suite from becoming a 300-test liability.

The first symptom — flake

A flaky test passes one run, fails the next, with no code change between. It's the single worst kind of test result because it teaches the team to ignore failures. Once "just re-run it" becomes the default response to a red CI, real failures get re-run too, and bugs reach production.

The standard Python tools for identifying flake:

pip install pytest-repeat pytest-rerunfailures
 
# Run a single test 5 times, stop on first failure
pytest tests/auth/test_login.py::test_admin --count 5 -x
 
# Re-run failed tests up to 2 times automatically (catches flake during dev)
pytest --reruns 2 --reruns-delay 1

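pytest-repeat's --count N is the diagnostic tool: run a suspected flake N times and watch how often it actually fails. With -x the run stops at the first failure, which answers "can it fail?"; drop -x to count failures across all N runs. If it fails 2/5, it's flaky. If it fails 5/5, it's broken (and an easier fix).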

pytest-rerunfailures's --reruns 2 is the coping tool — when CI re-runs failed tests automatically, transient flakes (network blips, slow staging) don't fail the build. Don't use --reruns to paper over real flake — it hides the problem. Use it for genuinely transient issues only, and track flake rate as a metric.
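The 2/5-versus-5/5 rule of thumb can be written down as a small helper. This is a minimal sketch with hypothetical function names (classify and miss_probability are not part of pytest-repeat); it also shows why a handful of green repeats is weak evidence of stability.

```python
def classify(failures: int, runs: int) -> str:
    """Label a test from repeated-run results using the rule of thumb above."""
    if runs <= 0 or not 0 <= failures <= runs:
        raise ValueError("need runs > 0 and 0 <= failures <= runs")
    if failures == 0:
        return "no failure observed"
    if failures == runs:
        return "broken"
    return "flaky"


def miss_probability(failure_rate: float, runs: int) -> float:
    """Chance that a test failing with probability `failure_rate` passes every one of `runs` runs."""
    return (1 - failure_rate) ** runs


print(classify(2, 5))             # flaky
print(classify(5, 5))             # broken
print(miss_probability(0.1, 5))   # roughly 0.59
```

A test that fails one run in ten still passes five consecutive runs about 59% of the time, so a clean --count 5 is a hint, not proof. Raising the count is the cheap way to tighten that.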

Flake quarantine — `@pytest.mark.flaky`

When a test is flaky and you don't have time to fix it right now, quarantine it by marking it so it doesn't gate CI:

import pytest

@pytest.mark.flaky
def test_eventually_consistent_dashboard(page):
    # ... known-flaky test we'll fix in JIRA-456 ...
    ...

# pytest.ini
[pytest]
markers =
    flaky: known-flaky tests, kept out of the gating run

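Run CI with pytest -m "not flaky" for gating, pytest -m flaky for the quarantine job (which runs without blocking). The flaky test still runs, the team still sees its results, but a flake doesn't break the merge button. One caveat: pytest-rerunfailures also uses a marker named flaky (as in @pytest.mark.flaky(reruns=3)), so with that plugin installed a marked test may also be re-run on failure. If that overlap is unwanted, a differently named marker such as quarantine avoids it.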

The discipline: every flaky marker has a JIRA ticket, every ticket has an owner, and the count of flakies is reviewed weekly. Without that discipline, "quarantine" becomes "graveyard" — tests rot, no one fixes them, the suite shrinks in real coverage even as it grows in test count.
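The ticket-per-marker rule is easy to enforce mechanically. The sketch below is one possible audit, assuming a team convention (not stated in this lesson) that the ticket ID appears in a comment on the marker line or the line directly above it. The function name, the ticket pattern, and the default budget of 5 are all illustrative.

```python
import re

FLAKY_MARK = re.compile(r"^\s*@pytest\.mark\.flaky\b")
TICKET = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")  # matches IDs shaped like JIRA-456


def audit_quarantine(source: str, max_markers: int = 5) -> list[str]:
    """Return human-readable problems found in one test file's source text."""
    lines = source.splitlines()
    problems = []
    marker_count = 0
    for index, line in enumerate(lines):
        if not FLAKY_MARK.match(line):
            continue
        marker_count += 1
        # Assumed convention: ticket ID on the marker line or the line above it.
        previous = lines[index - 1] if index > 0 else ""
        if not TICKET.search(line + "\n" + previous):
            problems.append(f"line {index + 1}: flaky marker without a ticket reference")
    if marker_count > max_markers:
        problems.append(f"{marker_count} flaky markers exceed the budget of {max_markers}")
    return problems
```

Run over every file under tests/ in a pre-commit hook or a weekly CI job, a non-empty result becomes the prompt for the review described above.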

The fix-flake-at-source playbook

When you do fix a flaky test, the four most common root causes:

Snapshot-style assertions instead of expect(...). assert page.locator(".x").text_content() == "Done" reads the text once and fails if the page hasn't caught up yet. The fix: expect(page.locator(".x")).to_have_text("Done") retries until the timeout.
Hardcoded time.sleep(). A fixed sleep is too short when the page is slow and wasted time when the page is fast. The fix: replace with expect(...).to_be_visible() (auto-retries), or wrap the triggering action in with page.expect_response(...) (waits for a specific network event).
Shared state across tests. Test A creates a user; test B happens to use the same email and collides. The fix: data factories from the previous lesson.
Assumptions about backend timing. "After clicking Save, the dashboard reflects the change immediately." Sometimes it does, sometimes there's a 2-second async job. The fix: assert on the eventually-consistent state (expect(...).to_be_visible(timeout=10_000)), or use API verification rather than UI polling.

For each flaky test, work down the list. 80% of flakes are one of the first two.
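To make the first two root causes concrete, here is a stdlib-only sketch of what a retrying assertion does differently from a one-shot read or a fixed sleep. It is an illustration of the idea, not Playwright's implementation; eventually and FakeStatus are hypothetical names.

```python
import time
from typing import Callable


def eventually(check: Callable[[], bool], timeout: float = 5.0, interval: float = 0.1) -> None:
    """Poll `check` until it returns True; raise AssertionError once `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return  # succeeds as soon as the condition holds, no wasted waiting
        if time.monotonic() >= deadline:
            raise AssertionError(f"condition not met within {timeout}s")
        time.sleep(interval)  # short pause between polls, not a guess at total duration


class FakeStatus:
    """Stand-in for a page element whose text becomes 'Done' after `delay` seconds."""

    def __init__(self, delay: float) -> None:
        self._ready_at = time.monotonic() + delay

    def text(self) -> str:
        return "Done" if time.monotonic() >= self._ready_at else "Saving..."


status = FakeStatus(delay=0.3)
print(status.text() == "Done")   # a one-shot read at this instant sees "Saving..."
eventually(lambda: status.text() == "Done", timeout=2.0)
print(status.text() == "Done")   # the polling check waited only as long as needed
```

The one-shot read fails whenever the backend is slower than the test, and a fixed sleep either fails the same way or pays its full duration on every run. Polling against a deadline avoids both.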

Test independence — the rule that prevents most flake

Every test must:

Create its own data (no shared fixtures that leak state).
Run in any order without changing the outcome.
Run in parallel with any other test without collision.

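The simple test: install pytest-randomly and run pytest tests/. The plugin shuffles test order on every run and prints the seed it used (--randomly-seed=N replays a failing order; -p no:randomly turns shuffling off). If the suite passes shuffled, your tests are order-independent. If it fails shuffled but passes in the default collection order, you have ordering dependencies; fix those before adding more tests.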
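The "create its own data" rule usually comes down to never reusing an identifier. As a reminder of the shape, here is a minimal factory sketch. The User fields, the make_user name, and the example.test domain are assumptions for illustration, and the previous lesson's factories may look different.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    email: str
    password: str


def make_user(prefix: str = "user") -> User:
    """Build a user whose email is unique per call, so no two tests share an account."""
    token = uuid.uuid4().hex[:12]
    return User(email=f"{prefix}-{token}@example.test", password=f"pw-{token}")


first = make_user()
second = make_user("admin")
print(first.email != second.email)  # True: each call yields a fresh identity
```

Because nothing is shared, two tests using this factory can run in any order, or at the same time on different xdist workers, without colliding on the same email.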

The four metrics that matter

The dashboard a healthy 300-test suite tracks

The four numbers most teams converge on:

Total tests. Trended weekly. If it's not growing, coverage isn't growing either.
Flaky tests. Trended weekly. Must trend down — if it's growing, the team is adding tests faster than they're stabilising.
Pass rate over the last 30 CI runs. A healthy suite is 98%+. Below 95% means the team has lost confidence, which means failures get ignored.
Wall time. The thing engineers feel directly. Above ~10 minutes for the gating tier, PRs start stacking up while CI runs.

Track these four; ignore the rest. "Lines of test code" is not a metric. "Time spent writing tests" is not a metric. The suite's health is what these four numbers say.
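The thresholds above translate directly into a weekly check. The sketch below assumes you already collect the four numbers somewhere; SuiteSnapshot and health_warnings are hypothetical names, and "pass rate" is read here as the fraction of the last 30 CI runs that finished green.

```python
from dataclasses import dataclass


@dataclass
class SuiteSnapshot:
    total_tests: int
    flaky_tests: int
    recent_runs_passed: list[bool]  # one entry per CI run, newest last
    gating_wall_minutes: float


def health_warnings(now: SuiteSnapshot, last_week: SuiteSnapshot) -> list[str]:
    """Compare this week's numbers with last week's and the thresholds from the list above."""
    warnings = []
    if now.total_tests <= last_week.total_tests:
        warnings.append("total tests did not grow")
    if now.flaky_tests > last_week.flaky_tests:
        warnings.append("flaky count is rising")
    window = now.recent_runs_passed[-30:]
    pass_rate = sum(window) / len(window) if window else 0.0
    if pass_rate < 0.95:
        warnings.append(f"pass rate {pass_rate:.0%} is below 95%")
    elif pass_rate < 0.98:
        warnings.append(f"pass rate {pass_rate:.0%} is below the 98% target")
    if now.gating_wall_minutes > 10:
        warnings.append("gating tier takes longer than 10 minutes")
    return warnings
```

An empty list means all four numbers are where the lesson says they should be. Anything else is an agenda item for the weekly review.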

Keeping the suite fast

Five techniques, in order of impact:

Storage state for auth (chapter 5). Login once per session, not once per test. Saves ~5s × test count.
API setup for data (chapter 4). Don't create users via UI sign-up forms when you can POST. Saves ~10s per test that needs setup.
Parallel execution with xdist (chapter 7). N workers give close to an N× speedup when tests are independent and the runner has enough cores. Nearly free.
Block unneeded resources. page.route("**/*.{png,jpg,jpeg,woff2}", lambda r: r.abort()) saves 1-2 seconds per page load.
Skip expensive tests on the fast tier. Visual and a11y tests don't need to run on every PR — gate them on a marker and run nightly.

Apply all five and a 300-test suite that ran in 60 minutes serially can run in 8 minutes parallel.
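A back-of-envelope model shows where a figure like that can come from. The numbers below are illustrative assumptions (12 seconds per test serially, 5 seconds saved per test by storage state, 4 workers, perfect scaling), not measurements from a real suite.

```python
def estimate_wall_minutes(
    tests: int,
    seconds_per_test: float,
    saved_seconds_per_test: float = 0.0,
    workers: int = 1,
    efficiency: float = 1.0,
) -> float:
    """Rough wall time: remaining per-test cost, spread over workers at a given efficiency."""
    remaining = max(seconds_per_test - saved_seconds_per_test, 0.0)
    serial_seconds = tests * remaining
    return serial_seconds / (workers * efficiency) / 60


print(estimate_wall_minutes(300, 12))                                       # 60.0
print(estimate_wall_minutes(300, 12, saved_seconds_per_test=5, workers=4))  # 8.75
```

Real runs scale less cleanly than this (worker startup, uneven test lengths, shared backends), which is what the efficiency parameter is for. Setting it below 1.0 gives a more conservative estimate.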

The three-tier strategy

Most teams arrive at the same shape: three slices of the same suite, three different gating policies.

Smoke (every PR): 20 tests, < 2 minutes wall time, runs on every push. Gates merges.
Regression (on push to main, nightly): 200 tests, < 15 minutes wall time, runs after merge. Catches issues smoke missed.
Full (weekly, release candidates): all tests including slow/visual/a11y, all browsers, all viewports. Runs Sunday night for Monday review.

# Smoke
pytest -m smoke -n 4 --browser chromium
 
# Regression (skips slow, runs both browsers)
pytest -m "regression and not slow" -n 8 --browser chromium --browser firefox
 
# Full
pytest -n 16 --browser chromium --browser firefox --browser webkit

Markers from chapter 3 plus xdist from chapter 7 plus smart matrix runners from this chapter — that's the whole engine.
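One way to keep the three commands from drifting apart across CI files is to define them once and have every job look its tier up. This is a sketch of that idea with a hypothetical module layout; the arguments are copied from the commands above.

```python
TIERS = {
    "smoke": ["-m", "smoke", "-n", "4", "--browser", "chromium"],
    "regression": [
        "-m", "regression and not slow", "-n", "8",
        "--browser", "chromium", "--browser", "firefox",
    ],
    "full": [
        "-n", "16",
        "--browser", "chromium", "--browser", "firefox", "--browser", "webkit",
    ],
}


def pytest_command(tier: str) -> list[str]:
    """Return the full argv for a tier, or raise ValueError for an unknown tier name."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(TIERS)}")
    return ["pytest", *TIERS[tier]]


print(" ".join(pytest_command("smoke")))  # pytest -m smoke -n 4 --browser chromium
```

A CI job could then run subprocess.run(pytest_command("smoke"), check=True), so changing a tier's parallelism or browser list is a one-line edit.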

When to delete tests

Test code is code. Code that doesn't add value should be deleted. Three cases:

The feature was removed. Delete the tests immediately. Stale tests against removed features confuse new contributors and bloat CI.
The test never finds bugs and nobody updates it. A test that's been green for 12 months might be valuable (it catches regressions) or worthless (it asserts something that never changes). Audit. If the assertion is on something that can't realistically break, delete.
The test duplicates another test. Two tests covering the same flow with different framings double the maintenance for the same coverage. Pick the better-named one, delete the other.

A useful rule of thumb: if a test fails only because someone changed an unrelated file, it's probably testing implementation details rather than behaviour. Either rewrite it to assert on user-visible behaviour, or delete.

Code review for tests

Test code reviewed less rigorously than production code is the seed of every test-suite-rot story. The review checklist:

Can this test actually fail for a meaningful reason? (Or does it just exercise code without verifying behaviour?)
Are locators role/label/text-based, not CSS-class-based?
Is data created by a factory, not hardcoded?
Are assertions expect(...) (auto-retrying), not snapshot-style?
Is there a time.sleep() anywhere? (If yes, the reviewer rejects.)
Does the test name describe what's being verified, not the steps performed?

A team that holds the line on these in review prevents 80% of flake at the source.

Linting test code

Test code is real code. Lint it. Two tools:

pip install ruff black
 
ruff check tests/ pages/ utils/
black tests/ pages/ utils/

ruff catches unused imports, undefined names, common Python mistakes. black formats consistently. Both run in pre-commit hooks (via pre-commit) so violations are caught before the PR.

A pyproject.toml block:

[tool.ruff]
line-length = 100

[tool.ruff.lint]
extend-select = ["I", "B", "UP"]  # imports, bugbear, pyupgrade
 
[tool.black]
line-length = 100

For a Playwright Python project specifically, also lint for the patterns that signal flake:

Catch time.sleep calls — usually a sign of broken auto-wait.
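Catch await inside a plain def (async-style code pasted into a sync-API test). Python rejects this with a SyntaxError when it compiles the file, so it fails at collection even without a linter.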
Catch hardcoded emails and URLs in test bodies.

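Ruff has no plugin mechanism for custom rules, so use a built-in rule where one exists (for example, the flake8-tidy-imports banned-api setting, rule TID251, can be configured to ban time.sleep) and a simple grep-in-pre-commit for the rest.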

The growth curve — 30 to 300 tests

The shape every team's test suite follows:

0-30 tests. Everything fits in one folder. No need for markers, no need for fixtures, parallelism doesn't matter. The work is establishing the patterns.
30-100 tests. Add feature folders, root + per-feature conftest, markers, parallel execution. The work is establishing the framework.
100-300 tests. Flake starts mattering. Add quarantine markers, retry policies, tier separation (smoke vs regression). The work is operational hygiene.
300+ tests. Sharding across CI runners, dedicated visual/a11y tiers, full Allure-history dashboards. The work is throughput optimisation.

Your suite will pass each threshold. Anticipate them — set up markers and feature folders at 30 tests so the structure is in place when you hit 100. Set up flake quarantine at 100 so you have the muscle when you hit 300.

Coming from Playwright TypeScript?

The TS course's "Maintaining and Scaling" lesson covers the same operational concerns:

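TS test.fixme() → Python @pytest.mark.skip(reason=...) as the closest match (fixme does not run the test), or @pytest.mark.flaky / @pytest.mark.xfail to keep the test running outside the gate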
TS playwright.config.ts retries: 2 → Python pytest --reruns 2
TS test.describe.serial() → Python has no built-in equivalent; with xdist, @pytest.mark.xdist_group(name="serial") plus --dist loadgroup keeps a group of tests on one worker
TS test sharding via --shard 1/4 → Python pytest-split or pytest-shard

Same problems, same shape of solution. The Python ecosystem has more plugins (pytest-randomly, pytest-repeat, pytest-rerunfailures) but the core discipline — track flake, quarantine without deleting, fix at source — is identical.

⚠️ Common mistakes

Treating --reruns as a fix. Auto-retry hides flake; it doesn't remove it. Use sparingly, track which tests rely on retries to pass, fix those first. A suite where 30% of tests need a retry is a broken suite, not a stable one.
Letting xfail and flaky markers accumulate without ownership. Every quarantine marker is a debt. Without a JIRA ticket and an owner, the count grows monotonically. Audit weekly; enforce a maximum count (e.g., "no more than 5 flaky markers in the suite at any time"); demand a ticket link on every marker.
Adding tests faster than you stabilise them. A team that ships 10 tests per sprint and stabilises 2 per sprint has a flake-rate that grows unboundedly. Match the budget — if you're adding 10, you're stabilising 10. Otherwise the suite eventually becomes more noise than signal.

🎯 Practice task

Set up the operational hygiene for your test suite. 30-40 minutes.

Install diagnostic plugins:

pip install pytest-repeat pytest-rerunfailures pytest-randomly

Pick any test in your suite and run it 10 times to check stability:
```
pytest tests/auth/test_login.py::test_login_succeeds --count 10 -v
```
If 10/10 pass, the test is stable. If anything fails, capture the failure mode — that's flake to track.
Run the whole suite in random order:
```
pytest tests/ -v
```
pytest-randomly is auto-active once installed. Run twice. If the order changes and the suite still passes, your tests are independent. If a different order fails, you have an ordering dependency to fix.
Tag a flaky test. Add @pytest.mark.flaky to a test you suspect (or a real flake), register the marker in pytest.ini, and confirm pytest -m "not flaky" deselects it.
Set up retries for the gating tier only. In your CI workflow, run smoke tier with --reruns 2 --reruns-delay 1 so transient blips don't break PRs:
```
- run: pytest -m smoke -n 4 --reruns 2 --reruns-delay 1
```

Track the four metrics. Print them at the end of every CI run via a tiny script:

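# ci/print_metrics.py
import json
from pathlib import Path
# Assumes allure-pytest wrote one *-result.json per test (status; start/stop in ms).
results = [json.loads(p.read_text()) for p in Path("allure-results").glob("*-result.json")]
total = len(results)
pass_rate = sum(r.get("status") == "passed" for r in results) / total if total else 0.0
# Wall time is approximated as earliest start to latest stop, in minutes.
span_ms = max(r["stop"] for r in results) - min(r["start"] for r in results) if total else 0
print(f"Total: {total}, Pass rate: {pass_rate:.1%}, Wall time: {span_ms / 60_000:.1f}m")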

Or use --junitxml and parse the XML. The point: the four numbers should be visible at the end of every run.

Set up linting. Add ruff and black to pyproject.toml. Run ruff check tests/ pages/ utils/; fix the warnings. Add a pre-commit hook so future violations get caught at commit time.
Stretch: define your three-tier strategy in .github/workflows/. Use three workflow files (or three jobs in one file): smoke on every PR, regression on push to main, full on a schedule: trigger. Match the marker filters and parallelism levels from earlier in this lesson.
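For the metrics step above, the --junitxml route might look like the sketch below. It assumes the report has the usual testsuite attributes (tests, failures, errors, skipped, time); summarize_junit is a hypothetical helper, and the pass rate here is per test within one run, so the 30-run trend still needs to be stored elsewhere.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text: str) -> dict:
    """Summarize a JUnit XML report: total tests, per-test pass rate, reported minutes."""
    root = ET.fromstring(xml_text)
    # Handle both a <testsuites> wrapper and a bare <testsuite> root element.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    total = sum(int(s.get("tests", 0)) for s in suites)
    failed = sum(int(s.get(key, 0)) for s in suites for key in ("failures", "errors"))
    skipped = sum(int(s.get("skipped", 0)) for s in suites)
    seconds = sum(float(s.get("time", 0)) for s in suites)
    executed = total - skipped  # skipped tests count neither for nor against
    pass_rate = (executed - failed) / executed if executed else 0.0
    return {"total": total, "pass_rate": pass_rate, "minutes": seconds / 60}
```

In CI this would be fed the file written by pytest --junitxml=report.xml, for example summarize_junit(Path("report.xml").read_text()), with the result printed as the last step of the job.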

You've finished Chapter 8 and the course's framework material. The next chapter is the capstone — applying everything you've learned in this course to build a complete, production-quality test suite for a real-world Todo application end-to-end.