How do you set quality SLAs for an API integration test suite owned by QA?

Tags: API testing · Lead · api · leadership · metrics · slas · lead

Short answer

Three numbers: pass rate, escape rate, and runtime. Set bands, not points (pass rate 99-100%, escape rate ≤1 per quarter, full suite ≤5 min). Publish weekly, review monthly, escalate on breach. Use the SLAs to drive *fixes*, not blame: a missed SLA is a question, not a failure.

Detail

SLAs that aren't measured drift. Numbers without consequences are decorative. The lead's job is to set both the measurement and the consequences, and to use them to drive improvement, not to punish.

The three core SLAs:

1. Pass rate: percentage of tests that pass on the first run. A test that goes green only after a retry counts as a failure.

  • Target: ≥99%.
  • Below 95%: investigate; quarantine the worst tests.
  • Below 90%: stop adding tests; fix the existing ones.

This is the leading indicator of suite health. Drift below 99% means flake is creeping in.
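
A minimal sketch of how the first-run pass rate and its bands could be computed. It assumes each test result records how many attempts it needed; the `CaseResult` type, the function names, and the band labels are illustrative and not tied to any particular test runner.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CaseResult:
    """One test's outcome in a single suite run (hypothetical shape)."""
    name: str
    passed: bool   # final outcome after any retries
    attempts: int  # 1 means the outcome was decided on the first run


def first_run_pass_rate(results: Sequence[CaseResult]) -> float:
    """Percentage of tests that passed on attempt 1.

    A test that only passed after a retry is counted as a failure,
    so retries cannot hide flake.
    """
    if not results:
        raise ValueError("no test results to score")
    first_run_passes = sum(1 for r in results if r.passed and r.attempts == 1)
    return 100.0 * first_run_passes / len(results)


def pass_rate_band(rate: float) -> str:
    """Map a pass rate (percent) onto the bands listed above."""
    if rate >= 99.0:
        return "on target"
    if rate >= 95.0:
        return "drifting: flake is creeping in"
    if rate >= 90.0:
        return "investigate; quarantine the worst tests"
    return "stop adding tests; fix the existing ones"
```

Counting retried passes as failures is the design choice that matters here: a suite that is green only after retries still reports a pass rate below 100%.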

2. Escape rate: production incidents that the suite should have caught but didn't.

  • Target: ≤1 per quarter for critical paths.
  • This is the lagging indicator. Low escape rate means the suite has actual coverage.

Track each escape's root cause:

  • Coverage gap (test didn't exist) → add it.
  • Flaky test that was quarantined → fix and unquarantine.
  • Test passed but missed the bug (assertion gap) → strengthen the assertion.
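
The root-cause tracking above can be sketched as a small tally. This is one possible shape, not a prescribed tool: the `RootCause` enum, the function name, and the incident IDs in the usage note are all hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List, Tuple


class RootCause(Enum):
    """Escape categories from the list above; each value is the follow-up action."""
    COVERAGE_GAP = "add the missing test"
    QUARANTINED_FLAKE = "fix and unquarantine the test"
    ASSERTION_GAP = "strengthen the assertion"


def summarise_escapes(escapes: List[Tuple[str, RootCause]]) -> Dict[str, int]:
    """Count escapes per root cause, including causes with zero escapes."""
    counts = Counter(cause for _incident_id, cause in escapes)
    return {cause.name: counts[cause] for cause in RootCause}
```

For example, `summarise_escapes([("INC-1", RootCause.COVERAGE_GAP), ("INC-2", RootCause.ASSERTION_GAP)])` gives one escape each for the coverage-gap and assertion-gap categories and zero for quarantined flakes. `RootCause.COVERAGE_GAP.value` gives the follow-up action for that category.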

3. Runtime: full suite time end-to-end.

  • Target: ≤5 minutes for the full integration suite; ≤60 seconds for smoke.
  • Over target: every minute past 5 erodes developer trust.
  • Well over target: developers start bypassing the suite.

Secondary metrics worth tracking:

  • Coverage by service / endpoint — keeps growth honest.
  • Test addition rate — how many tests were added per sprint.
  • Mean time to fix for failing tests — quick fixes signal a healthy suite; weeks-old broken tests signal abandonment.
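
Mean time to fix is simple to compute once each failing test has a "broke" and a "fixed" timestamp. The sketch below assumes those timestamps are already available from CI history; the function name and input shape are illustrative.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_fix(fixes: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (fixed_at - broke_at) over all fixed tests.

    Each tuple is (broke_at, fixed_at). Tests that are still broken
    are not included and would need to be reported separately.
    """
    if not fixes:
        raise ValueError("no fixed tests to average")
    total = sum((fixed_at - broke_at for broke_at, fixed_at in fixes), timedelta())
    return total / len(fixes)
```

Because still-broken tests are excluded, a number like this is worth pairing with the age of the oldest open failure, which is where abandonment shows up.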

Publication and review:

  • Weekly: pass rate and runtime in a 2-line update to the engineering channel. Just visibility, no commentary.
  • Monthly: escape-rate review with eng + product. Walk through any production incidents; assign owners for gaps.
  • Quarterly: target review. Adjust if the company has changed shape (new services, new risk profile).
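
The weekly two-line update could be generated rather than written by hand. A minimal sketch, assuming the pass rate and runtime come from the latest CI run; the function name and message wording are illustrative, and posting to a channel is left out.

```python
def weekly_update(pass_rate: float, runtime_seconds: float) -> str:
    """Format the two-line weekly status: numbers and targets, no commentary."""
    minutes, seconds = divmod(round(runtime_seconds), 60)
    return (
        f"Pass rate (first run): {pass_rate:.1f}% (target >= 99%)\n"
        f"Full-suite runtime: {minutes}m {seconds:02d}s (target <= 5m)"
    )
```

Keeping the message to the number and its target matches the "just visibility" intent: interpretation belongs in the monthly review.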

Breach response:

  • Pass rate dips below 95% → QA platform sprint dedicated to flake.
  • Escape rate hits 2 in a quarter → joint eng+QA retro on coverage.
  • Runtime exceeds 8 minutes → optimisation sprint.
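
The breach rules above are mechanical enough to encode, which keeps escalation from depending on who is watching the dashboard. A sketch using the thresholds from this list; the `SuiteMetrics` type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SuiteMetrics:
    """Snapshot of the three core SLA numbers (hypothetical shape)."""
    pass_rate: float           # first-run pass rate, percent
    escapes_this_quarter: int  # incidents the suite should have caught
    runtime_minutes: float     # full-suite wall-clock time


def breach_actions(metrics: SuiteMetrics) -> List[str]:
    """Return the responses triggered by the current metrics; empty if none."""
    actions = []
    if metrics.pass_rate < 95.0:
        actions.append("QA platform sprint dedicated to flake")
    if metrics.escapes_this_quarter >= 2:
        actions.append("joint eng+QA retro on coverage")
    if metrics.runtime_minutes > 8.0:
        actions.append("optimisation sprint")
    return actions
```

An empty list means no breach. Metrics that miss target without crossing a breach threshold (for example a 97% pass rate) trigger nothing here and are left to the regular reviews.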

The cultural piece:

  • Bands, not points. "Pass rate ≥99%" is more honest than "100% always." Tests will fail; the SLA acknowledges that and asks "how often?"
  • Fixes, not blame. A missed SLA is a question — what's the root cause? The wrong response is firing or scolding the engineer who added the flaky test. The right response is "what made it flaky, what's the fix, what process change prevents this?"
  • Make targets visible. A dashboard everyone can see — eng, product, leadership. Visibility creates accountability without enforcement.

Anti-patterns:

  • 100% pass rate as the bar — meaningless; either the team retries until green or hides flakes.
  • "Coverage" as the only metric — easy to game with low-quality tests.
  • Targets without ownership — every metric needs an owner who's accountable for moving it.

The lead signal: setting clear bands, publishing transparently, treating SLA breaches as diagnostic, and tying targets to organisational outcomes (incidents, velocity).

// WHAT INTERVIEWERS LOOK FOR

Three core SLAs (pass rate, escape rate, runtime) with bands, a weekly/monthly/quarterly cadence, and a fix-oriented breach response. Never a blame culture.

// COMMON PITFALL

100% pass rate as the SLA. Either the team retries failing tests until green (real bugs hidden), or quarantines liberally (real coverage lost). Bands acknowledge reality.