How to set realistic performance thresholds

qa.codes · 13 June 2026 · 8 min read
Intermediate · Performance testers · QA Leads
performance-testing · thresholds · slo · strategy

A performance test needs a pass/fail line, and most teams pick one out of the air — "response time under 2 seconds" — with no basis. Here's how to set thresholds that mean something instead of numbers that just feel safe.

A performance test without a threshold is just a number generator; a performance test with a made-up threshold is worse, because it gives false confidence or false alarms. "Under 2 seconds" sounds reasonable and is usually meaningless — too loose for a fast API, too strict for a heavy report. The skill is deriving thresholds from something real: user expectation, current behaviour, and business impact. Then they become a CI gate you can trust rather than a number you argue about.

Where a real threshold comes from

Three inputs, used together:

What users actually need. Different operations have different expectations. A keystroke autocomplete needs to feel instant (well under a few hundred ms); a search can take a bit longer; a heavy export or report can take seconds and nobody minds, as long as there's feedback. Set the bar per operation against what feels acceptable for that operation, not one global number.

What it does today (the baseline). Measure current production performance before setting a target. If today's p95 is 400ms, a 2s threshold is so loose that latency could get almost 5× worse and still pass. A good threshold is often "no worse than today, ideally a bit better": anchored to reality, not aspiration. Without a baseline you're guessing.
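As a rough illustration of anchoring to a baseline, the sketch below computes a nearest-rank p95 from a handful of latency samples and derives a threshold from it. The sample values, the `threshold_from_baseline` helper, and the 20% headroom are all assumptions made for the example, not recommended figures; in practice the samples would come from production monitoring.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def threshold_from_baseline(samples_ms, pct=95, headroom=0.20):
    """Baseline percentile plus an allowance for normal variance.

    The 20% default headroom is an assumption for this sketch; tune it to
    the run-to-run noise you actually observe.
    """
    return percentile(samples_ms, pct) * (1 + headroom)


# Hypothetical production latencies in milliseconds (made up for illustration).
baseline_ms = [
    180, 190, 200, 205, 210, 215, 220, 230, 240, 250,
    260, 270, 280, 300, 320, 340, 360, 380, 400, 900,
]

baseline_p95 = percentile(baseline_ms, 95)
anchored_ms = threshold_from_baseline(baseline_ms)

# How far latency could degrade before a made-up "under 2 seconds" line fails.
loose_factor = 2000 / baseline_p95

print(f"baseline p95: {baseline_p95} ms")
print(f"anchored threshold: {anchored_ms:.0f} ms")
print(f"a 2000 ms threshold only fails at a {loose_factor:.1f}x slowdown")
```

With a baseline p95 of 400ms, the anchored line sits a little above today's behaviour, while the 2s line leaves room for latency to multiply several times over before anything fails.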

What the business can tolerate. Some latency genuinely costs money — checkout and search abandonment rise with delay. Those flows deserve tighter thresholds than a low-traffic admin screen. Threshold strictness should track impact.

Set them on the right metric

A threshold on the wrong statistic protects nothing:

  • Use percentiles, not averages. The average lies; a threshold on it lets a miserable tail through. Set the line on p95 (and often p99 for critical paths) so you're bounding the experience of the slow-end users, not the mean.
  • Always pair latency with an error-rate threshold. "p95 < 500ms" means nothing if 8% of requests are failing. A fast error is still an error — gate on both.
  • Tier by criticality. Don't apply one threshold to everything. Critical money paths get strict latency + near-zero errors; background and admin operations get looser bars. Uniform thresholds either over-alert on things that don't matter or under-protect the things that do.
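The three points above can be sketched as a small gate. It shows a skewed sample where the average looks healthy but the p95 does not, and it checks latency and error rate together against per-tier limits. The tier names, limits, and the `evaluate` helper are hypothetical values chosen for the example.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Tier:
    p95_ms: float          # the run passes only if observed p95 is below this
    max_error_rate: float  # the run passes only if the error rate is below this


# Hypothetical tiers; real limits should come from baselines and business impact.
TIERS = {
    "critical": Tier(p95_ms=500, max_error_rate=0.001),
    "background": Tier(p95_ms=3000, max_error_rate=0.01),
}


def p95(samples):
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(tier_name, durations_ms, failures, total_requests):
    """Return a list of breach descriptions; an empty list means the gate passes."""
    tier = TIERS[tier_name]
    breaches = []
    observed = p95(durations_ms)
    if observed >= tier.p95_ms:
        breaches.append(f"p95 {observed} ms is not below {tier.p95_ms} ms")
    error_rate = failures / total_requests
    if error_rate >= tier.max_error_rate:
        breaches.append(f"error rate {error_rate:.1%} is not below {tier.max_error_rate:.1%}")
    return breaches


# 18 fast requests and 2 very slow ones: the mean hides the tail.
skewed_ms = [100] * 18 + [2500] * 2
print("mean:", mean(skewed_ms), "ms, p95:", p95(skewed_ms), "ms")

# Slow tail on a critical path: the latency check breaches.
print(evaluate("critical", skewed_ms, failures=0, total_requests=20))

# Fast responses but 8% failing: the error-rate check breaches.
print(evaluate("critical", [100] * 20, failures=8, total_requests=100))

# The same slow tail is tolerated on a background operation.
print(evaluate("background", skewed_ms, failures=0, total_requests=20))
```

The same data passes or fails depending on the tier, which is the point of tiering: the strict bar is spent only on the paths where it matters.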

Setting performance thresholds

  • Derive per-operation, not one global number — instant for autocomplete, seconds OK for heavy reports
  • Measure today's production baseline first; anchor the target to it ("no worse than now")
  • Set the threshold on p95/p99, never the average
  • Always pair a latency threshold with an error-rate threshold (e.g. failures < 1%)
  • Make critical/revenue paths strict; loosen for low-impact background operations
  • Sanity-check: would this catch a real regression, and would it avoid firing on normal variance?
  • Write thresholds into the test (e.g. k6 thresholds) so the gate is automatic and consistent

Avoid the two failure modes

A threshold can fail in two opposite ways, and both erode trust in the test. Too loose (the made-up "under 2s" on a 400ms API) lets real regressions sail through — the gate is green while things quietly degrade. Too tight (or set on a noisy average) fires constantly on normal variance, and a gate that cries wolf gets ignored, which is the same as not having one — the performance cousin of a flaky test poisoning the suite. A good threshold sits in the band that catches genuine regressions and tolerates normal noise, derived from what users need and what the system does today. Revisit them as the baseline shifts — a threshold set against last year's performance slowly becomes meaningless. The goal isn't a number that feels safe; it's a line that actually means "if we cross this, users are worse off."
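One way to check a candidate threshold against both failure modes is to replay it over the p95 values from past healthy runs. The sketch below assumes a hypothetical history of ten runs and treats a 1.5× slowdown as the regression worth catching; both are illustrative choices, as are the helper names.

```python
def false_alarm_rate(healthy_p95_runs_ms, threshold_ms):
    """Fraction of known-healthy runs that would have failed the gate."""
    fired = sum(1 for value in healthy_p95_runs_ms if value >= threshold_ms)
    return fired / len(healthy_p95_runs_ms)


def catches_regression(healthy_p95_runs_ms, threshold_ms, slowdown):
    """True if every healthy run, slowed by the given factor, would fail the gate."""
    return all(value * slowdown >= threshold_ms for value in healthy_p95_runs_ms)


# Hypothetical p95 results (ms) from ten recent healthy runs of the same test.
healthy_runs = [380, 395, 400, 410, 405, 390, 420, 398, 402, 415]

for candidate_ms in (2000, 480, 400):
    alarms = false_alarm_rate(healthy_runs, candidate_ms)
    caught = catches_regression(healthy_runs, candidate_ms, slowdown=1.5)
    print(f"{candidate_ms} ms: false alarms {alarms:.0%}, catches 1.5x slowdown: {caught}")
```

In this made-up history the 2000ms line never fires but also misses the slowdown, the 400ms line fires on most healthy runs, and the 480ms line does neither. Rerunning the check as new runs accumulate is a cheap way to notice when the baseline has shifted.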
