Response Time and Performance Assertions

A correct API that takes 8 seconds to respond is, for users, a broken API. Performance regressions creep in slowly — a new join here, an unchecked N+1 there — and tend to escape functional test suites that only assert on correctness. Layering response-time checks into the same tests that already verify behaviour catches these regressions early, while they're still cheap to fix. This lesson covers when to assert on time, how to set thresholds without making your suite flaky, and how to think about distribution-level metrics like p95.

Why include response time in functional tests

Two reasons:

Catch regressions early. A query that suddenly takes 5× longer almost always traces back to a recent code change, and it gets caught at that change if an assertion fails the moment the threshold is crossed.
Free signal. You're already making the request. Asserting response.elapsed < 500ms adds nothing to test runtime.
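As a rough illustration of how little code this takes, here is a minimal sketch of an inline timing check. The helper name assert_responds_within is hypothetical, and the fake response is a stand-in so the example runs without a network call; in a real test you would pass the requests.Response you already have.

```python
from datetime import timedelta
from types import SimpleNamespace


def assert_responds_within(response, limit_ms):
    """Fail with a readable message if the response took longer than limit_ms.

    `response` is any object with an `elapsed` timedelta, such as a
    requests.Response. The helper name is hypothetical.
    """
    elapsed_ms = response.elapsed / timedelta(milliseconds=1)
    assert elapsed_ms < limit_ms, (
        f"took {elapsed_ms:.0f}ms, limit is {limit_ms}ms"
    )
    return elapsed_ms


# Stand-in for a real response so the sketch runs without a network call.
fake_response = SimpleNamespace(elapsed=timedelta(milliseconds=184))
assert_responds_within(fake_response, limit_ms=500)
```

Putting the measured time in the failure message means a failing CI log shows how far over the limit the request was, not just that it failed.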

What this isn't: a substitute for proper load testing. A response-time check on a single request from CI doesn't tell you whether the API survives 1,000 concurrent users — that's a job for k6, Gatling, or Locust against a dedicated environment. Functional response-time assertions are an early-warning system, not a load test.

Setting realistic thresholds

The hardest part is picking a threshold that's tight enough to catch regressions and loose enough not to flake.

Rough starting points by endpoint type:

Endpoint type	Reasonable target
Health check (`/health`)	`< 100ms`
Simple read (`GET /users/123`)	`< 300ms`
List with pagination (`GET /products?limit=20`)	`< 500ms`
Search or aggregate query	`< 1000ms`
Create / update	`< 800ms`
File upload (small)	`< 3000ms`
Report or analytics endpoint	`< 5000ms`
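One way to keep these targets in a single place is a small lookup that tests can share. This is only a sketch: the dictionary keys and the target_ms function are hypothetical names, and the numbers are the starting points from the table above, not measured values.

```python
# Starting targets from the table above, in milliseconds. The keys are
# hypothetical labels; rename them to match your own endpoint categories.
STARTING_TARGETS_MS = {
    "health_check": 100,
    "simple_read": 300,
    "paginated_list": 500,
    "search_or_aggregate": 1000,
    "create_or_update": 800,
    "small_file_upload": 3000,
    "report_or_analytics": 5000,
}


def target_ms(endpoint_type):
    """Look up the starting target, failing loudly on an unknown category."""
    try:
        return STARTING_TARGETS_MS[endpoint_type]
    except KeyError:
        raise ValueError(f"no target defined for {endpoint_type!r}") from None


print(target_ms("simple_read"))  # 300
```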

Two rules of thumb:

Set thresholds with margin. If your endpoint typically responds in 200ms, set the assertion at 500ms or 800ms — not 250ms. The margin absorbs noisy CI runners and slightly slow days.
Round to readable numbers. < 500ms reads better than < 487ms and isn't any less informative.

A useful workflow: capture baseline timings for each endpoint over a week, take the worst observed value, double it, round up. That's your starting threshold.
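That workflow is simple enough to sketch in a few lines. The function name, the rounding step of 100ms, and the sample timings below are all assumptions for illustration; only the "worst value, doubled, rounded up" rule comes from the text.

```python
import math


def suggest_threshold_ms(baseline_ms, factor=2.0, round_to_ms=100):
    """Worst observed timing times `factor`, rounded up to a multiple of `round_to_ms`."""
    if not baseline_ms:
        raise ValueError("need at least one baseline timing")
    worst = max(baseline_ms)
    return math.ceil(worst * factor / round_to_ms) * round_to_ms


week_of_timings_ms = [184, 201, 176, 243, 190, 318, 205]  # made-up sample data
print(suggest_threshold_ms(week_of_timings_ms))  # 318 * 2 = 636, rounded up to 700
```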

How to measure

Most clients expose total request time directly:

curl — %{time_total} in the output format string:

curl -w "Total: %{time_total}s\n" -o /dev/null -s https://api.example.com/users
# Total: 0.184s

Python requests — response.elapsed:

response = requests.get(url, timeout=5)
assert response.elapsed.total_seconds() < 0.5

JavaScript / fetch — wrap with performance.now():

const start = performance.now();
const response = await fetch(url);
const elapsed = performance.now() - start;
expect(elapsed).toBeLessThan(500);

Postman — pm.response.responseTime (in milliseconds), used in the Tests tab.

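These measure time from the client's perspective, including network, which is close to what users experience and so the right thing to assert on. One caveat: requests' response.elapsed and an awaited fetch() both stop the clock once the response headers arrive, so time spent downloading a large body is not included.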
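If body download and parsing time should count as well (response.elapsed in requests covers only the span up to the parsed headers), one option is to time the whole call yourself. A minimal sketch using only the standard library; timed is a hypothetical helper, and the lambda is a stand-in for a real request:

```python
import time


def timed(call):
    """Run `call()` and return (result, elapsed_ms) measured with a monotonic clock."""
    start = time.perf_counter()
    result = call()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


# Stand-in workload; in a real test this might be
# lambda: requests.get(url, timeout=5).json()
result, elapsed_ms = timed(lambda: sum(range(1000)))
print(f"{elapsed_ms:.2f}ms")
```

time.perf_counter() is used rather than time.time() because it is monotonic, so a system clock adjustment mid-request cannot produce a negative or inflated reading.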

What "p95" means and why it matters

A single request's time tells you almost nothing — APIs are noisy. The useful question is "across many requests, how slow is it for the slowest 5% of users?" That number is the p95 latency: 95% of requests are at or below it, 5% are slower.

Response time distribution (100 sample requests)

Response time	Requests
0–100ms	12
100–200ms (p50 here)	38
200–300ms	28
300–500ms	14
500–800ms (p95 here)	6
800ms+ (long tail)	2

Read it like this:

p50 (median) ≈ 180ms — the typical user experience.
p95 ≈ 600ms — the slowest 5% of users wait this long or longer.
p99 is hidden in that long tail — the worst 1% might be 2-3 seconds.
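To see why p50 and p95 land in those buckets, it helps to walk the cumulative counts. The sketch below does that with the bucket counts from the distribution above; bucket_for_percentile is a hypothetical helper, and it uses the nearest-rank definition of a percentile, which is one of several in common use.

```python
import math

# Bucket counts from the distribution above (100 requests in total).
buckets = [
    ("0–100ms", 12),
    ("100–200ms", 38),
    ("200–300ms", 28),
    ("300–500ms", 14),
    ("500–800ms", 6),
    ("800ms+", 2),
]


def bucket_for_percentile(buckets, pct):
    """Return the label of the bucket containing the nearest-rank percentile."""
    total = sum(count for _, count in buckets)
    rank = math.ceil(pct * total / 100)  # 1-based rank of the target request
    seen = 0
    for label, count in buckets:
        seen += count
        if seen >= rank:
            return label
    raise ValueError("percentile must be between 0 and 100")


for pct in (50, 95, 99):
    print(f"p{pct}: {bucket_for_percentile(buckets, pct)}")
# p50: 100–200ms
# p95: 500–800ms
# p99: 800ms+
```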

Single-request assertions tend to track p50. To assert on p95, you need many requests:

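import math
import requests

url = "https://api.example.com/users"  # placeholder endpoint
# 50 samples; the timeout keeps one hung request from stalling the run.
times = sorted(requests.get(url, timeout=5).elapsed.total_seconds() for _ in range(50))
# Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
p95 = times[math.ceil(0.95 * len(times)) - 1]
assert p95 < 0.8, f"p95 was {p95:.3f}s"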

This is heavier than a single-request check and usually lives in a dedicated performance suite, not in PR-blocking CI.

Patterns that work in practice

Three reliable patterns:

Per-test inline assertion — every functional test asserts elapsed < threshold. Cheap, catches regressions on every CI run.
Aggregated post-suite report — collect all timings, log p50/p95/p99 at the end of the suite, compare to last run. No flakiness, dashboard-friendly.
Dedicated perf suite — runs nightly with N requests per endpoint, asserts on p95. Catches slower regressions.

Most teams should start with #1, add #2 once they have a few weeks of data, and reserve #3 for endpoints with explicit SLAs.
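The aggregated report in pattern #2 can start very small. The sketch below reduces one run's timings to p50/p95/p99 and flags any percentile that grew noticeably since the previous run. All function names, the 1.25 tolerance, and the sample timings are assumptions chosen for illustration.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(timings_ms):
    """Reduce one suite run's timings to the p50/p95/p99 figures worth logging."""
    ordered = sorted(timings_ms)
    return {f"p{p}": percentile(ordered, p) for p in (50, 95, 99)}


def regressions(previous, current, tolerance=1.25):
    """Percentiles that grew by more than `tolerance` times since the last run."""
    return [k for k in current if k in previous and current[k] > previous[k] * tolerance]


# Made-up timings for two consecutive suite runs, in milliseconds.
last_run = summarize([120, 130, 150, 180, 200, 210, 240, 260, 300, 620])
this_run = summarize([125, 135, 155, 185, 210, 220, 250, 280, 900, 1400])
print(this_run)                         # {'p50': 210, 'p95': 1400, 'p99': 1400}
print(regressions(last_run, this_run))  # ['p95', 'p99']
```

Here the median barely moves while the tail more than doubles, which is the kind of change a single-request assertion with a wide threshold can miss.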

What slows responses down

When a threshold fails, the suspects are usually:

Database queries — new joins, missing indexes, N+1 queries.
External service calls — a new dependency on a slow third-party.
Payload size — a new field or larger response.
Caching change — cache disabled, TTL too short.
Server load — noisy neighbour, too many concurrent requests.
Network — staging in a different region, VPN routing.

The first two account for the bulk of regressions. When you suspect a regression, ask the team: "what changed in the data layer?"

Don't over-tighten

Tight thresholds and noisy CI are a recipe for the most demoralising kind of test failure: flake that doesn't represent a real bug. If a test fails 3% of the time on the same code, developers stop trusting it and start "retrying CI" reflexively. That undermines every real signal you'd later want the suite to send.

Practical defenses against flake:

Wide thresholds (3-5× the typical observed time, not 1.2×).
Skip timing assertions on shared CI runners with known noise; run them in a dedicated environment instead.
Move to aggregated p95 over time, where one slow request doesn't fail the suite.
If a test consistently flakes at the timing assertion, raise the threshold. Investigate only when the trend shifts.
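Before settling on a threshold, it can help to check how often it would have failed against timings you have already collected. A minimal sketch, assuming you have such a history; flake_rate is a hypothetical helper and the sample data is made up:

```python
def flake_rate(samples_ms, threshold_ms):
    """Fraction of historical timings that would have failed `elapsed < threshold_ms`."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    failures = sum(1 for t in samples_ms if t >= threshold_ms)
    return failures / len(samples_ms)


# Made-up history: typical response near 200ms with an occasional slow run.
history_ms = [190, 205, 198, 240, 210, 187, 650, 202, 215, 195]
for threshold in (240, 500, 800):
    print(threshold, flake_rate(history_ms, threshold))
# 240 0.2
# 500 0.1
# 800 0.0
```

With a typical time near 200ms, the 240ms threshold (about 1.2×) would have failed one run in five on unchanged code, while 800ms (about 4×) would not have failed at all.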

⚠️ Common mistakes

Asserting on a single request as if it represents performance. One request fluctuates wildly. Use averages, p95s, or wide thresholds.
Setting thresholds equal to the observed mean. Half your CI runs will flake. Always leave headroom.
Skipping perf assertions because "we'll add them later." "Later" tends not to come; meanwhile, regressions accumulate undetected. Even loose thresholds catch the worst slips.

🎯 Practice task

Add timing assertions to a real endpoint. 25 minutes.

Pick any endpoint you can hit — public API or your own. Run curl -w "%{time_total}\n" -o /dev/null -s <url> ten times. Note the range.
Compute a sensible threshold: take the slowest observed value and double it. Round to a clean number.
In your favourite language, write a small loop that hits the endpoint 50 times and records elapsed times.
Compute p50 and p95 from the list (sort and pick the right indices). Assert p95 < your threshold.
Try to break it: point the loop at a deliberately slow endpoint (https://httpbin.org/delay/2) and confirm the assertion fails the way you expect. A time.sleep(0.5) between requests will not trip it, because the pause falls outside each request's measured time.
Stretch: capture per-test timings into a JSON file at the end of your suite. Diff today's p95 against yesterday's. That's the seed of a trend monitor.

You can now treat response time as a first-class assertion. The next lesson goes one layer deeper — verifying not just what the API says, but what it actually did.