Q10 of 38 · Performance
How do you set realistic SLOs for a load test?
Short answer
Start from production data: historical p95/p99 latencies and user-impact studies. Set SLOs tighter than current performance to drive improvement, but loose enough to be achievable. Negotiate with product on tail-latency trade-offs, and version SLOs alongside the features they cover.
Detail
Made-up SLOs are worthless. The number "p95 < 500ms" carries weight only if there's a story behind it.
Source 1: production telemetry. What is p95 in production today? Pull 30 days of data from your monitoring tool (Datadog, New Relic). If the current p95 is 600ms, an SLO of 500ms is tight but attainable; 200ms is fantasy. Setting the SLO 10-20% tighter than the current value guards against regression while pushing for a modest, reachable improvement.
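As a rough sketch of that arithmetic, the snippet below computes a nearest-rank p95 from a list of latency samples and proposes a target 15% tighter. The sample values, the function names, and the 15% default are illustrative assumptions; real input would be the 30-day export from your monitoring tool.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

def propose_slo(samples_ms, pct=95, tighten_by=0.15):
    """Return (current percentile, proposed SLO threshold), both in ms."""
    current = percentile(samples_ms, pct)
    return current, round(current * (1 - tighten_by))

# Hypothetical latencies in ms, standing in for exported production telemetry.
latencies_ms = [120, 180, 210, 250, 300, 340, 410, 480, 520, 600]
current_p95, target = propose_slo(latencies_ms)
print(f"current p95 = {current_p95} ms, proposed SLO = p95 < {target} ms")
```

With these ten samples the current p95 is 600ms and the proposed target is 510ms. On real data, prefer the percentile your monitoring tool reports, since tools differ in how they interpolate.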
Source 2: user-impact research. What latency makes users abandon? Amazon is widely reported to have found that every 100ms of added latency cost about 1% in sales; Google reported that an added 400ms delay cut searches per user by roughly 0.6%. Common rules of thumb: under 100ms feels instant, under 1s feels responsive, over 3s causes drop-off. Use these to bound the SLO from the user's side.
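Source 3: downstream dependencies. If your service calls a vendor whose SLA only commits to 500ms, you cannot safely promise a p95 under 500ms for requests that wait on that call. The vendor may usually respond faster, but nothing guarantees it. The exceptions are caching the response or skipping the call on most requests. Don't promise what dependencies won't deliver.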
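One way to reason about the dependency bound is a deliberately crude two-outcome model: a cache hit costs only your own processing time, and a miss costs your time plus the vendor's full SLA time. The function name, the model, and the numbers are assumptions for illustration, not a measurement technique.

```python
def worst_case_p95_ms(own_ms, vendor_ms, cache_hit_ratio):
    """Worst-case p95 to plan for under a two-outcome model.

    Assumptions: every cache hit costs own_ms, every miss costs
    own_ms + vendor_ms, and the vendor uses its full SLA time on a miss.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    # If fewer than 95% of requests are hits, the 95th-percentile request is a miss.
    if cache_hit_ratio < 0.95:
        return own_ms + vendor_ms
    return own_ms

# Hypothetical service: 40 ms of own work, vendor SLA of 500 ms.
print(worst_case_p95_ms(40, 500, 0.0))   # no cache: the vendor call dominates
print(worst_case_p95_ms(40, 500, 0.90))  # 10% of requests miss: p95 is still a miss
print(worst_case_p95_ms(40, 500, 0.97))  # 3% miss: p95 falls on a cache hit
```

The point of the sketch is the threshold: under this model, caching moves p95 below the vendor's number only once the miss rate falls under 5%, and p99 would need a miss rate under 1%.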
Negotiation with product. The question "how much money are we willing to spend to drop p99 from 2s to 1s?" is a real one. Improving tail latency is expensive (caching, replicas, connection-pool tuning, query optimisation), so set the SLO at the point where the next improvement would cost more than it is worth to users.
Version SLOs with features. A new feature with heavier compute may legitimately have a worse SLO than the page it lives on. Document the SLO per endpoint, not as a single global number.
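A minimal sketch of per-endpoint, versioned SLOs checked against load-test output might look like the following. The endpoint paths, thresholds, version labels, and the shape of the measured results are all hypothetical; a real load-test tool will have its own threshold mechanism.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slo:
    endpoint: str
    percentile: int    # e.g. 95 for p95
    threshold_ms: int  # measured latency must stay below this
    version: str       # bumped when the feature behind the endpoint changes

# Hypothetical endpoints and thresholds, for illustration only.
SLOS = [
    Slo("/search", 95, 500, "v3"),
    Slo("/search/recommendations", 95, 1200, "v1"),  # heavier compute, looser SLO
]

def check(slos, measured_ms):
    """Return the SLOs violated by the measured load-test percentiles."""
    return [s for s in slos
            if measured_ms[(s.endpoint, s.percentile)] >= s.threshold_ms]

# Hypothetical load-test results, keyed by (endpoint, percentile).
measured = {("/search", 95): 430, ("/search/recommendations", 95): 1350}
violations = check(SLOS, measured)
for s in violations:
    print(f"{s.endpoint} p{s.percentile} SLO {s.version} violated "
          f"(limit {s.threshold_ms} ms)")
```

Keeping the version next to the threshold makes it clear which release of a feature a given number was negotiated for, so a looser SLO on a heavier endpoint reads as a decision rather than a regression.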