Q38 of 38 · Performance
How do you define and maintain performance SLOs across multiple services in a growing organisation?
Short answer
Define SLOs at the service level, derived from user-facing business requirements rather than infrastructure capabilities. Review and update them quarterly. Tie SLO breaches to an on-call escalation process so they are treated as incidents, not just metrics.
Detail
The common failure mode: SLOs are set once by the first engineer who ran a load test, never reviewed, and gradually become either irrelevant (the system is 10x faster now) or unachievable (the system has grown and the SLO was set against a tiny dataset).
Define SLOs from user behaviour. What latency causes a meaningful increase in abandonment? For a checkout flow, research suggests conversions drop noticeably above 3 s. That abandonment threshold is what informs the SLO, not the question "what latency can our servers achieve?"
Three layers of SLO:
- Product SLO: user-perceived behaviour ("checkout completes in under 3 s for 95% of users in production"). Owned by product.
- Service SLO: per-service technical target ("payment API p95 under 400 ms at 500 RPS"). Owned by the service team.
- Infrastructure SLO: resource-level targets ("database p99 under 50 ms"). Owned by platform.
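As a rough illustration, the three layers can be recorded in one small registry so that every target has an explicit owner. This is a minimal sketch, not a prescribed schema: the `LatencySLO` class, the `Layer` enum, the SLO names, and the `slos_by_owner` helper are all hypothetical, and only the numbers come from the examples above.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    PRODUCT = "product"
    SERVICE = "service"
    INFRASTRUCTURE = "infrastructure"


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical record for one latency SLO."""
    name: str
    layer: Layer
    owner: str
    percentile: float    # e.g. 95.0 for p95
    threshold_ms: float  # latency target at that percentile
    condition: str = ""  # load or scope the target applies to


# Targets copied from the three example bullets; names are illustrative.
SLOS = [
    LatencySLO("checkout-complete", Layer.PRODUCT, "product",
               95.0, 3000.0, "users in production"),
    LatencySLO("payment-api", Layer.SERVICE, "service team",
               95.0, 400.0, "at 500 RPS"),
    LatencySLO("database", Layer.INFRASTRUCTURE, "platform",
               99.0, 50.0),
]


def slos_by_owner(slos):
    """Group SLO names by owning team so each review has a clear owner."""
    grouped = {}
    for slo in slos:
        grouped.setdefault(slo.owner, []).append(slo.name)
    return grouped
```

Keeping the owner on the record itself makes it harder for a target to exist without anyone responsible for reviewing it.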
Review cadence: quarterly review of actual field data (RUM, APM) versus SLO. If production p95 has been 200 ms for 6 months and the SLO is 500 ms, tighten the SLO. A target that loose is never at risk of being breached, so it provides no signal.
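The quarterly comparison can be sketched as a small check over latency samples exported from RUM or APM. The function names and the `tighten_below` ratio of 0.5 are assumptions chosen for illustration, not a standard rule. The percentile uses the nearest-rank method, and a monitoring backend may compute it differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def review_slo(samples_ms, pct, threshold_ms, tighten_below=0.5):
    """Compare observed latency with the SLO threshold.

    tighten_below is an assumed ratio: if the observed percentile is
    under that fraction of the threshold, the SLO is flagged as too loose.
    """
    observed = percentile(samples_ms, pct)
    if observed > threshold_ms:
        verdict = "breached"
    elif observed < tighten_below * threshold_ms:
        verdict = "tighten"
    else:
        verdict = "ok"
    return observed, verdict


# Synthetic samples whose p95 is 200 ms, reviewed against a 500 ms SLO.
samples = [100.0] * 94 + [200.0] * 6
print(review_slo(samples, 95.0, 500.0))
```

With these synthetic samples the check reproduces the case in the text: an observed p95 of 200 ms against a 500 ms target is flagged for tightening.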
Error budget: when a service exhausts its error budget (the SLO is breached for more than X% of a rolling window), feature work freezes and reliability work takes priority. These are the operational teeth that make SLOs more than aspirational numbers.
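The freeze rule can be sketched as follows, assuming the rolling window is split into fixed intervals (for example one flag per minute), each marked as breached or not. The function names are hypothetical, and the allowed percentage is left as a parameter because the right value of X depends on the service. The 5.0 in the usage lines is purely illustrative.

```python
def budget_consumed(breached_flags, allowed_breach_pct):
    """Fraction of the error budget used in the window.

    breached_flags: one bool per interval, True if the SLO was
    breached during that interval.
    allowed_breach_pct: the X in "breached for more than X% of the window".
    Returns 1.0 when the budget is exactly spent.
    """
    if allowed_breach_pct <= 0:
        raise ValueError("allowed_breach_pct must be positive")
    if not breached_flags:
        return 0.0
    breached_pct = 100.0 * sum(breached_flags) / len(breached_flags)
    return breached_pct / allowed_breach_pct


def should_freeze_features(breached_flags, allowed_breach_pct):
    """Freeze once breaches exceed the allowed share of the window."""
    return budget_consumed(breached_flags, allowed_breach_pct) > 1.0


# Illustrative window of 100 intervals with 6 breached, X assumed to be 5.
window = [True] * 6 + [False] * 94
print(should_freeze_features(window, 5.0))
```

Reporting the consumed fraction rather than only a yes/no answer lets a team see a freeze approaching and schedule reliability work before the budget runs out.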