How do you define and maintain performance SLOs across multiple services in a growing organisation?

Question

Accepted Answer

Define SLOs at the service level, derived from user-facing business requirements rather than infrastructure capabilities. Review and update them quarterly. Tie SLO breaches to an on-call escalation process so they are treated as incidents, not just metrics.

The common failure mode: SLOs are set once by the first engineer who ran a load test, never reviewed, and gradually become either irrelevant (the system is 10x faster now) or unachievable (the system has grown and the SLO was set against a tiny dataset).

Define SLOs from user behaviour. What latency causes a meaningful increase in abandonment? For a checkout flow, research suggests conversions drop noticeably above 3 s. That informs the SLO, not "what latency can our servers achieve?"

Three layers of SLO:

Product SLO: user-perceived behaviour ("checkout completes in under 3 s for 95% of users in production"). Owned by product.

Service SLO: per-service technical target ("payment API p95 under 400 ms at 500 RPS"). Owned by the servic
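
As a rough illustration of what checking a service SLO such as "p95 under 400 ms" might look like, here is a minimal Python sketch. The names ServiceSLO, percentile_nearest_rank and is_breached are hypothetical and exist only for this example. The nearest-rank percentile method is an assumption; a real monitoring stack may compute percentiles differently. The "at 500 RPS" load condition is not modelled here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSLO:
    """Hypothetical container for a per-service latency target."""
    name: str
    percentile: float    # e.g. 95.0 for p95
    threshold_ms: float  # latency must stay strictly under this value


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile using the nearest-rank method (an assumption)."""
    if not samples:
        raise ValueError("no latency samples to evaluate")
    ordered = sorted(samples)
    # Multiply before dividing to avoid float error for common percentiles.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_breached(slo, latencies_ms):
    """True when the observed percentile is not under the SLO threshold."""
    observed = percentile_nearest_rank(latencies_ms, slo.percentile)
    return observed >= slo.threshold_ms


# Example target taken from the answer text: payment API p95 under 400 ms.
payment_slo = ServiceSLO(name="payment-api", percentile=95.0, threshold_ms=400.0)

healthy_window = [120.0] * 18 + [380.0, 900.0]   # p95 is 380 ms
degraded_window = [120.0] * 18 + [450.0, 900.0]  # p95 is 450 ms

healthy_breach = is_breached(payment_slo, healthy_window)
degraded_breach = is_breached(payment_slo, degraded_window)
```

In line with the advice to treat breaches as incidents, a True result from is_breached would be the point at which to trigger the on-call escalation rather than only recording a metric.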

// WHAT INTERVIEWERS LOOK FOR