Q13 of 38 · Performance
How do you correlate response time with backend resource saturation?
Performance · Mid · performance, bottleneck, saturation, apm, observability
Short answer
Run the load test while collecting host and service metrics (CPU, memory, IO, DB pool, GC, queue depth) on synchronised timestamps. Plot latency against each resource: the inflection point where latency starts to rise at the same time as a resource reaches its limit identifies the bottleneck.
Detail
Methodology:
- Set up observability before the test. Datadog/Prometheus/New Relic must be capturing the system's metrics at 1-10s resolution: CPU, memory, disk IO, network, DB connection pool, slow query rate, GC pause time, queue depth, thread pool utilisation.
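- Synchronise time. The load tool and the observability stack must be synced to the same NTP source. With metrics at 1-10s resolution, a 30-second drift makes it impossible to line up a latency jump with the resource spike that caused it.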
- Run a stepped ramp. Increase load in stages (50, 100, 200, 400 RPS) with a 5-minute hold each. Each plateau gives a stable measurement window.
- Plot latency vs. each resource. Either eyeball overlaid charts in the dashboard, or pull the data into Python/notebooks and compute the correlation. The signature you're looking for: latency stays flat through the early plateaus, then jumps at the same plateau where a resource hits 80-100%.
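The plateau comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the sample values, the resource names (`cpu`, `db_pool`), the 2x `jump_factor`, and the 80% `saturation_pct` are all assumptions chosen for the example, and real data would come from your load tool and metrics store after joining on timestamp.

```python
from statistics import median

# Hypothetical samples: (plateau RPS, p95 latency in ms, {resource: utilisation %}).
# In practice these rows come from joining load-tool and metrics data on timestamp.
SAMPLES = [
    (50, 120, {"cpu": 22, "db_pool": 30}),
    (50, 118, {"cpu": 24, "db_pool": 31}),
    (100, 125, {"cpu": 41, "db_pool": 55}),
    (100, 123, {"cpu": 43, "db_pool": 58}),
    (200, 131, {"cpu": 58, "db_pool": 70}),
    (200, 135, {"cpu": 60, "db_pool": 72}),
    (400, 910, {"cpu": 63, "db_pool": 100}),
    (400, 1040, {"cpu": 61, "db_pool": 100}),
]


def summarise_plateaus(samples):
    """Return {rps: (median latency, {resource: median utilisation})}."""
    grouped = {}
    for rps, latency, resources in samples:
        grouped.setdefault(rps, []).append((latency, resources))
    summary = {}
    for rps, rows in grouped.items():
        names = rows[0][1].keys()
        summary[rps] = (
            median(lat for lat, _ in rows),
            {n: median(res[n] for _, res in rows) for n in names},
        )
    return summary


def find_inflection(summary, jump_factor=2.0, saturation_pct=80):
    """Find the first plateau whose median latency exceeds jump_factor times
    the previous plateau's, and list resources at or above saturation_pct there."""
    stages = sorted(summary)
    for prev, cur in zip(stages, stages[1:]):
        if summary[cur][0] > jump_factor * summary[prev][0]:
            saturated = sorted(
                name for name, pct in summary[cur][1].items() if pct >= saturation_pct
            )
            return cur, saturated
    return None, []


inflection_rps, suspects = find_inflection(summarise_plateaus(SAMPLES))
print(inflection_rps, suspects)  # 400 ['db_pool']
```

Using the median per plateau keeps one noisy sample from moving the result. A saturated resource at the inflection plateau is a suspect, not a verdict, so it still needs confirming with traces.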
What to look for:
- CPU saturated: the app server is the bottleneck. Scale horizontally or optimise hot code.
- DB connection pool at 100% with idle CPU: requests are waiting for a connection, not for compute. Increase pool size, kill long-running queries, add read replicas.
- GC pauses spiking: heap pressure. Tune GC, add memory, find the leak.
- Network egress maxed: payloads are too large or the protocol is too chatty. Compress, batch, or move the service closer to its callers.
- Disk IO saturated: look at log volume, hot temp tables, or unbatched writes.
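The patterns in this list can be written down as rough triage rules. The sketch below is only an illustration: the metric keys (`cpu_pct`, `db_pool_pct`, `gc_pause_ms`, `net_egress_pct`, `disk_io_pct`) and the thresholds are hypothetical and would need tuning for any real system.

```python
def triage(snapshot, high=90, idle_cpu=50, gc_pause_ms=200):
    """Map one plateau's metric snapshot to (resource, suggested action) pairs.

    Thresholds are illustrative assumptions, not recommended values.
    """
    cpu = snapshot.get("cpu_pct", 0)
    findings = []
    if cpu >= high:
        findings.append(("cpu", "scale horizontally or optimise hot code"))
    # A full pool only points at the database when the app CPU is idle.
    if snapshot.get("db_pool_pct", 0) >= 100 and cpu < idle_cpu:
        findings.append(
            ("db_pool", "increase pool size, kill long-running queries, add read replicas")
        )
    if snapshot.get("gc_pause_ms", 0) >= gc_pause_ms:
        findings.append(("gc", "tune GC, add memory, find the leak"))
    if snapshot.get("net_egress_pct", 0) >= high:
        findings.append(("network", "compress, batch, or move closer"))
    if snapshot.get("disk_io_pct", 0) >= high:
        findings.append(("disk", "check log volume, hot temp tables, unbatched writes"))
    return findings


findings = triage({"cpu_pct": 35, "db_pool_pct": 100, "gc_pause_ms": 40})
print(findings)
```

Rules like these only narrow the search. More than one can fire at once, and a resource can sit near its limit without being the cause of the latency jump.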
APM traces add the per-component view: "this request took 2s — 1.6s in DB query, 0.3s in serialisation, 0.1s in app." That's the smoking gun no aggregate metric provides.
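The same breakdown can be computed from exported span durations. The example below uses the numbers from the quoted request. It assumes the spans are sequential and non-overlapping so that their sum equals the request time, and the span names are hypothetical.

```python
# Hypothetical span durations in milliseconds for one traced request (2s total).
SPANS_MS = {"db_query": 1600, "serialisation": 300, "app": 100}


def span_shares(spans_ms):
    """Return [(component, percent of total time)], largest share first."""
    total = sum(spans_ms.values())
    shares = [(name, 100 * ms / total) for name, ms in spans_ms.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


shares = span_shares(SPANS_MS)
for name, pct in shares:
    print(f"{name}: {pct:.0f}%")
# db_query: 80%
# serialisation: 15%
# app: 5%
```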
// WHAT INTERVIEWERS LOOK FOR
Methodology — stepped ramp, synchronised timestamps, observability across host and service, and using APM traces to confirm. Bonus for naming saturation patterns by resource.
// COMMON PITFALL
Looking only at latency from the load tool side — you see *that* the system slowed but not *why*. Without server-side metrics, every conclusion is a guess.