Q13 of 38 · Performance

How do you correlate response time with backend resource saturation?

Performance · Mid · performance · bottleneck · saturation · apm · observability

Short answer

Run the load test while collecting host and service metrics (CPU, memory, IO, DB pool, GC, queue depth) on synchronised timestamps. Plot latency against each resource: the load level at which latency starts to climb while a resource reaches its limit identifies the bottleneck.

Detail

Methodology:

  1. Set up observability before the test. Datadog/Prometheus/New Relic must be capturing the system's metrics at 1-10s resolution: CPU, memory, disk IO, network, DB connection pool, slow query rate, GC pause time, queue depth, thread pool utilisation.
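  2. Synchronise time. The load tool and the observability stack must share an NTP-synchronised clock. With metrics sampled at 1-10s resolution, a 30-second drift shifts the two series by several samples, so a latency jump can be pinned on the wrong moment or the wrong resource.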
  3. Run a stepped ramp. Increase load in stages (50, 100, 200, 400 RPS) with a 5-minute hold each. Each plateau gives a stable measurement window.
  4. Plot latency vs. each resource. Either eyeball overlaid charts in the dashboard, or pull the data into a Python notebook and compute the correlation. The signature you're looking for: latency stays flat through the first three plateaus, then jumps at the same plateau where a resource reaches 80-100% utilisation.
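
The plateau analysis in steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the sample layout, the metric names (`cpu_pct`, `db_pool_pct`), the 1.5x latency-jump factor and the synthetic data are all assumptions made for the example. The 80% limit comes from the 80-100% range above.

```python
from statistics import mean, median

# Hypothetical schedule matching the stepped ramp above:
# (start_s, end_s, target_rps), 5-minute hold per stage.
PLATEAUS = [(0, 300, 50), (300, 600, 100), (600, 900, 200), (900, 1200, 400)]

def summarise(samples, plateaus=PLATEAUS):
    """Collapse timestamp-aligned samples into one row per plateau.

    samples: list of dicts with 't' (seconds since test start),
    'latency_ms', and any number of resource utilisations (0-100).
    """
    rows = []
    for start, end, rps in plateaus:
        window = [s for s in samples if start <= s["t"] < end]
        if not window:
            continue
        row = {"rps": rps,
               "latency_ms": median(s["latency_ms"] for s in window)}
        for key in window[0]:
            if key not in ("t", "latency_ms"):
                row[key] = mean(s[key] for s in window)
        rows.append(row)
    return rows

def find_knee(rows, jump=1.5, limit=80.0):
    """Return (rps, saturated_resources) for the first plateau whose median
    latency exceeds `jump` times the previous plateau's, or None."""
    for prev, cur in zip(rows, rows[1:]):
        if cur["latency_ms"] > jump * prev["latency_ms"]:
            hot = sorted(k for k, v in cur.items()
                         if k not in ("rps", "latency_ms") and v >= limit)
            return cur["rps"], hot
    return None

# Synthetic data: latency is flat for three plateaus, then jumps at 400 RPS
# while the DB pool reaches 100% and CPU stays low.
samples = [
    {"t": t,
     "latency_ms": 120 if rps < 400 else 900,
     "cpu_pct": 35,
     "db_pool_pct": min(100, rps / 4)}
    for start, end, rps in PLATEAUS
    for t in range(start, end, 10)
]

print(find_knee(summarise(samples)))
```

With real data, `samples` would come from joining the load tool's latency export with the metrics export on timestamp. The join only makes sense once the clocks from step 2 agree.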

What to look for:

  • CPU saturated — the app server is the bottleneck. Scale horizontally or optimise hot code.
  • DB connection pool at 100% with idle CPU: requests are waiting for a connection, not for compute. Increase the pool size (if the database has headroom), kill long-running queries, or add read replicas.
  • GC pauses spiking — heap pressure; tune GC, add memory, find the leak.
  • Network egress maxed — payload too large or too chatty; compress, batch, or move closer.
  • Disk IO saturated — log volume, hot temp tables, or unbatched writes.
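
The patterns above can be encoded as a first-pass triage rule set. The sketch below is illustrative only: the metric names, the 80% limit, the 50% "idle CPU" cut-off and the 200 ms GC pause threshold are assumptions, and real thresholds depend on the system under test.

```python
def suspect_bottlenecks(m, limit=80.0):
    """Map per-plateau metric averages to the saturation patterns listed above.

    m: dict of hypothetical metric names; *_pct values are 0-100.
    Returns a list of short findings, possibly empty.
    """
    findings = []
    cpu = m.get("cpu_pct", 0)
    if cpu >= limit:
        findings.append("cpu: app server is the bottleneck")
    # Pool exhausted while CPU is mostly idle: requests wait on connections.
    if m.get("db_pool_pct", 0) >= 100 and cpu < 50:
        findings.append("db_pool: connection pool exhausted")
    if m.get("gc_pause_ms", 0) >= 200:  # assumed pause threshold
        findings.append("gc: heap pressure")
    if m.get("net_egress_pct", 0) >= limit:
        findings.append("network: payload too large or too chatty")
    if m.get("disk_io_pct", 0) >= limit:
        findings.append("disk: log volume, temp tables or unbatched writes")
    return findings

print(suspect_bottlenecks({"cpu_pct": 20, "db_pool_pct": 100}))
```

A rule set like this only narrows the search. More than one rule can fire at once, and each finding still needs confirming against traces.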

APM traces add the per-component view: "this request took 2s — 1.6s in DB query, 0.3s in serialisation, 0.1s in app." That's the smoking gun no aggregate metric provides.
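
A trace breakdown like the one quoted is simple arithmetic over span durations. The sketch below assumes the spans have already been exported from the APM tool as a name-to-milliseconds mapping. The span names and the `breakdown` helper are hypothetical, and the durations mirror the 2s example above.

```python
def breakdown(spans_ms):
    """Return (name, duration_ms, percent_of_total) tuples, slowest first."""
    total = sum(spans_ms.values())
    if total == 0:
        return []
    return sorted(
        ((name, dur, round(100 * dur / total)) for name, dur in spans_ms.items()),
        key=lambda row: row[1],
        reverse=True,
    )

# The 2s request from the example: 1.6s DB, 0.3s serialisation, 0.1s app.
spans_ms = {"app": 100, "db_query": 1600, "serialisation": 300}
for name, dur, pct in breakdown(spans_ms):
    print(f"{name:<14}{dur:>6} ms {pct:>3}%")
```

Here the DB query accounts for 80% of the request, which points the investigation at the query or the connection pool and away from application code.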

// WHAT INTERVIEWERS LOOK FOR

Methodology — stepped ramp, synchronised timestamps, observability across host and service, and using APM traces to confirm. Bonus for naming saturation patterns by resource.

// COMMON PITFALL

Looking only at latency from the load tool side — you see *that* the system slowed but not *why*. Without server-side metrics, every conclusion is a guess.