Baseline & Bottleneck Analysis

Analyse a set of performance test results, establish a baseline, identify bottleneck hypotheses with supporting evidence, and recommend follow-up tests — without access to source code or infrastructure metrics.

Role

Performance QA

Difficulty

Advanced

Time limit

90 min

Scenario

You have received the results of three load test runs against the ShopFlow checkout API. Run A was a 50-VU baseline captured two weeks ago. Run B is the current run at 50 VUs after a deployment. Run C is a 200-VU stress run from yesterday. You did not run the tests yourself — you are analysing the exported result data. Your task is to establish a baseline from Run A, compare Run B to it, identify likely bottlenecks from the Run C results, and recommend the next tests the team should run.

Requirements

1. Summarise the Run A baseline: state the key metrics (P50, P90, P95, P99, throughput, error rate) and declare it as the accepted baseline with a brief justification
2. Compare Run B to the baseline: calculate the percentage change for P90 and P95, state whether performance has regressed or improved, and identify which endpoint(s) show the most significant change
3. Analyse Run C (200-VU stress): identify where error rate or response time degrades non-linearly (indicating saturation), and state at what approximate VU level the system appears to reach its limit
4. Formulate at least three bottleneck hypotheses based on the Run C data; each hypothesis must cite specific evidence from the results (metric values, endpoint names, error types) rather than general guesses
5. List at least three follow-up questions you would ask the engineering team to validate or rule out each bottleneck hypothesis
6. Recommend at least two specific follow-up tests (with load levels and focus areas) to isolate the most likely bottleneck
7. Write a one-paragraph risk summary suitable for a stakeholder update: what the results mean for the planned go-live, and what must be resolved before launch

Starter data

›Run A — Baseline (50 VUs, 10-min steady state, pre-deployment): GET /checkout/summary P50=120ms P90=310ms P95=420ms P99=890ms; POST /checkout/submit P50=310ms P90=780ms P95=1050ms P99=2100ms; throughput=48.2 req/s; error rate=0.1%
›Run B — Post-deployment (50 VUs, 10-min steady state, new release): GET /checkout/summary P50=125ms P90=340ms P95=510ms P99=1400ms; POST /checkout/submit P50=315ms P90=820ms P95=1380ms P99=3500ms; throughput=47.8 req/s; error rate=0.3%
›Run C — Stress test (200 VUs, 5-min steady state, ramp 2 min): GET /checkout/summary P50=180ms P90=620ms P95=1100ms P99=4200ms; POST /checkout/submit P50=1200ms P90=4800ms P95=8500ms P99=18000ms; throughput=62.1 req/s; error rate=8.4% (breakdown: 6.1% HTTP 503, 1.8% HTTP 504, 0.5% connection timeout)
›Additional context: the deployment in Run B included a new discount-calculation service added to the POST /checkout/submit call path; the 503 errors in Run C all originate from POST /checkout/submit; the infrastructure team reports no CPU or memory alerts were triggered during Run C
›SLA: P95 ≤ 1 s for all checkout endpoints at 100 VUs peak; error rate < 0.5% at peak load
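
The Run A to Run B comparison is plain percentage-change arithmetic, and a small script keeps it consistent across endpoints. The following is a minimal sketch, assuming the starter data is keyed in by hand; the names RUN_A, RUN_B, pct_change and compare_runs are illustrative and not part of any test tool.

```python
# Latency percentiles in milliseconds, copied from the Run A and Run B starter data.
RUN_A = {
    "GET /checkout/summary": {"P50": 120, "P90": 310, "P95": 420, "P99": 890},
    "POST /checkout/submit": {"P50": 310, "P90": 780, "P95": 1050, "P99": 2100},
}
RUN_B = {
    "GET /checkout/summary": {"P50": 125, "P90": 340, "P95": 510, "P99": 1400},
    "POST /checkout/submit": {"P50": 315, "P90": 820, "P95": 1380, "P99": 3500},
}


def pct_change(baseline: float, current: float) -> float:
    """Percentage change of current relative to baseline (positive = slower)."""
    return (current - baseline) / baseline * 100.0


def compare_runs(baseline: dict, current: dict) -> list:
    """Return (endpoint, metric, percent change rounded to 0.1) for every metric."""
    rows = []
    for endpoint, metrics in baseline.items():
        for metric, base_value in metrics.items():
            change = pct_change(base_value, current[endpoint][metric])
            rows.append((endpoint, metric, round(change, 1)))
    return rows


if __name__ == "__main__":
    for endpoint, metric, change in compare_runs(RUN_A, RUN_B):
        print(f"{endpoint:24} {metric}: {change:+.1f}%")
```

Rounding to one decimal is a presentation choice; the comparison table can round further, as long as the same rule is applied to every row.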

Expected deliverables

✓A baseline summary table: metric name, Run A value, and a one-sentence justification for accepting it as the baseline
✓A Run A vs Run B comparison table: metric, Run A value, Run B value, percentage change, and a regression/improvement verdict per metric; a sentence naming the endpoint with the most significant regression
✓A Run C analysis section: a metrics table for all four percentiles + throughput + error rate; an identification of the saturation point with supporting evidence; an error breakdown table
✓A bottleneck hypotheses section: at least three hypotheses, each formatted as: Hypothesis — Evidence — Likelihood (High/Medium/Low)
✓A follow-up questions list (minimum three questions addressed to the engineering team)
✓A follow-up test recommendations list (minimum two, each specifying load level, endpoint focus, and what the test would prove or disprove)
✓A stakeholder risk summary paragraph (4–6 sentences): go-live risk level, what is blocking, and what must be resolved

Evaluation rubric

Dimension	What reviewers look for
Evidence-based analysis	Every bottleneck hypothesis must cite specific numbers from the result data, not general statements like 'the database might be slow'. A strong hypothesis reads: 'POST /checkout/submit P95 degraded 710% between 50 VUs and 200 VUs (1050 ms → 8500 ms) while GET /checkout/summary degraded only 162% (420 ms → 1100 ms), suggesting the bottleneck is within the POST handler or its dependencies, not the network or load balancer layer.' Hypotheses without evidence citations are not acceptable.
Correct metric reading	Are percentiles read accurately from the data? P95 means 95% of requests completed within that time — the worst 5% were slower. Is the throughput plateau (62.1 req/s at 200 VUs vs. 48.2 req/s at 50 VUs — only 29% increase for a 4× VU count) correctly identified as a sign of saturation? Is the error breakdown (503 vs 504 vs timeout) read as distinct failure modes pointing to different root causes?
Plausible bottleneck reasoning	Do the hypotheses match the evidence? The relevant facts are: (a) P95 of POST /checkout/submit increased about 8× under stress (1050 ms → 8500 ms) while GET /checkout/summary P95 rose only about 2.6× (420 ms → 1100 ms); (b) the 503 errors all come from the POST path; (c) the new discount-calculation service was added to the POST path; (d) no CPU/memory alerts fired. Given these, the most plausible hypotheses are connection pool exhaustion in the discount service, or an external API call in the POST path with no timeout. A hypothesis blaming the database for GET slowness is not well-supported by this data.
Actionable next steps	Are follow-up test recommendations specific and testable? 'Run a load test' is not specific enough. A good recommendation reads: 'Run an isolation test targeting POST /checkout/submit alone at 50 VUs and 100 VUs, with the discount-calculation service call mocked, to determine whether the bottleneck is inside or outside the new service.' Follow-up questions should be answerable by the engineering team and should directly inform a specific hypothesis.
Risk communication for stakeholders	Is the risk summary written for a non-technical audience (product manager, engineering lead) without omitting the key facts? It should name the specific risk ('POST /checkout/submit exceeds the 1-second P95 SLA at 200 VUs'), the go-live implication ('the planned launch at peak 200 VUs will breach the SLA'), and the path forward ('bottleneck must be isolated and resolved before launch; one additional load test run is required after the fix'). Avoid jargon-heavy summaries that obscure the severity.
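
The saturation and degradation figures quoted in this rubric can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming Run A and Run C are comparable enough to contrast directly; scaling_efficiency is an illustrative helper name, not a standard metric from any tool.

```python
def pct_change(baseline: float, current: float) -> float:
    """Percentage change of current relative to baseline."""
    return (current - baseline) / baseline * 100.0


def scaling_efficiency(base_vus: int, base_rps: float,
                       stress_vus: int, stress_rps: float) -> float:
    """Observed throughput multiplier divided by the load multiplier.

    1.0 means throughput scaled linearly with VUs; values well below 1.0
    mean the extra load was absorbed as queueing or errors instead.
    """
    return (stress_rps / base_rps) / (stress_vus / base_vus)


# Values taken from Run A (50 VUs) and Run C (200 VUs).
throughput_gain = pct_change(48.2, 62.1)
efficiency = scaling_efficiency(50, 48.2, 200, 62.1)
post_p95_change = pct_change(1050, 8500)
get_p95_change = pct_change(420, 1100)

if __name__ == "__main__":
    print(f"Throughput gain for 4x VUs: {throughput_gain:+.0f}%")
    print(f"Scaling efficiency: {efficiency:.2f}")
    print(f"POST P95 change: {post_p95_change:+.0f}%")
    print(f"GET P95 change: {get_p95_change:+.0f}%")
```

An efficiency of roughly 0.32 says the system delivered about a third of the throughput that linear scaling would predict, which is the same saturation signal as the 29% gain expressed on a different scale.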

Sample solution outline

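›Baseline summary (Run A): GET /checkout/summary P90=310ms, P95=420ms; POST /checkout/submit P90=780ms, P95=1050ms; throughput=48.2 req/s; error rate=0.1%. Accepted as baseline because it is a stable 10-minute run at the target VU count, the error rate is well within the 0.5% SLA limit, and it is the pre-deployment reference for later comparisons. Note that POST P95 (1050ms) already sits slightly above the 1 s SLA threshold
›Run A vs Run B regression: P95 GET /checkout/summary +21% (420→510ms, within the 1 s SLA threshold); P95 POST /checkout/submit +31% (1050→1380ms, above the 1 s SLA threshold even at 50 VUs, and the baseline was already marginally over it); P99 POST +67% (2100→3500ms). P90 moved far less (GET +10%, POST +5%), so the regression is concentrated in the tail. The most significant change is POST P99, suggesting the new discount-calculation service adds tail latency under moderate load
›Run C saturation analysis: throughput increased only 29% (48.2→62.1 req/s) when VUs increased 4× (50→200), a classic saturation signal; POST /checkout/submit P95 = 8500ms, 8.5× the 1 s SLA limit; error rate 8.4% (6.1% HTTP 503), with all 503s originating from the POST path; the absence of CPU/memory alerts suggests an application-layer bottleneck (thread pool, connection pool, or external API) rather than infrastructure capacity. With only two load levels measured, the data bounds the saturation point between 50 and 200 VUs without pinpointing it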
›Hypothesis 1: Connection pool exhaustion in the discount-calculation service. Evidence: the 503 errors (6.1% of requests) originate exclusively from POST /checkout/submit, the endpoint whose call path gained the discount service; HTTP 503 (Service Unavailable) is consistent with a dependency rejecting requests once its pool is exhausted; no CPU or memory alerts fired, so raw compute capacity is unlikely to be the limit. Likelihood: High
›Hypothesis 2: Unguarded synchronous external API call in the checkout submit path. Evidence: P99 of POST grew from 2100ms (50 VUs) to 18000ms (200 VUs), roughly an 8.6× increase for a 4× VU increase, which suggests queuing behind a single bottleneck rather than linear degradation. Likelihood: Medium
›Hypothesis 3: Database connection pool exhaustion at the checkout service level. Evidence: GET /checkout/summary showed only 162% P95 degradation (420→1100ms) versus 710% for POST (1050→8500ms), and no CPU or memory alerts fired. If both endpoints share the same database, a saturated DB pool would be expected to hit GET harder than this, so the asymmetry is more consistent with the discount-service hypothesis than with the DB. Likelihood: Low
›Follow-up questions: (1) What is the connection pool size configured for the discount-calculation service, and what does its health endpoint show during load? (2) Are the 503 responses returned by the discount service itself or by the load balancer in front of it? (3) Does POST /checkout/submit have a timeout configured for the discount service call, and if so what is it?
›Follow-up tests: (1) Isolation test — run POST /checkout/submit alone at 100 VUs with discount service mocked: if P95 drops to < 500ms, the bottleneck is confirmed inside the discount service. (2) Stepped load test — ramp from 50 to 200 VUs in 25-VU increments, 3 minutes per step: identify the exact VU count at which error rate first exceeds 1% to find the tipping point
›Stakeholder risk summary: The post-deployment load test shows a 31% P95 regression on the checkout submit path at 50 VUs, and the 200-VU stress test reveals the system cannot sustain the planned peak load — POST /checkout/submit P95 reaches 8,500ms against a 1,000ms SLA and generates an 8.4% error rate. The most likely cause is a resource bottleneck introduced with the new discount-calculation service. Launch at the planned 200-VU peak is not recommended until this bottleneck is isolated and resolved. The team needs one isolation test and one regression run after the fix — estimated 2–3 days of engineering investigation and re-test time.
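
The stepped load test recommended in the outline (50 to 200 VUs in 25-VU increments, 3 minutes per step) can be written down as a tool-agnostic plan before it is translated into any particular load tool's configuration. The following is a minimal sketch; stepped_plan and first_step_over are hypothetical helper names, and the per-step error rates they would consume do not exist until the test has been run.

```python
def stepped_plan(start_vus: int, end_vus: int, step_vus: int,
                 minutes_per_step: int) -> list:
    """Return (target_vus, hold_minutes) pairs from start_vus up to end_vus.

    end_vus is included only when it falls on a step boundary.
    """
    if step_vus <= 0 or end_vus < start_vus:
        raise ValueError("step_vus must be positive and end_vus >= start_vus")
    return [(vus, minutes_per_step)
            for vus in range(start_vus, end_vus + 1, step_vus)]


def first_step_over(error_rate_by_vus: dict, threshold_pct: float):
    """Lowest VU level whose error rate exceeds threshold_pct, or None."""
    for vus in sorted(error_rate_by_vus):
        if error_rate_by_vus[vus] > threshold_pct:
            return vus
    return None


# The plan described in the outline: 50 -> 200 VUs, 25-VU steps, 3 min each.
plan = stepped_plan(50, 200, 25, 3)
total_minutes = sum(minutes for _, minutes in plan)

if __name__ == "__main__":
    for vus, minutes in plan:
        print(f"hold {vus:3d} VUs for {minutes} min")
    print(f"{len(plan)} steps, {total_minutes} min of hold time")
```

This gives seven steps and 21 minutes of hold time, excluding any ramp between steps. Once per-step error rates are available, first_step_over with a 1% threshold returns the tipping point the recommendation asks for.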

Common mistakes

Citing only absolute values without percentage changes — '510ms vs 420ms' is harder to assess than '+21%; within SLA' — always express regression as a percentage relative to the baseline
Treating the Run C error rate as a generic 'things went wrong' finding rather than decomposing the error types — 503 (upstream unavailable), 504 (gateway timeout), and connection timeout point to different root causes and require different investigations
Writing hypotheses without evidence citations: 'the database might be slow' is not a hypothesis; 'POST /checkout/submit P95 degraded about 8× (1050ms → 8500ms) while GET degraded only about 2.6× (420ms → 1100ms), suggesting the bottleneck is in the POST-specific call path, not shared infrastructure' is a hypothesis
Ignoring the throughput plateau — 62 req/s at 200 VUs vs 48 req/s at 50 VUs (only 29% gain for 4× the load) is a critical saturation signal; candidates who only look at response time miss this
Recommending 'run more load tests' without specifying VU level, endpoint focus, and what the test would prove — a follow-up test recommendation must be specific enough to execute without further clarification
Writing a stakeholder summary that only describes the data without stating the business implication — the stakeholder needs to know: can we launch? if not, what must change? in how long?
Assuming infrastructure is the bottleneck because the error rate is high — in this scenario, no CPU or memory alerts fired, which is a strong counter-signal; the bottleneck is application-layer (connection pool, thread pool, external call), not server capacity
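
Decomposing the Run C error rate, as the mistakes above recommend, takes only a few lines. The following is a minimal sketch, assuming the breakdown figures from the starter data; error_shares is an illustrative helper name.

```python
# Run C error breakdown, as a percentage of all requests (from the starter data).
ERROR_BREAKDOWN_PCT = {
    "HTTP 503": 6.1,
    "HTTP 504": 1.8,
    "connection timeout": 0.5,
}
TOTAL_ERROR_RATE_PCT = 8.4


def error_shares(breakdown: dict) -> dict:
    """Share of each error type within all errors, in percent (rounded to 0.1)."""
    total = sum(breakdown.values())
    return {kind: round(rate / total * 100.0, 1)
            for kind, rate in breakdown.items()}


shares = error_shares(ERROR_BREAKDOWN_PCT)
# Sanity check: the breakdown should add up to the reported total error rate.
breakdown_matches_total = (
    abs(sum(ERROR_BREAKDOWN_PCT.values()) - TOTAL_ERROR_RATE_PCT) < 1e-9
)

if __name__ == "__main__":
    for kind, share in shares.items():
        print(f"{kind:20} {share:5.1f}% of errors")
    print("breakdown matches total:", breakdown_matches_total)
```

Roughly 73% of all errors are 503s, which is why the hypothesis tied to the 503 source deserves the first investigation, with the 504s and connection timeouts tracked as separate failure modes.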

Submission checklist

Baseline summary table with Run A metrics and acceptance justification
Run A vs Run B comparison table with percentage changes and regression verdict per metric
Run C analysis section: metrics table, saturation point identification, error breakdown
Minimum three bottleneck hypotheses with evidence citations and likelihood rating
Minimum three follow-up questions for the engineering team
Minimum two follow-up test recommendations with load level and focus area
Stakeholder risk summary paragraph naming the risk, SLA breach, and path forward
No unsupported hypotheses — every claim must reference specific metric values from the data

Extension ideas

+Build a comparison chart (table or ASCII graph) showing P95 response time across Run A, Run B, and Run C for each endpoint — making the degradation trend visible at a glance
+Write a second stakeholder summary variant: one for the engineering lead (technical detail, root-cause hypotheses) and one for the product manager (business risk, go-live recommendation, timeline) — practising audience-appropriate communication
+Add a 'what a good result looks like' section: define what the system's Run D results must show (after the fix) to be considered resolved — specific threshold values and acceptance criteria — turning the analysis into an acceptance test plan
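
For the acceptance-criteria extension, the SLA thresholds can be encoded as an automated check. The following is a minimal sketch, assuming only the two SLA limits from the starter data (P95 ≤ 1 s, error rate < 0.5%); sla_violations is a hypothetical helper. The SLA is defined at 100 VUs peak and none of the three runs used that load level, so applying the check to Run B and Run C is illustrative only.

```python
SLA_P95_MS = 1000.0          # P95 must be <= 1 s for every checkout endpoint
SLA_ERROR_RATE_PCT = 0.5     # error rate must be strictly below 0.5%


def sla_violations(p95_ms_by_endpoint: dict, error_rate_pct: float) -> list:
    """Return human-readable SLA violations; an empty list means the run passes."""
    violations = []
    for endpoint, p95 in p95_ms_by_endpoint.items():
        if p95 > SLA_P95_MS:
            violations.append(
                f"{endpoint}: P95 {p95:.0f} ms exceeds {SLA_P95_MS:.0f} ms")
    if error_rate_pct >= SLA_ERROR_RATE_PCT:
        violations.append(
            f"error rate {error_rate_pct}% is not below {SLA_ERROR_RATE_PCT}%")
    return violations


# Existing runs, for illustration (Run B at 50 VUs, Run C at 200 VUs).
run_b = sla_violations(
    {"GET /checkout/summary": 510, "POST /checkout/submit": 1380}, 0.3)
run_c = sla_violations(
    {"GET /checkout/summary": 1100, "POST /checkout/submit": 8500}, 8.4)

if __name__ == "__main__":
    for name, result in (("Run B", run_b), ("Run C", run_c)):
        print(name, "PASS" if not result else "FAIL")
        for line in result:
            print("  -", line)
```

A Run D acceptance plan would feed the post-fix results at the agreed peak load into the same check and require an empty violations list.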