Performance Testing
What a QA engineer needs to test how a system behaves under load: the test types and what each answers, the metrics that actually matter (hint: not the average), how to design a load test that produces a real signal, and the line between back-end load testing and front-end web performance. Pair this with the API Testing Concepts sheet for the request side.
The test types
"Performance testing" is an umbrella. Each type answers a different question — pick the one that matches your risk.
| Type | Question it answers | Shape of the load |
|---|---|---|
| Load | Does it meet its targets at expected traffic? | Ramp to expected peak, hold |
| Stress | Where does it break, and how? | Push past capacity until it fails |
| Soak (endurance) | Does it degrade over time? | Sustained moderate load for hours |
| Spike | Does it survive a sudden surge? | Sharp jump up, then down |
| Volume | Does it cope with large data? | Big datasets/payloads, not just traffic |
Most teams start with a load test against a target, then add soak (catches memory leaks) and spike (catches autoscaling gaps) as the system becomes more critical.
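The load shapes in the table can be sketched as stage lists, similar in spirit to the staged ramps that several load tools accept. This is a minimal illustration, not any tool's actual configuration format: the `PROFILES` values, the `Stage` type, and the `target_vus` helper are all hypothetical, and the durations and VU counts are placeholders. Volume testing is left out because it varies data size rather than traffic shape.

```python
from typing import List, Tuple

# Each stage is (duration_seconds, target_vus); load moves linearly toward the target.
Stage = Tuple[int, int]

# Hypothetical profiles: the numbers are placeholders, not recommendations.
PROFILES = {
    "load": [(300, 100), (900, 100), (120, 0)],            # ramp to expected peak, hold, ramp down
    "stress": [(300, 100), (300, 200), (300, 400)],        # keep pushing past the expected peak
    "soak": [(300, 60), (14400, 60), (120, 0)],            # moderate load held for hours
    "spike": [(60, 20), (10, 400), (60, 400), (10, 20)],   # sharp jump up, then back down
}


def target_vus(stages: List[Stage], t: float, start_vus: int = 0) -> int:
    """Return the VU target at t seconds into the profile (linear interpolation)."""
    prev = start_vus
    elapsed = 0.0
    for duration, target in stages:
        if t < elapsed + duration:
            fraction = (t - elapsed) / duration
            return round(prev + (target - prev) * fraction)
        elapsed += duration
        prev = target
    return prev  # past the end of the profile: stay at the final target


if __name__ == "__main__":
    for name, stages in PROFILES.items():
        print(name, [target_vus(stages, t) for t in (0, 65, 300, 600)])
```

Writing the shape down as data makes it easy to review whether the test actually matches the question being asked before any load is generated.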
The metrics that matter
The single most important habit: report percentiles, not averages. An average hides the slow tail that users actually feel.
| Metric | What it tells you |
|---|---|
| Throughput (req/s, TPS) | How much work the system handles |
| Latency / response time | How long each request takes |
| p95 / p99 (percentiles) | The latency that 95% / 99% of requests stay at or under; the slowest 5% / 1% are worse. This is the real signal |
| Error rate | Share of requests failing under load |
| Concurrency / active VUs | How many users are hitting it at once |
Why averages lie:
100 requests: 98 at 100ms, 2 at 5000ms
average = 198ms -> looks fine
p99 = 5000ms -> the slowest users wait 5s
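The same effect can be reproduced on a larger sample. The sketch below uses the nearest-rank percentile definition, which is an assumption: load tools and statistics libraries use several slightly different definitions, so their numbers can differ near the tail. The latency data is made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 970 fast requests and 30 slow ones.
latencies = [100.0] * 970 + [5000.0] * 30

mean = sum(latencies) / len(latencies)

if __name__ == "__main__":
    print(f"mean = {mean:.0f} ms")                        # mean = 247 ms
    print(f"p95  = {percentile(latencies, 95):.0f} ms")   # p95  = 100 ms
    print(f"p99  = {percentile(latencies, 99):.0f} ms")   # p99  = 5000 ms
```

Here the mean and even p95 look healthy while p99 exposes the slow tail, which is why it helps to report more than one percentile.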
Report p95/p99 and error rate. The average is the least useful number.
Virtual users, concurrency, and ramp-up
A load tool simulates virtual users (VUs) — concurrent simulated clients. Two knobs shape the test:
- Concurrency — how many VUs run at once (models real simultaneous users). See Concurrent Users.
- Ramp-up — how fast you add VUs. Ramping gradually finds the point where latency degrades; slamming all VUs at once only tells you pass/fail. See Ramp-up Period.
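Think in terms of arrival rate (requests per second) where you can. With a fixed VU count, each VU waits for its response before sending the next request, so the request rate falls as the system slows down; a fixed arrival rate keeps the offered load steady when response times change mid-test.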
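A rough back-of-the-envelope sketch of why the two models behave differently. It assumes an idealised fixed-VU loop (request, wait for response, think, repeat) and uses Little's Law for the arrival-rate case; the function names and all numbers are hypothetical.

```python
def closed_model_rate(vus: int, response_s: float, think_s: float) -> float:
    """Requests/s generated by a fixed pool of VUs, each looping request -> think."""
    return vus / (response_s + think_s)


def open_model_concurrency(arrival_rate: float, response_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate x response time."""
    return arrival_rate * response_s


if __name__ == "__main__":
    # Hypothetical scenario: response time degrades from 0.2 s to 1.0 s mid-test.
    for response_s in (0.2, 1.0):
        rate = closed_model_rate(100, response_s, think_s=1.0)
        in_flight = open_model_concurrency(50.0, response_s)
        print(f"response={response_s}s  100 VUs send {rate:.1f} req/s  "
              f"50 req/s keeps {in_flight:.0f} requests in flight")
```

With 100 VUs the generated rate drops from about 83 req/s to 50 req/s as the system slows, so the test eases off just when it should be pressing. At a fixed 50 req/s the rate holds and the number of in-flight requests grows instead.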
Designing a load test
A repeatable shape that produces a real signal:
- Set targets — define the SLOs first (e.g. p95 < 500ms, error rate < 0.1% at 1,000 req/s). A test with no target can't pass or fail.
- Establish a baseline — measure the system at low, known load so you have something to compare against. See Baseline Testing.
- Model realistic traffic — mix the endpoints/journeys real users hit, with realistic think-time and data. One-endpoint hammering misrepresents the system.
- Ramp gradually — increase load in steps and watch where metrics turn.
- Hold and observe — sustain peak long enough to see steady-state behaviour, not just the spike of warm-up.
- Analyse against targets — compare p95/p99 and error rate to the SLOs, not to a vibe.
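The "set targets" and "analyse against targets" steps can be made mechanical. Below is a minimal sketch that assumes the run has already been summarised into a few numbers; the `Slo` and `RunSummary` types and the `evaluate` function are hypothetical names, not part of any load tool, and the thresholds mirror the example targets above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slo:
    p95_ms: float          # p95 latency must stay below this
    max_error_rate: float  # failed/total must stay below this, e.g. 0.001 for 0.1%
    min_throughput: float  # req/s the system must sustain


@dataclass
class RunSummary:
    p95_ms: float
    errors: int
    requests: int
    duration_s: float


def evaluate(run: RunSummary, slo: Slo) -> List[str]:
    """Return the violated targets; an empty list means the run passed."""
    failures = []
    # Treat a run with no requests as fully failed rather than dividing by zero.
    error_rate = run.errors / run.requests if run.requests else 1.0
    throughput = run.requests / run.duration_s if run.duration_s > 0 else 0.0
    if run.p95_ms >= slo.p95_ms:
        failures.append(f"p95 {run.p95_ms:.0f}ms is not below {slo.p95_ms:.0f}ms")
    if error_rate >= slo.max_error_rate:
        failures.append(f"error rate {error_rate:.3%} is not below {slo.max_error_rate:.3%}")
    if throughput < slo.min_throughput:
        failures.append(f"throughput {throughput:.0f} req/s is below {slo.min_throughput:.0f} req/s")
    return failures


if __name__ == "__main__":
    slo = Slo(p95_ms=500, max_error_rate=0.001, min_throughput=1000)
    run = RunSummary(p95_ms=620, errors=1200, requests=600_000, duration_s=600)
    print(evaluate(run, slo) or "PASS")
```

Returning the list of violations, rather than a bare boolean, keeps the pass/fail decision explicit while still saying which target was missed.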
Use realistic, varied test data — reusing one record hits caches and flatters the result.
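A realistic traffic model can be sketched as a weighted choice of journeys, with a random think-time and a varied record per iteration. The journey names, weights, think-time range, and user pool below are invented for illustration; in practice they would be derived from production traffic.

```python
import random
from collections import Counter
from typing import List, Tuple

# Hypothetical journey mix: the weights are placeholders, not measured values.
JOURNEYS = {"browse_catalog": 0.60, "search": 0.25, "checkout": 0.15}


def next_action(rng: random.Random, user_ids: List[int]) -> Tuple[str, int, float]:
    """Pick the next journey, the test user to run it as, and a think-time in seconds."""
    journey = rng.choices(list(JOURNEYS), weights=list(JOURNEYS.values()), k=1)[0]
    user_id = rng.choice(user_ids)   # vary the record so caches are not artificially warm
    think_s = rng.uniform(1.0, 5.0)  # pause between actions, as a real user would
    return journey, user_id, think_s


if __name__ == "__main__":
    rng = random.Random(42)          # seeded so a run can be reproduced
    users = list(range(1, 5001))     # stand-in for a pool of distinct test accounts
    actions = [next_action(rng, users) for _ in range(10_000)]
    print(Counter(journey for journey, _, _ in actions))
```

Seeding the generator keeps the mix repeatable between runs, which matters when comparing a new result against a baseline.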
Reading results and finding bottlenecks
A failed target is the start, not the answer. Bottlenecks usually sit in one of a few places:
- Application — slow code paths, lock contention, thread/connection-pool exhaustion.
- Database — unindexed queries, the N+1 pattern, connection limits (often the first wall).
- Infrastructure — CPU/memory saturation, network, undersized instances.
- External dependencies — a downstream API or queue that caps your throughput.
Correlate the load tool's metrics with server-side observability (APM, Grafana dashboards) — the load tool tells you that it's slow; the server-side data tells you where. Watch for the knee in the curve: the concurrency level where latency rises sharply while throughput plateaus.
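One simple way to spot the knee in stepped results is to look for the first step where throughput barely grows while p95 jumps. The sketch below is a heuristic under stated assumptions: the 5% and 50% thresholds are arbitrary, the `find_knee` helper is hypothetical, and the step data is invented.

```python
from typing import List, Optional, Tuple

# One entry per load step: (concurrency, throughput in req/s, p95 latency in ms).
Step = Tuple[int, float, float]


def find_knee(
    steps: List[Step],
    max_throughput_gain: float = 0.05,  # assumption: under 5% growth counts as a plateau
    min_latency_growth: float = 0.50,   # assumption: over 50% growth counts as a sharp rise
) -> Optional[int]:
    """Return the first concurrency level where throughput plateaus while p95 rises sharply."""
    for (_, prev_rps, prev_p95), (vus, rps, p95) in zip(steps, steps[1:]):
        throughput_gain = (rps - prev_rps) / prev_rps
        latency_growth = (p95 - prev_p95) / prev_p95
        if throughput_gain < max_throughput_gain and latency_growth > min_latency_growth:
            return vus
    return None  # no knee within the tested range


if __name__ == "__main__":
    # Hypothetical results from a stepped ramp.
    results = [(50, 480, 120), (100, 950, 130), (200, 1800, 160), (400, 1850, 420), (800, 1820, 1100)]
    print(find_knee(results))  # 400
```

The returned level only says where to look; the server-side data for that step is still needed to identify which resource saturated.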
Back-end load vs front-end performance
Two different disciplines, often confused:
| | Back-end load testing | Front-end web performance |
|---|---|---|
| Question | Does the server hold up under load? | Is the page fast for one user? |
| Tools | k6, JMeter, Gatling, Locust | Lighthouse, WebPageTest |
| Metrics | Throughput, p95 latency, error rate | Core Web Vitals (LCP, INP, CLS) |
| Scope | Many simulated users, no real browser | One real browser, render/paint timings |
Both matter: a server that scales perfectly still feels slow if the page renders poorly, and a fast page still fails if the API behind it falls over at 500 users. Test the layer that carries your risk — usually both.
The tool landscape
| Need | Tools |
|---|---|
| Code-scripted load (OSS) | k6 (JS), Gatling (Scala/Java), Locust (Python), Artillery (JS/YAML) |
| GUI / record-based load | JMeter |
| Quick HTTP benchmarking | Vegeta (constant-rate), wrk |
| Enterprise platforms | LoadRunner, NeoLoad, LoadView |
| Front-end / Core Web Vitals | Lighthouse, WebPageTest |
Quick performance testing checklist
- Test type matches the question (load / stress / soak / spike / volume)
- Targets/SLOs defined before the test (p95, error rate, throughput)
- A baseline measured at known low load
- Realistic traffic model — real journeys, think-time, varied data
- Load ramped gradually, then held at peak
- Results judged on percentiles (p95/p99) and error rate, not averages
- Bottleneck located by correlating with server-side observability
- Front-end performance (Core Web Vitals) covered where users feel it
- Soak run for leaks / spike run for surges where the system warrants it
- Results compared against the SLOs, with a clear pass/fail