Soak Testing — Long-Duration Stability — Performance Testing with K6

A soak test runs normal load for an extended period — 4 to 24 hours. It finds defects that are invisible in a 30-minute load test: memory that leaks slowly, database connections that are opened but never returned, log files that fill disks, caches that grow without eviction. These problems only appear when time passes.

The stable vs degrading signature

Soak test metric signatures

Metric	Stable system	Degrading system
http_req_duration p(95)	Consistent throughout: 180ms at hour 1, 185ms at hour 8	Slowly climbing: 180ms at hour 1, 320ms at hour 4, 1200ms at hour 8
http_req_failed rate	Near-zero throughout: 0.0%–0.1% across all hours	Initially zero, then slowly rising: 0% at hour 2, 2% at hour 6, 12% at hour 9 as OOM kills begin
Iteration count (rate)	Constant: 45 iterations/s throughout the test	Slowly falling: 45/s at start, 30/s at hour 5, 18/s at hour 8 as each VU waits longer
http_req_duration trend over time	Flat line in Grafana — no upward drift	Slow upward slope in Grafana — the classic memory leak or connection leak signature
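The difference between the two columns can be reduced to a single number: the slope of p(95) against elapsed time. The sketch below is plain JavaScript for analysing exported results, not part of K6. The function name `driftSlope` is an illustrative assumption, and the sample points are the figures from the table above.

```javascript
// Least-squares slope of p(95) latency against elapsed hours.
// points: array of { hour, p95 } taken from your metrics store.
function driftSlope(points) {
  const n = points.length;
  if (n < 2) return 0;
  const meanX = points.reduce((s, p) => s + p.hour, 0) / n;
  const meanY = points.reduce((s, p) => s + p.p95, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of points) {
    num += (p.hour - meanX) * (p.p95 - meanY);
    den += (p.hour - meanX) ** 2;
  }
  return den === 0 ? 0 : num / den; // ms of added latency per hour
}

// Figures from the "Stable system" and "Degrading system" columns.
const stable = [{ hour: 1, p95: 180 }, { hour: 8, p95: 185 }];
const degrading = [
  { hour: 1, p95: 180 },
  { hour: 4, p95: 320 },
  { hour: 8, p95: 1200 },
];

console.log('stable ms/hour:', driftSlope(stable).toFixed(1));
console.log('degrading ms/hour:', driftSlope(degrading).toFixed(1));
```

A slope below one millisecond per hour is effectively flat. A slope in the hundreds of milliseconds per hour is the degrading signature, even if every individual reading still looks tolerable early in the run.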

The soak test pattern

export const options = {
  stages: [
    { duration: '5m',  target: 50 },   // ramp up to normal load
    { duration: '8h',  target: 50 },   // hold for 8 hours
    { duration: '5m',  target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<1000'],  // allow higher threshold — focus is on drift, not absolute speed
    http_req_failed:   ['rate<0.05'],
  },
};

Run this overnight. Eight hours is enough to reveal most memory leaks and connection pool degradation patterns. For compliance testing or release candidate validation, 24 hours is more appropriate.
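Because the 8-hour and 24-hour runs differ only in the hold stage, one option is to build the options from a parameter instead of keeping two scripts. This is a minimal sketch: `buildSoakOptions`, `SOAK_HOLD`, and `SOAK_VUS` are hypothetical names, while `__ENV` and the `-e` flag are standard K6 features.

```javascript
// Builds the soak profile shown above with a configurable hold stage.
// hold: a K6 duration string such as '8h' or '24h'; vus: normal-load VU count.
function buildSoakOptions(hold, vus) {
  return {
    stages: [
      { duration: '5m', target: vus }, // ramp up to normal load
      { duration: hold, target: vus }, // hold at normal load
      { duration: '5m', target: 0 },   // ramp down
    ],
    thresholds: {
      http_req_duration: ['p(95)<1000'],
      http_req_failed: ['rate<0.05'],
    },
  };
}

// In a K6 script this could be wired up as follows (SOAK_HOLD and SOAK_VUS
// are hypothetical variable names):
//   export const options = buildSoakOptions(
//     __ENV.SOAK_HOLD || '8h',
//     Number(__ENV.SOAK_VUS) || 50,
//   );
// and started with: k6 run -e SOAK_HOLD=24h script.js
```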

What to monitor alongside K6

K6 metrics alone are not enough for a soak test. You need server-side visibility:

Application server:

Heap memory usage over time — a steady climb with no drops between GC cycles means a leak
GC pause duration — growing pause times indicate heap pressure
Active thread count — a slowly growing thread count means threads are not being returned to the pool

Database server:

Active connection count — should stay constant; a slow increase means connections are not being closed
Query queue depth — queries waiting for a connection; should be near zero

Infrastructure:

Disk usage on log volumes — log files that are not rotated fill disks
File descriptor count — open files and sockets that are not closed

When K6 shows p(95) climbing after 5 hours, correlate the timestamp with server metrics to identify whether the cause is heap pressure, connection pool exhaustion, or disk I/O slowdown.
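To correlate, you first need the timestamp at which drift began. The helper below is one way to find it from exported p(95) samples. It is a sketch under assumptions: `findDriftOnset` is a hypothetical name, the sample timestamps are made up, and "25% above the early baseline" is an arbitrary cutoff you would tune.

```javascript
// Returns the time of the first sample whose p(95) exceeds the baseline
// by the given factor, or null if none does.
// samples: [{ time, p95 }] in chronological order.
// baselineCount: how many leading samples define "normal".
function findDriftOnset(samples, baselineCount, factor) {
  const head = samples.slice(0, baselineCount);
  if (head.length === 0) return null;
  const baseline = head.reduce((s, x) => s + x.p95, 0) / head.length;
  const hit = samples
    .slice(baselineCount)
    .find((x) => x.p95 > baseline * factor);
  return hit ? hit.time : null;
}

// Made-up hourly readings for illustration only.
const samples = [
  { time: '2024-01-01T22:00:00Z', p95: 180 },
  { time: '2024-01-01T23:00:00Z', p95: 184 },
  { time: '2024-01-02T00:00:00Z', p95: 190 },
  { time: '2024-01-02T01:00:00Z', p95: 260 },
  { time: '2024-01-02T02:00:00Z', p95: 410 },
];

console.log(findDriftOnset(samples, 2, 1.25)); // prints 2024-01-02T01:00:00Z
```

The returned timestamp is the point to line up against heap, connection count, and disk graphs on the server side.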

Using a Trend metric to track drift

A custom Trend metric on the complete iteration duration lets you plot drift over time with Grafana's time series view:

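import { Trend } from 'k6/metrics';
import http from 'k6/http';
import { check, sleep } from 'k6';

const iterationDuration = new Trend('iteration_duration_ms');

// setup() runs once before the load starts; its return value is passed to
// every iteration as `data`. Assumption: the token comes from an environment
// variable named API_TOKEN (k6 run -e API_TOKEN=<token> script.js).
export function setup() {
  return { token: __ENV.API_TOKEN };
}

export default function (data) {
  const start = Date.now();

  const res = http.get('https://api.example.com/dashboard', {
    headers: { Authorization: `Bearer ${data.token}` },
    tags: { name: 'Dashboard' },
  });
  check(res, { 'dashboard ok': (r) => r.status === 200 });

  sleep(Math.random() * 2 + 1); // think time of 1 to 3 seconds

  // Wall-clock time for the whole iteration, including the think time above.
  iterationDuration.add(Date.now() - start);
}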

In Grafana, plot iteration_duration_ms as a time series with a 5-minute moving average. A flat line means the system is stable. An upward slope — even a gentle one — confirms degradation over time.
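The moving average matters because each iteration includes 1 to 3 seconds of random think time, which adds far more noise than a few hundred milliseconds of drift. Grafana does the smoothing for you; the sketch below only illustrates the idea in plain JavaScript. `movingAverage` is a hypothetical helper, and the window is counted in samples rather than minutes.

```javascript
// Trailing moving average over a fixed number of samples.
// Early positions average over however many samples exist so far.
function movingAverage(values, windowSize) {
  const out = [];
  let sum = 0;
  for (let i = 0; i < values.length; i++) {
    sum += values[i];
    if (i >= windowSize) sum -= values[i - windowSize]; // drop the oldest sample
    const count = Math.min(i + 1, windowSize);
    out.push(sum / count);
  }
  return out;
}

console.log(movingAverage([1, 2, 3, 4], 2)); // prints [ 1, 1.5, 2.5, 3.5 ]
```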

When to run a soak test

Soak tests are expensive (8+ hours of infrastructure time) and should be targeted:

Pre-release for memory-intensive features — new in-memory caches, new background workers, new batch jobs
After fixing a memory leak — verify the fix holds over time
When adding connection pooling — confirm connections are properly returned under sustained load
Before SLA commitments — if you are signing an SLA for 99.9% uptime under sustained load, run the test that covers it

Do not run soak tests for every PR. Run them when the change touches resource lifecycle: connection management, caching, file handles, background threads.

Interpreting results

Rising p(95) with stable error rate: Application is slowing down but not failing. Common cause: growing in-memory data structure (unbounded cache, accumulating audit log, growing event queue). Find the data structure; add eviction.

Rising error rate with stable p(95): Requests that complete are fast, but an increasing proportion time out or get connection refused. Common cause: connection pool exhausting over time as connections leak. Find where connections are opened without finally blocks or equivalent cleanup.

Falling iteration rate with rising p(95): Each VU takes longer per iteration, so fewer iterations per unit time. Classic memory pressure: as heap fills, GC pauses extend, everything slows. Fix: profile heap allocations and add eviction. Increasing heap size only delays the slowdown if the cause is a leak.

Stable metrics for 6 hours, then sudden collapse: A resource quota or limit was reached — log disk full, file descriptor limit hit, or a background job that runs nightly triggered resource contention. Correlate the timestamp with cron job schedules and infrastructure events.
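The first three patterns above amount to a small decision table, which can be written down as a lookup. This is an illustrative sketch only: `diagnose` is a hypothetical helper, the trend labels are assumptions, and the sudden-collapse pattern is left out because it depends on timing rather than trend direction.

```javascript
// Maps the direction of three K6 trends to the likely cause described above.
// Each field is 'rising', 'falling', or 'stable'.
function diagnose({ p95, errorRate, iterationRate }) {
  // Checked first: a falling iteration rate is the more specific signal.
  if (p95 === 'rising' && iterationRate === 'falling') {
    return 'memory pressure: profile heap allocations and check GC pauses';
  }
  if (p95 === 'rising' && errorRate === 'stable') {
    return 'growing in-memory structure: find it and add eviction';
  }
  if (errorRate === 'rising' && p95 === 'stable') {
    return 'connection leak: look for missing cleanup on error paths';
  }
  return 'no soak signature matched: inspect server-side metrics';
}

console.log(
  diagnose({ p95: 'stable', errorRate: 'rising', iterationRate: 'stable' }),
);
```

A real diagnosis still needs the server-side metrics; the lookup only narrows where to look first.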

⚠️ Common mistakes

Running soak tests against staging with small data volumes. A memory leak triggered by processing 10,000 database records per iteration will not appear if staging has 100 records. The defect emerges at data scale. Use production-volume data snapshots for soak tests.
Not monitoring server-side metrics. K6 tells you when http_req_duration started climbing. Without server-side metrics (memory, connections, GC), you do not know why. K6 data without APM data produces an incomplete diagnosis.
Setting the VU count too high for a soak test. A soak test runs at normal expected load — not peak. If you run a soak test at 2× normal load, you are combining a stress test with a soak test. The results are harder to interpret. Run separate tests for each concern.
Treating passing thresholds as "system is stable." Thresholds are evaluated against metrics aggregated over the whole run, not against the trend. A system that passes p(95)<1000 overall can still have a rising slope: hour 1 at 300ms and hour 8 at 950ms both fall under the threshold. Use time-series visualisation in Grafana to see the trend, not just the final aggregate.
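The last mistake can also be guarded against in the script itself. K6 supports thresholds on tagged sub-metrics, so one approach is to tag each request with the elapsed hour and set a threshold per hour. The sketch below assumes that approach: `hourBucket` and `hourlyThresholds` are hypothetical helpers, and the 400ms limit is an arbitrary example value.

```javascript
// Converts elapsed milliseconds into an hour-bucket tag value ('0', '1', ...).
function hourBucket(elapsedMs) {
  return String(Math.floor(elapsedMs / 3600000));
}

// Builds one tagged sub-metric threshold per hour, so a slow final hour
// fails the run even when the whole-run aggregate passes.
function hourlyThresholds(hours, p95LimitMs) {
  const thresholds = {};
  for (let h = 0; h < hours; h++) {
    thresholds[`http_req_duration{hour:${h}}`] = [`p(95)<${p95LimitMs}`];
  }
  return thresholds;
}

// Wiring inside a K6 script (sketch; url is whatever endpoint you test):
//   import exec from 'k6/execution';
//   export const options = { thresholds: hourlyThresholds(9, 400) };
//   http.get(url, {
//     tags: { hour: hourBucket(exec.instance.currentTestRunDuration) },
//   });

console.log(Object.keys(hourlyThresholds(9, 400)));
```

Nine buckets cover the 8-hour hold plus both 5-minute ramps. Keep the tag coarse: an hourly value adds only a handful of distinct series, whereas a per-minute tag would add hundreds.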

🎯 Practice task

Run a short soak test and measure drift. 45 minutes.

Use https://test.k6.io — Grafana's public K6 test endpoint.

Write a soak test with a short duration for practice: ramp to 5 VUs over 30s, hold for 10 minutes, ramp to 0 over 30s.
Add a custom Trend metric: const iterDuration = new Trend('iter_duration_ms'). Record Date.now() at the start of the default function and call iterDuration.add(Date.now() - start) at the end.
Add sleep(Math.random() * 2 + 1) inside the default function.
Add thresholds: iter_duration_ms: ['p(95)<5000'] and http_req_failed: ['rate<0.05'].
Run the test. Observe whether iter_duration_ms p(95) is stable or drifts during the 10-minute hold.
Look at the output every 2 minutes. Record http_reqs rate and http_req_duration avg at minutes 2, 4, 6, 8, and 10. Do they remain stable? This is the baseline reading you would compare against server-side memory and connection metrics in a production soak test.