Why trends matter
A single test run is a snapshot. It tells you whether the system met its SLAs today. A trend across multiple runs tells you whether performance is stable, improving, or slowly degrading, and it catches regressions that a single snapshot cannot see.
Trend patterns and what they mean
| Pattern | What you see in Grafana / comparison data | What to do |
|---|---|---|
| Stable | p(95) consistently 180–210ms across 30 daily runs. Normal variance. | No action needed. Document the baseline. Continue monitoring. |
| Sudden jump (regression) | p(95) was 200ms Monday, 580ms Tuesday. Threshold fails on Tuesday. | Bisect Tuesday's commits. The regression is in one PR. Revert or fix before production. |
| Slow climb (degradation) | p(95) goes 200ms → 220ms → 245ms → 280ms → 320ms over 5 weeks. Never crosses threshold. | This is the most dangerous pattern — no threshold trips, but performance is steadily worsening. Check for memory leaks, unbounded caches, growing tables. |
| Improvement | p(95) drops from 400ms to 180ms after an optimisation is deployed. | Update the baseline to lock in the improvement. Tighten the threshold to match the new normal. |
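The four patterns in the table can be told apart mechanically once each run's p95 is collected into a series. The following is a minimal sketch, not part of k6 or Grafana: classifyTrend is a hypothetical helper, and the 20% jump and 10% drift cut-offs are illustrative assumptions that would need tuning against your own run-to-run variance.

```javascript
// Hypothetical helper: classify a series of per-run p95 values (oldest first).
// The default cut-offs are illustrative assumptions, not k6 defaults.
function classifyTrend(p95Series, { jumpRatio = 1.2, driftRatio = 1.1 } = {}) {
  if (p95Series.length < 3) return 'insufficient data';

  const first = p95Series[0];
  const last = p95Series[p95Series.length - 1];

  // Largest run-to-run increase, expressed as a ratio of the previous run.
  let maxStep = 1;
  for (let i = 1; i < p95Series.length; i++) {
    maxStep = Math.max(maxStep, p95Series[i] / p95Series[i - 1]);
  }

  if (maxStep >= jumpRatio) return 'sudden jump';           // one run is far worse than the one before
  if (last / first >= driftRatio) return 'slow climb';      // no big step, but the total drift is large
  if (last / first <= 1 / driftRatio) return 'improvement'; // the series ends well below where it started
  return 'stable';
}

// The series from the table rows above:
console.log(classifyTrend([200, 205, 580]));            // sudden jump
console.log(classifyTrend([200, 220, 245, 280, 320]));  // slow climb
console.log(classifyTrend([400, 395, 180, 185]));       // improvement
console.log(classifyTrend([180, 205, 190, 210, 195]));  // stable
```

Checking the largest single step before the overall drift is a deliberate ordering: a sudden jump also raises the end-to-start ratio, so testing drift first would mislabel it as a slow climb.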
Storing baselines in git
The simplest approach: commit a JSON summary from a known-good run and compare future runs against it.
# Establish baseline
k6 run --summary-export=baselines/load-test-baseline.json tests/load-test.js
# Commit the baseline
git add baselines/load-test-baseline.json
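git commit -m "perf: update load test baseline (p95=210ms)"

The --summary-export flag writes the final aggregated metrics to a JSON file without streaming every individual sample. These are the same aggregates that handleSummary receives in its data parameter, but the layout differs: the export places values such as p(95) directly under each metric name, whereas handleSummary nests them under a values key.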
Comparing runs in CI
A GitHub Actions workflow that compares the current run against the committed baseline:
- name: Run load test
  uses: grafana/k6-action@v0.3.1
  with:
    filename: tests/load-test.js
  env:
    K6_SUMMARY_EXPORT: current-run.json
- name: Compare against baseline
  run: |
    node scripts/compare-baseline.js \
      baselines/load-test-baseline.json \
      current-run.json \
      --tolerance 0.20

The comparison script (scripts/compare-baseline.js) checks whether current metrics are within 20% of the baseline:
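const fs = require('fs');

const baseline = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));
const current = JSON.parse(fs.readFileSync(process.argv[3], 'utf8'));
// The flag and its value are separate argv entries: read the entry after --tolerance.
const flagIndex = process.argv.indexOf('--tolerance');
const parsed = flagIndex === -1 ? NaN : parseFloat(process.argv[flagIndex + 1]);
const tolerance = Number.isFinite(parsed) ? parsed : 0.20;
// http_reqs is a counter with no p(95), so only the duration trend is compared.
const metrics = ['http_req_duration'];
// --summary-export puts p(95) directly on the metric; handleSummary data nests it under values.
const p95 = (summary, metric) =>
  summary.metrics?.[metric]?.values?.['p(95)'] ?? summary.metrics?.[metric]?.['p(95)'];
let hasRegression = false;
for (const metric of metrics) {
  const baseP95 = p95(baseline, metric);
  const currP95 = p95(current, metric);
  if (baseP95 && currP95 && currP95 > baseP95 * (1 + tolerance)) {
    console.error(`REGRESSION: ${metric} p95 was ${baseP95.toFixed(0)}ms, now ${currP95.toFixed(0)}ms (${((currP95 / baseP95 - 1) * 100).toFixed(1)}% slower)`);
    hasRegression = true;
  }
}
process.exit(hasRegression ? 1 : 0);

Automated regression detection inside the K6 script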
Alternatively, embed the comparison inside handleSummary — no external script required:
import { textSummary } from 'https://jslib.k6.io/k6-summary/0.0.2/index.js';
import { htmlReport } from 'https://raw.githubusercontent.com/benc-uk/k6-reporter/main/dist/bundle.js';

const BASELINE_P95_MS = 250; // from last known-good run
const TOLERANCE = 1.20; // 20% regression tolerance

export function handleSummary(data) {
  const currentP95 = data.metrics['http_req_duration']?.values['p(95)'];
  const regressionThreshold = BASELINE_P95_MS * TOLERANCE;
  let regressionAlert;
  if (currentP95 === undefined) {
    // No p95 was recorded (for example, no requests completed): fail loudly instead of crashing on toFixed.
    regressionAlert = 'REGRESSION: http_req_duration p95 is missing from the summary';
  } else if (currentP95 > regressionThreshold) {
    regressionAlert = `REGRESSION: p95 is ${currentP95.toFixed(0)}ms, exceeds baseline ${BASELINE_P95_MS}ms + 20% tolerance (${regressionThreshold.toFixed(0)}ms)`;
  } else {
    regressionAlert = `OK: p95 is ${currentP95.toFixed(0)}ms, within baseline tolerance`;
  }
  return {
    'report.html': htmlReport(data),
    'regression-check.txt': regressionAlert,
    stdout: textSummary(data, { indent: ' ', enableColors: true }),
  };
}

In CI, read regression-check.txt and fail the pipeline if it starts with REGRESSION:
- name: Check for regression
  run: |
    if grep -q "^REGRESSION:" regression-check.txt; then
      cat regression-check.txt
      exit 1
    fi
    cat regression-check.txt

Trend dashboards in Grafana
When metrics are streamed to InfluxDB across multiple test runs, Grafana can show multi-run trend panels:
- "p95 over last 30 test runs" — each data point is one test run's p95. A flat line means stability; a rising line means degradation.
- "Throughput per build" — cross-reference with your deployment log to see whether each deploy maintained or changed RPS capacity.
- "Error rate across weekly stress tests" — weekly stress test results on one panel; spot when the breaking point VU count changes.
To distinguish runs on Grafana, add a test run identifier tag when streaming:
k6 run \
  --out influxdb=http://localhost:8086/k6 \
  --tag testRun=$(date +%Y%m%d-%H%M) \
  --tag gitSha=$(git rev-parse --short HEAD) \
  tests/load-test.js

The testRun and gitSha tags appear on every metric point, making it possible to filter and compare individual runs in Grafana.
When to update baselines
Baselines should be updated intentionally — not automatically overwritten on every passing run:
- After a performance improvement: update the baseline to lock in the gain, then tighten the threshold
- After a deliberate architectural change: a new database layer might change baseline latency; document and re-establish
- Never automatically: auto-updating baselines on every passing run defeats the purpose — a gradual regression never triggers because the baseline moves with it
Treat baseline updates like dependency version bumps: intentional, reviewed, and merged via pull request.
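To see why an auto-updated baseline hides gradual regressions, it can help to simulate both policies side by side. This is a sketch under assumed numbers: countFailures is a hypothetical function, and the 200ms starting p95, 5% slowdown per run, and 20% tolerance are made up for illustration.

```javascript
// Simulate a p95 that gets slower on every run and count how many runs
// fail the tolerance check under a fixed or an auto-updated baseline.
function countFailures({ startP95, growth, runs, tolerance, autoUpdate }) {
  let baseline = startP95;
  let current = startP95;
  let failures = 0;

  for (let run = 1; run <= runs; run++) {
    current *= growth; // this run is `growth` times slower than the previous one
    const passed = current <= baseline * (1 + tolerance);
    if (!passed) failures++;
    // Auto-update policy: every passing run silently becomes the new baseline.
    if (passed && autoUpdate) baseline = current;
  }
  return { failures, finalP95: Math.round(current) };
}

const scenario = { startP95: 200, growth: 1.05, runs: 10, tolerance: 0.2 };

console.log(countFailures({ ...scenario, autoUpdate: true }));
// { failures: 0, finalP95: 326 }
console.log(countFailures({ ...scenario, autoUpdate: false }));
// { failures: 7, finalP95: 326 }
```

Both policies end at the same 326ms p95, but only the fixed baseline reports it: the check starts failing on the fourth run, once the p95 passes 240ms.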
⚠️ Common mistakes
- Auto-updating baselines on every CI run. If your CI workflow updates the baseline file after every passing run, a 5% performance regression over 10 runs looks like 10 passing runs. Baselines must be updated manually and intentionally.
- Using averages instead of percentiles for baselines. A baseline p(95) of 200ms with a tolerance of 20% flags anything above 240ms. A baseline average of 150ms looks similar but misses tail latency — the worst 5% of users might be at 800ms. Always baseline on p95 or p99.
- Treating a threshold failure as the only signal. A slow climb of 5% per week stays below a threshold set 30% above baseline for five full weeks (1.05^5 ≈ 1.28) and only trips it in week six, by which point the drift is well established. Add trend visualisation (Grafana) alongside thresholds. Thresholds catch step changes; trends catch gradual drift.
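The gap between an average and a percentile is easy to check on a small data set. The sketch below uses made-up latencies and a simple nearest-rank percentile; k6 computes its own percentiles internally and may interpolate, so treat percentile here as an illustrative helper, not a reimplementation of k6.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of the data at or below it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function average(samples) {
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Hypothetical run: 90 fast requests at 120ms and 10 slow ones at 800ms.
const latenciesMs = [...Array(90).fill(120), ...Array(10).fill(800)];

console.log(average(latenciesMs));        // 188
console.log(percentile(latenciesMs, 95)); // 800
```

An average of 188ms looks healthy, yet one request in ten takes 800ms. A baseline built on the average would accept this run, while a p95 baseline would flag it.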
🎯 Practice task
Build a baseline comparison workflow. 35 minutes.
Use https://test.k6.io.
- Write a K6 script with vus: 10, duration: '2m'. Add handleSummary that writes current-run.json using JSON.stringify(data, null, 2).
- Run the test. Copy current-run.json to baselines/load-test-baseline.json. Examine the file and find the http_req_duration metric and its p(95) value.
- Write a Node.js script compare.js (runs outside K6) that:
  - Reads both JSON files
  - Extracts p(95) from each
  - Prints PASS or REGRESSION based on 20% tolerance
  - Exits with code 1 on regression
- Run node compare.js baselines/load-test-baseline.json current-run.json. It should report PASS (same run).
- Artificially modify the baseline to have a lower p95 (e.g., divide by 2). Run the comparison again and verify it reports REGRESSION.
- Add --tag testRun=$(date +%Y%m%d) to your K6 run command. Examine how the tag appears in the JSON output. Describe in a comment how you would use this tag in Grafana to filter for a specific run.