Performance Quality Gates with Thresholds

A quality gate is a pass/fail criterion that prevents a build from progressing when performance degrades. k6 thresholds are quality gates: they define acceptable performance, and when any threshold fails, k6 exits with code 99, a non-zero exit status that CI pipelines detect and act on without any custom parsing.

Thresholds in a CI context

A complete quality gate configuration

export const options = {
  vus: 50,
  duration: '10m',
  thresholds: {
    // SLA: overall latency
    'http_req_duration': ['p(95)<500', 'p(99)<1000'],
 
    // SLA: error rate
    'http_req_failed': ['rate<0.001'],   // less than 0.1%
 
    // Risk-driven: critical paths get stricter thresholds
    'http_req_duration{name:Login}':    ['p(99)<300'],
    'http_req_duration{name:Payment}':  ['p(95)<800', 'p(99)<1500'],
    'http_req_duration{name:Search}':   ['p(95)<400'],
 
    // Throughput gate: must sustain at least 100 RPS
    'http_reqs': ['rate>100'],
 
    // Custom business metric: at least 95% of orders must succeed
    'order_success_rate': ['rate>0.95'],
 
    // Check pass rate: 99%+ of assertions must hold
    'checks': ['rate>0.99'],
  },
};

When any of these thresholds fails, k6 exits with code 99. In GitHub Actions:

- name: Run load test
  uses: grafana/k6-action@v0.3.1
  with:
    filename: tests/load-test.js
  # The step fails automatically if k6 exits with 99 (thresholds failed)
  # No custom parsing needed: CI sees a non-zero exit code

The CI job fails. The PR is blocked. The deploy does not proceed.

Threshold philosophy: setting values that matter

Too lenient: p(95)<5000 (5 seconds). The test always passes. Developers do not trust it because they know it would pass even if the system were broken. Quality gates that never fail are not quality gates.

Too strict: p(95)<50 (50ms). The test always fails unless you are testing a local in-memory cache. Developers ignore the failure because it has nothing to do with their change. False positives destroy confidence in the gate.

Calibrated: Set thresholds 20–30% above your current measured p95 during steady-state load. If the system currently runs at p95=180ms, that range is 216–234 ms, so p(95)<230 is a reasonable gate. This catches regressions while tolerating normal variance. Update thresholds when performance genuinely improves.
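The calibration rule is plain arithmetic, so it can live next to the test as a small helper. The sketch below is illustrative only: calibratedThreshold and its headroom parameter are hypothetical names, not part of k6, and the 25% default is simply the midpoint of the 20–30% range.

```javascript
// Hypothetical helper (not a k6 API): turn a measured baseline into a
// k6 threshold expression with a fixed percentage of headroom.
function calibratedThreshold(measuredP95Ms, headroom = 0.25) {
  if (!(measuredP95Ms > 0)) {
    throw new RangeError('measuredP95Ms must be a positive number');
  }
  // Round to a whole millisecond so the expression stays readable.
  const limitMs = Math.round(measuredP95Ms * (1 + headroom));
  return `p(95)<${limitMs}`;
}

// Baseline from the text: p95 = 180 ms at steady state.
console.log(calibratedThreshold(180));      // p(95)<225 (25% headroom)
console.log(calibratedThreshold(180, 0.2)); // p(95)<216 (lower bound)
console.log(calibratedThreshold(180, 0.3)); // p(95)<234 (upper bound)
```

Because the helper is pure JavaScript, the returned string could be dropped straight into a thresholds array. The measured baseline should still be reviewed by a person before it is committed.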

Collaborative threshold ownership

Thresholds represent commitments. Effective quality gates are set collaboratively:

  • Product / business: defines user-facing SLAs ("checkout must respond in under 1 second 95% of the time")
  • Engineering: translates SLAs into measurable threshold expressions and validates feasibility
  • QA: writes the thresholds, maintains the test, ensures CI enforces them
  • All stakeholders: review threshold values in code review — they are as important as functional tests

Thresholds checked into version control, reviewed like code, and updated intentionally are more trustworthy than values set once and forgotten.

Tag-based thresholds for granular gates

Apply different quality standards to different endpoint categories using tags:

import http from 'k6/http';

export const options = {
  thresholds: {
    // Critical user-facing paths
    'http_req_duration{category:critical}': ['p(95)<200'],
 
    // Standard API endpoints
    'http_req_duration{category:standard}': ['p(95)<500'],
 
    // Reporting and export endpoints (users expect them to be slower)
    'http_req_duration{category:reports}': ['p(95)<3000'],
  },
};
 
export default function (data) {
  // Tag each request with its category
  http.get('https://api.example.com/health', {
    tags: { name: 'HealthCheck', category: 'critical' },
  });
 
  http.get('https://api.example.com/orders', {
    tags: { name: 'ListOrders', category: 'standard' },
  });
 
  http.post('https://api.example.com/reports/export', null, {
    tags: { name: 'ExportReport', category: 'reports' },
  });
}

A slow report export does not fail the critical-path threshold. The gates are independent.
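When the number of categories grows, one option is to keep the latency budgets in a single table and generate the tagged threshold keys from it. This is a minimal sketch, not a k6 feature: p95BudgetMs and categoryThresholds are hypothetical names, and the values simply mirror the example above.

```javascript
// Hypothetical SLA table: category tag -> p95 latency budget in ms.
const p95BudgetMs = {
  critical: 200,
  standard: 500,
  reports: 3000,
};

// Build one independent http_req_duration gate per category tag.
function categoryThresholds(budgets) {
  const thresholds = {};
  for (const [category, limitMs] of Object.entries(budgets)) {
    thresholds[`http_req_duration{category:${category}}`] = [`p(95)<${limitMs}`];
  }
  return thresholds;
}

// In a k6 script this object would be exported as `options`.
const options = { thresholds: categoryThresholds(p95BudgetMs) };
console.log(JSON.stringify(options, null, 2));
```

The table gives reviewers one place to see every budget, which makes threshold changes easy to spot in a diff.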

abortOnFail for load tests in CI

In CI, a load test running to completion while error rates are at 40% is wasteful — the result is clear before the test ends. Use abortOnFail to terminate early:

export const options = {
  stages: [
    { duration: '3m', target: 50 },
    { duration: '10m', target: 50 },
    { duration: '2m', target: 0 },
  ],
  thresholds: {
    http_req_failed: [{
      threshold: 'rate<0.05',
      abortOnFail: true,
      delayAbortEval: '2m',   // give system time to warm up first
    }],
    http_req_duration: [{
      threshold: 'p(95)<2000',
      abortOnFail: true,
      delayAbortEval: '3m',
    }],
  },
};

delayAbortEval: '2m' holds off abort evaluation for the first two minutes, so noisy samples early in the ramp-up cannot end the run. A CI test that aborts at minute 5 of a 15-minute run because of a real failure saves 10 minutes of CI time and still reports the failure correctly.
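An abort threshold is usually looser than the SLA gate, because its job is to stop a clearly broken run rather than to judge the final result. The sketch below pairs the two on the same metric. abortAfter is a hypothetical helper, not a k6 function, and it assumes k6 accepts string and object entries in the same threshold array, which is worth confirming against your k6 version.

```javascript
// Hypothetical helper (not a k6 API): wrap an expression in the object
// form that carries abortOnFail and delayAbortEval.
function abortAfter(expression, delay) {
  return { threshold: expression, abortOnFail: true, delayAbortEval: delay };
}

const thresholds = {
  http_req_failed: [
    'rate<0.001',                  // SLA gate, judged on the final result
    abortAfter('rate<0.05', '2m'), // circuit breaker for a broken run
  ],
  http_req_duration: [
    'p(95)<500',                    // SLA gate
    abortAfter('p(95)<2000', '3m'), // circuit breaker after ramp-up
  ],
};

// In a k6 script this object would go under `options.thresholds`.
console.log(JSON.stringify(thresholds, null, 2));
```

With this layout a mildly degraded run completes and fails on the strict gate, while a badly broken run stops early on the loose one.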

⚠️ Common mistakes

  • Thresholds that have never failed. If your threshold has never triggered a CI failure since you added it, it is probably too lenient. Review threshold values periodically against your current measured performance.
  • Missing delayAbortEval on CI thresholds. During ramp-up, error rates and latency are naturally elevated. Without a delay, abortOnFail terminates the test in the first 30 seconds before the system reaches steady state — a false positive.
  • No threshold on custom metrics. If you track order_success_rate as a custom metric but do not add a threshold, the CI job passes even if 50% of orders are failing. Every custom business metric that measures a KPI should have a threshold.
  • Setting the same threshold for all endpoints. A 2-second threshold on a login endpoint is unacceptable if users abandon your app after 1 second. Use tagged thresholds to match each endpoint's business priority.
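A threshold on order_success_rate only works if the script declares that metric and feeds it samples. The following is a minimal sketch of that wiring using the Rate class from k6/metrics. The /orders endpoint, the payload, and the 201 status check are assumptions for illustration, not details from a real API.

```javascript
import http from 'k6/http';
import { check } from 'k6';
import { Rate } from 'k6/metrics';

// Declare the custom metric; the name must match the threshold key.
const orderSuccessRate = new Rate('order_success_rate');

export const options = {
  vus: 5,
  duration: '1m',
  thresholds: {
    order_success_rate: ['rate>0.95'],
  },
};

export default function () {
  // Hypothetical endpoint and payload, for illustration only.
  const res = http.post(
    'https://api.example.com/orders',
    JSON.stringify({ sku: 'demo-1', quantity: 1 }),
    {
      headers: { 'Content-Type': 'application/json' },
      tags: { name: 'CreateOrder' },
    },
  );

  // check() returns true only when every assertion passes.
  const ok = check(res, {
    'order created': (r) => r.status === 201,
  });

  // Record one sample per order attempt; true counts toward the rate.
  orderSuccessRate.add(ok);
}
```

This runs under k6 rather than Node. The metric name passed to Rate and the key under thresholds have to be identical, otherwise the gate never sees the samples.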

🎯 Practice task

Design and implement quality gates for a multi-endpoint test. 35 minutes.

Use https://jsonplaceholder.typicode.com.

  1. Write a script with vus: 5, duration: '2m' that calls /posts (tagged category:read), /users (tagged category:read), and POST /posts (tagged category:write).
  2. Add these thresholds:
    • 'http_req_duration{category:read}': ['p(95)<400']
    • 'http_req_duration{category:write}': ['p(95)<600']
    • 'http_req_failed': ['rate<0.01']
    • 'checks': ['rate>0.99']
  3. Add checks to each request (status code, body content).
  4. Run and observe all thresholds pass. Note the actual p(95) values.
  5. Set the read threshold to p(95)<1 to force a failure. Run again and verify that the exit code is 99.
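  6. Add abortOnFail: true and delayAbortEval: '30s' to the failing threshold. This requires the object form: { threshold: 'p(95)<1', abortOnFail: true, delayAbortEval: '30s' }. Run again and observe whether the test terminates early or completes.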
  7. Reflect: what would your team's real SLA thresholds be for these endpoint categories? Write a comment in the script explaining the reasoning behind each threshold value.
