Q18 of 38 · CI/CD & DevOps
How would you implement zero-downtime testing during a production deployment?
Short answer
Smoke tests against the new version in a canary slice, synthetic monitoring for SLO-critical paths, and progressive rollout (1%→10%→100%) with auto-rollback on error-budget burn. Read-only tests run in prod safely; mutating tests need a tagged synthetic tenant.
Detail
Zero-downtime testing means you keep verifying behaviour while users are using the system, without breaking the system in the process.
Layer 1 — canary smoke.
- Deploy to 1-5% of traffic (canary slice).
- Run a focused smoke suite against that slice, targeting it via internal headers or routing rules.
- Watch error rate, latency, and business metrics for the canary vs. baseline.
- Auto-rollback or hold rollout if delta exceeds threshold.
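The canary-vs-baseline comparison can be sketched as a small decision function. This is a minimal illustration, not a reference implementation: the `SliceStats` type, the `canary_verdict` name, and every threshold value are assumptions chosen for the example, and real thresholds should come from your own SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceStats:
    """Aggregated metrics for one traffic slice over the comparison window."""
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    canary: SliceStats,
    baseline: SliceStats,
    max_error_rate_delta: float = 0.005,  # assumed: 0.5 percentage points
    max_latency_ratio: float = 1.2,       # assumed: p99 at most 20% slower
    min_requests: int = 500,              # assumed: minimum canary sample size
) -> str:
    """Return 'promote', 'hold', or 'rollback' for the canary slice."""
    if canary.requests < min_requests:
        return "hold"  # too little traffic to judge either way
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "rollback"
    if (
        baseline.p99_latency_ms > 0
        and canary.p99_latency_ms / baseline.p99_latency_ms > max_latency_ratio
    ):
        return "rollback"
    return "promote"
```

Comparing against the live baseline rather than a fixed number keeps the gate meaningful when the whole system is having a bad day for unrelated reasons.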
Layer 2 — synthetic monitoring.
- Continuous read-only probes against critical paths (login, search, checkout view) every 30-60 seconds.
- Tools: Datadog Synthetics, Checkly, Pingdom, Grafana Synthetic Monitoring.
- Pages on-call when they fail — fastest signal of user impact.
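A read-only probe of this kind can be sketched with the standard library alone. The names `run_probe`, `http_get_status`, and `ProbeResult`, the user-agent string, and the 2-second latency limit are all assumptions for illustration; the hosted tools listed above provide scheduling and alerting that this sketch leaves out.

```python
import time
from dataclasses import dataclass
from typing import Callable
from urllib.error import HTTPError
from urllib.request import Request, urlopen


@dataclass(frozen=True)
class ProbeResult:
    name: str
    ok: bool
    status: int  # 0 means the request never got an HTTP response
    latency_ms: float


def http_get_status(url: str, timeout_s: float = 5.0) -> int:
    """Issue a read-only GET and return the HTTP status code."""
    req = Request(url, method="GET", headers={"User-Agent": "synthetic-probe"})
    try:
        with urlopen(req, timeout=timeout_s) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # 4xx/5xx still carry a status worth recording


def run_probe(
    name: str,
    url: str,
    fetch: Callable[[str], int] = http_get_status,
    max_latency_ms: float = 2000.0,  # assumed latency limit
) -> ProbeResult:
    """Run one probe; healthy means a 2xx status within the latency limit."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except OSError:
        status = 0  # DNS failure, refused connection, timeout, etc.
    latency_ms = (time.monotonic() - start) * 1000
    ok = 200 <= status < 300 and latency_ms <= max_latency_ms
    return ProbeResult(name, ok, status, latency_ms)
```

A scheduler would call `run_probe` for each critical path every 30-60 seconds and page when `ok` is false. Passing `fetch` as a parameter lets the probe logic be tested without touching the network.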
Layer 3 — progressive rollout with SLO gating.
- 1% → wait 10 min → check SLOs → 10% → wait 30 min → 50% → 100%.
- Error-budget burn rate gates each step. If the last hour consumed more than 2% of the error budget, hold.
- Examples: Argo Rollouts, Flagger, LaunchDarkly + custom logic.
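The burn-rate gate comes down to a little arithmetic, sketched here. It reads the 2% gate as the share of the whole SLO period's error budget consumed in the last hour, and it assumes a 30-day SLO period and roughly uniform traffic. The function names and `ROLLOUT_STEPS` are hypothetical. Tools such as Argo Rollouts or Flagger express the same idea through their own analysis configuration.

```python
from typing import Sequence

ROLLOUT_STEPS = (1, 10, 50, 100)  # traffic percentages from the rollout plan


def budget_consumed(
    errors: int,
    requests: int,
    slo_target: float,
    window_hours: float,
    slo_period_hours: float = 30 * 24,  # assumed 30-day SLO period
) -> float:
    """Fraction of the period's error budget consumed during the window.

    Assumes traffic is roughly uniform across the SLO period.
    """
    if requests == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    burn_rate = (errors / requests) / allowed_error_rate
    return burn_rate * (window_hours / slo_period_hours)


def next_traffic_percent(
    steps: Sequence[int],
    current_index: int,
    consumed_last_hour: float,
    hold_threshold: float = 0.02,  # hold above 2% of budget per hour
) -> int:
    """Return the traffic percentage to apply once the current soak ends."""
    current = steps[current_index]
    if consumed_last_hour > hold_threshold:
        return current  # hold at the current step
    if current_index + 1 < len(steps):
        return steps[current_index + 1]
    return current  # already fully rolled out
```

For a 99.9% SLO, an error rate of 1.44% sustained for one hour is a burn rate of 14.4, which uses about 2% of a 30-day budget. That is why the one-hour window and the 2% threshold pair naturally.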
Layer 4 — production-safe mutating tests.
- Synthetic tenants: a tagged user/org that's known-test, never real-customer-facing. Their orders go to a sandbox payment provider, their data lives in a side schema, etc.
- Idempotent operations: a test that posts a known product creation request can verify behaviour as long as the post is idempotent and tagged so it's filterable from real data.
- Avoid: anything destructive in shared tables (DROP, mass UPDATE, anything that affects real records).
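The synthetic-tenant and idempotency ideas can be sketched together. Everything named here is an assumption for illustration: the tenant ID, the `X-Tenant-Id` and `X-Synthetic` headers, and the in-memory `FakeProductStore` that stands in for a real service. `Idempotency-Key` is a common header convention, but your API may use something else.

```python
import uuid
from dataclasses import dataclass, field

SYNTHETIC_TENANT_ID = "synthetic-tenant-001"  # hypothetical tagged test org


def idempotency_key(test_name: str, run_id: str) -> str:
    """Derive a stable key so a retried request maps to the same record."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"synthetic/{test_name}/{run_id}"))


def build_create_product_request(test_name: str, run_id: str) -> dict:
    """Build a tagged, idempotent product-creation request for a test run."""
    return {
        "headers": {
            "X-Tenant-Id": SYNTHETIC_TENANT_ID,
            "X-Synthetic": "true",
            "Idempotency-Key": idempotency_key(test_name, run_id),
        },
        "body": {"name": f"synthetic-{test_name}", "price_cents": 100},
    }


@dataclass
class FakeProductStore:
    """In-memory stand-in for the service, for illustration only."""
    records: dict = field(default_factory=dict)

    def create(self, request: dict) -> dict:
        key = request["headers"]["Idempotency-Key"]
        if key not in self.records:  # a replayed key returns the first result
            self.records[key] = {
                **request["body"],
                "tenant_id": request["headers"]["X-Tenant-Id"],
                "synthetic": request["headers"]["X-Synthetic"] == "true",
            }
        return self.records[key]

    def real_records(self) -> list:
        """Records with synthetic test data filtered out."""
        return [r for r in self.records.values() if not r["synthetic"]]
```

Sending the same request twice creates one record, and the `synthetic` tag keeps it out of anything that reports on real customer data.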
Layer 5 — fast rollback.
- The deploy system must support sub-minute rollback to the previous version. Without it, all the testing in the world doesn't help when something slips through.
- Rollback should be one command or one button — practised regularly so muscle memory exists when it's 3am.
Anti-pattern: heavy E2E suites running against production every 5 minutes. Test traffic adds load and noise that real users feel; production data accrues test-tenant junk; cleanup is never-ending. Use lightweight, focused synthetic monitoring instead.
Senior signal: layered approach, awareness that observability and rollback matter as much as testing, and synthetic-tenant pattern for mutating verification.