Q18 of 38 · CI/CD & DevOps

How would you implement zero-downtime testing during a production deployment?

CI/CD & DevOps · Senior · ci-cd · deployment · canary · synthetic · zero-downtime

Short answer

Run smoke tests against the new version in a canary slice, keep synthetic monitoring on SLO-critical paths, and roll out progressively (1% → 10% → 100%) with automatic rollback on error-budget burn. Read-only tests can run safely in production; mutating tests need a tagged synthetic tenant.

Detail

Zero-downtime testing means you keep verifying behaviour while users are using the system, without breaking the system in the process.

Layer 1 — canary smoke.

  • Deploy the new version and route 1-5% of traffic to it (the canary slice).
  • Run a small, focused smoke suite against that slice, targeting it with internal headers or routing rules.
  • Watch error rate, latency, and business metrics for the canary vs. baseline.
  • Auto-rollback or hold rollout if delta exceeds threshold.
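
The canary-versus-baseline check above can be sketched as a small decision function. This is a minimal illustration, not a specific tool's API: the SliceStats type, the canary_decision name, and every threshold (0.5 percentage points of error-rate delta, 1.2x p95 latency, 500 minimum requests) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceStats:
    """Aggregated metrics for one traffic slice over the same time window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    canary: SliceStats,
    baseline: SliceStats,
    max_error_delta: float = 0.005,   # assumed threshold: +0.5 percentage points
    max_latency_ratio: float = 1.2,   # assumed threshold: p95 at most 20% slower
    min_requests: int = 500,          # assumed minimum sample size
) -> str:
    """Return 'promote', 'hold', or 'rollback' for the canary slice."""
    # Too little canary traffic: no decision yet, keep the rollout where it is.
    if canary.requests < min_requests:
        return "hold"
    # Compare against the baseline, not an absolute number, so a bad day
    # that affects both versions does not trigger a rollback of the canary.
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    if (baseline.p95_latency_ms > 0
            and canary.p95_latency_ms / baseline.p95_latency_ms > max_latency_ratio):
        return "rollback"
    return "promote"
```

Business metrics (for example checkout conversion) would be compared the same way; they are left out here to keep the sketch short.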

Layer 2 — synthetic monitoring.

  • Continuous read-only probes against critical paths (login, search, checkout view) every 30-60 seconds.
  • Tools: Datadog Synthetics, Checkly, Pingdom, Grafana Synthetic Monitoring.
  • Page the on-call engineer when a probe fails; this is the fastest signal of user impact.
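
A hosted tool normally runs these probes, but the shape of a single read-only probe is simple enough to sketch with the standard library. The names run_probe, ProbeResult, and http_get, the User-Agent value, and the 1500 ms latency limit are all assumptions made for illustration. The fetch function is injectable so the probe logic can be exercised without touching the network.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Tuple

Fetch = Callable[[str, float], Tuple[int, str]]


def http_get(url: str, timeout_s: float) -> Tuple[int, str]:
    """Plain GET: read-only, so safe to send to production."""
    req = urllib.request.Request(url, headers={"User-Agent": "synthetic-probe"})
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


@dataclass(frozen=True)
class ProbeResult:
    name: str
    ok: bool
    latency_ms: float
    detail: str


def run_probe(
    name: str,
    url: str,
    must_contain: str,
    fetch: Fetch = http_get,
    timeout_s: float = 5.0,
    max_latency_ms: float = 1500.0,  # assumed latency limit for the example
) -> ProbeResult:
    """Check one critical path: reachable, status 200, expected content, fast enough."""
    start = time.monotonic()
    try:
        status, body = fetch(url, timeout_s)
    except Exception as exc:  # timeouts, DNS failures, HTTP 4xx/5xx raised by urllib
        elapsed = (time.monotonic() - start) * 1000
        return ProbeResult(name, False, elapsed, f"request failed: {exc}")
    elapsed = (time.monotonic() - start) * 1000
    if status != 200:
        return ProbeResult(name, False, elapsed, f"status {status}")
    if must_contain not in body:
        return ProbeResult(name, False, elapsed, "expected content missing")
    if elapsed > max_latency_ms:
        return ProbeResult(name, False, elapsed, "too slow")
    return ProbeResult(name, True, elapsed, "ok")
```

A scheduler would call run_probe for each critical path every 30-60 seconds and page on failure; requiring two consecutive failures before paging is a common way to reduce noise, though that policy is not part of the sketch.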

Layer 3 — progressive rollout with SLO gating.

  • 1% → wait 10 min → check SLOs → 10% → wait 30 min → 50% → 100%.
  • Error-budget burn rate gates each step. If the last hour consumed more than 2% of the error budget, hold.
  • Examples: Argo Rollouts, Flagger, LaunchDarkly + custom logic.
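
The burn-rate gate can be made concrete with a little arithmetic. The sketch below assumes a 30-day SLO period and roughly uniform traffic, so the share of budget spent in a window is burn rate × window length / period length. The names budget_consumed and run_rollout, the injected callbacks, and the 30-minute soak at 50% are assumptions for illustration; tools such as Argo Rollouts or Flagger express the same idea in their own configuration rather than through this code.

```python
from typing import Callable, Sequence, Tuple

SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


def budget_consumed(errors: int, requests: int, slo_target: float,
                    window_hours: float = 1.0) -> float:
    """Fraction of the whole-period error budget spent during the window."""
    if requests == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    burn_rate = (errors / requests) / allowed_error_rate
    return burn_rate * window_hours / SLO_PERIOD_HOURS


def run_rollout(
    steps: Sequence[Tuple[int, int]],               # (traffic percent, soak minutes)
    set_traffic: Callable[[int], None],             # hypothetical: shifts traffic
    last_hour_counts: Callable[[], Tuple[int, int]],  # hypothetical: (errors, requests)
    wait_minutes: Callable[[int], None],            # hypothetical: sleeps for the soak
    slo_target: float = 0.999,
    max_budget_fraction: float = 0.02,              # the 2% gate from the text
) -> str:
    """Advance step by step; stop at the first step that burns too much budget."""
    for percent, soak in steps:
        set_traffic(percent)
        wait_minutes(soak)
        errors, requests = last_hour_counts()
        if budget_consumed(errors, requests, slo_target) > max_budget_fraction:
            return f"held at {percent}%"
    return "completed"


# Steps from the text; the soak after 50% is an assumed value.
ROLLOUT_STEPS = [(1, 10), (10, 30), (50, 30), (100, 0)]
```

Under these assumptions, a 99.9% SLO and a 1.44% error rate over the last hour gives a burn rate of about 14.4, which is 2% of the 30-day budget in one hour: exactly the hold threshold.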

Layer 4 — production-safe mutating tests.

  • Synthetic tenants: a tagged user/org that's known-test, never real-customer-facing. Their orders go to a sandbox payment provider, their data lives in a side schema, etc.
  • Idempotent operations: a test that sends a known product-creation request can verify behaviour safely, provided the request is idempotent and tagged so it can be filtered out of real data.
  • Avoid: anything destructive in shared tables (DROP, mass UPDATE, anything that affects real records).
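
The synthetic-tenant and idempotency ideas can be combined in one request builder. Everything named here is hypothetical: the tenant ID, the X-Synthetic-Test and Idempotency-Key header names, and the in-memory FakeProductStore that stands in for the real service so the behaviour can be shown end to end.

```python
import hashlib
import json
from typing import Any, Dict, List

SYNTHETIC_TENANT_ID = "synthetic-tenant-001"  # hypothetical known-test tenant


def idempotency_key(tenant_id: str, operation: str, payload: Dict[str, Any]) -> str:
    """Same tenant + operation + payload always yields the same key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{tenant_id}|{operation}|{canonical}".encode()).hexdigest()


def build_test_request(operation: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Tag the request so the service and analytics can filter it out."""
    return {
        "tenant_id": SYNTHETIC_TENANT_ID,
        "headers": {
            "X-Synthetic-Test": "true",  # hypothetical header name
            "Idempotency-Key": idempotency_key(SYNTHETIC_TENANT_ID, operation, payload),
        },
        "body": payload,
    }


class FakeProductStore:
    """In-memory stand-in for the service, used only to illustrate the contract."""

    def __init__(self) -> None:
        self._by_key: Dict[str, Dict[str, Any]] = {}

    def create(self, request: Dict[str, Any]) -> Dict[str, Any]:
        key = request["headers"]["Idempotency-Key"]
        # A repeated key returns the original record instead of creating a new one.
        if key not in self._by_key:
            self._by_key[key] = {
                "id": len(self._by_key) + 1,
                "synthetic": request["headers"].get("X-Synthetic-Test") == "true",
                **request["body"],
            }
        return self._by_key[key]

    def real_records(self) -> List[Dict[str, Any]]:
        """What reports and customers should see: synthetic rows filtered out."""
        return [r for r in self._by_key.values() if not r["synthetic"]]
```

Because the key is derived from the payload, a test that retries after a timeout cannot create a second record, and the synthetic flag keeps the one record it did create out of real data.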

Layer 5 — fast rollback.

  • The deploy system must support sub-minute rollback to the previous version. Without it, all the testing in the world doesn't help when something slips through.
  • Rollback should be one command or one button — practised regularly so muscle memory exists when it's 3am.

Anti-pattern: heavy E2E suites running against production every 5 minutes. Real users are exposed to test-traffic noise, production data accrues test-tenant junk, and cleanup is never-ending. Use synthetic monitoring (lightweight, focused) instead.

Senior signal: layered approach, awareness that observability and rollback matter as much as testing, and synthetic-tenant pattern for mutating verification.

// WHAT INTERVIEWERS LOOK FOR

Canary + synthetic + progressive rollout + safe-mutate via synthetic tenant + fast rollback. The breadth of layers signals operational maturity.

// COMMON PITFALL

Running a big E2E suite against prod 'because we want production confidence'. The suite pollutes data, occasionally clicks real buttons, and ages into a maintenance nightmare.