How would you implement zero-downtime testing during a production deployment?

Question

Accepted Answer

Use smoke tests against the new version in a canary slice, synthetic monitoring for SLO-critical paths, and a progressive rollout (1%→10%→100%) with auto-rollback on error-budget burn. Read-only tests can run safely in production; mutating tests need a tagged synthetic tenant.

Zero-downtime testing means you keep verifying behaviour while users are on the system, without breaking it in the process.

Layer 1: canary smoke. Deploy to 1-5% of traffic (the canary slice). Run a focused smoke test fleet against that slice, targeting it via internal headers or routing rules. Watch error rate, latency, and business metrics for the canary against the baseline. Roll back automatically or hold the rollout if the delta exceeds a threshold.

Layer 2: synthetic monitoring. Run continuous read-only probes against critical paths (login, search, checkout view) every 30-60 seconds. Tools: Datadog Synthetics, Checkly, Pingdom, Grafana Synthetic Monitoring. Page on-call when a probe fails; this is the fastest signal of user impact.

Layer 3: progressive rollout with SLO ga
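The canary comparison in Layer 1 and the 1%→10%→100% progression can be sketched in a few lines of Python. This is a minimal illustration, not a production gate: the names `SliceMetrics`, `canary_decision`, and `progressive_rollout` are hypothetical, and the thresholds (0.5 percentage points of extra error rate, 1.2x p95 latency, 500 requests minimum) are assumed values that a real team would derive from its own SLOs. The `read_metrics` callback stands in for whatever metrics backend is in use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class SliceMetrics:
    """Aggregated metrics for one traffic slice over an observation window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    canary: SliceMetrics,
    baseline: SliceMetrics,
    max_error_delta: float = 0.005,   # assumed threshold, not from the source
    max_latency_ratio: float = 1.2,   # assumed threshold, not from the source
    min_requests: int = 500,          # assumed minimum sample size
) -> str:
    """Return 'promote', 'hold', or 'rollback' for the canary slice."""
    # Too little traffic to judge: hold rather than promote on noise.
    if canary.requests < min_requests:
        return "hold"
    # Error rate delta against the baseline exceeds the threshold.
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    # p95 latency regression relative to the baseline.
    if baseline.p95_latency_ms > 0 and (
        canary.p95_latency_ms / baseline.p95_latency_ms > max_latency_ratio
    ):
        return "rollback"
    return "promote"


def progressive_rollout(
    stages: Iterable[int],
    read_metrics: Callable[[int], Tuple[SliceMetrics, SliceMetrics]],
) -> Tuple[str, int]:
    """Walk the rollout stages, stopping at the first non-promote decision.

    read_metrics(percent) is a hypothetical hook that returns
    (canary, baseline) metrics once traffic sits at that percentage.
    Returns (outcome, percent at which the outcome was reached).
    """
    reached = 0
    for percent in stages:
        canary, baseline = read_metrics(percent)
        decision = canary_decision(canary, baseline)
        if decision != "promote":
            return decision, percent
        reached = percent
    return "complete", reached


if __name__ == "__main__":
    baseline = SliceMetrics(requests=100_000, errors=150, p95_latency_ms=170.0)
    healthy = SliceMetrics(requests=1_000, errors=2, p95_latency_ms=180.0)
    failing = SliceMetrics(requests=1_000, errors=30, p95_latency_ms=180.0)

    # Simulate a regression that only shows up at the 10% stage.
    outcome = progressive_rollout(
        (1, 10, 100),
        lambda pct: (failing if pct == 10 else healthy, baseline),
    )
    print(outcome)
```

Returning "hold" when the canary has seen too few requests keeps the gate from promoting or rolling back on noise, which matches the "auto-rollback or hold" choice described above.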

// WHAT INTERVIEWERS LOOK FOR

// COMMON PITFALL
