Disaster Recovery and Failover Testing


Every system will eventually fail. Hardware dies, software crashes, humans make mistakes, and occasionally a data centre floods. Disaster recovery (DR) is the plan for what happens when that occurs — and failover testing is how you verify the plan actually works before you need it. Systems without tested DR plans do not have DR plans; they have DR intentions.

What disaster recovery covers

DR addresses failures across a spectrum of severity:

Hardware failure — a disk fails, a server dies, a network switch stops forwarding packets. Expected to happen at scale; hardware redundancy and automatic failover should handle these without user impact.

Software failure — a critical bug crashes the application, a bad deployment corrupts database state, a dependency fails and takes the application with it. Rollback procedures, circuit breakers, and deployment safeguards exist to mitigate these.

Data loss or corruption — data is accidentally deleted, an update corrupts records, or a migration goes wrong. Backups and point-in-time recovery are the primary mitigation.

Infrastructure failure — a cloud availability zone goes down, a CDN has an outage, a DNS provider becomes unreachable. Multi-region deployment and geographic redundancy address these.

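Human error — often cited as the most common cause of serious incidents: a misconfiguration, a DELETE executed without a WHERE clause, or an infrastructure change that breaks a dependency. Access controls, peer review of changes, and the same backups and rollback procedures listed above limit the damage.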

RPO and RTO: the two metrics DR must satisfy

Two numbers define the acceptable limits for any DR scenario:

RPO (Recovery Point Objective) — how much data can the organisation afford to lose? An RPO of 1 hour means that if a failure occurs, it is acceptable to restore the system to the state it was in 1 hour ago — losing up to 1 hour of data changes. An RPO of 0 means no data loss is acceptable.

RTO (Recovery Time Objective) — how long can the system be unavailable? An RTO of 4 hours means the system must be restored within 4 hours of a failure. An RTO of 1 minute means near-instant recovery.

RPO drives backup and replication strategy. RTO drives infrastructure architecture. Together they define the DR tier the system needs.
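As a rough illustration of how these two numbers constrain a design, the sketch below checks a backup schedule against an RPO and a recovery procedure against an RTO. The function names, and the split of downtime into detection, restore, and validation phases, are assumptions made for this example rather than any standard definition.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next backup
    loses almost one full interval of changes."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(detect: timedelta, restore: timedelta, validate: timedelta,
              rto: timedelta) -> bool:
    # Hypothetical breakdown: downtime = time to notice + restore + verify.
    return detect + restore + validate <= rto


# Daily backups cannot satisfy a 1-hour RPO; hourly backups can.
print(meets_rpo(timedelta(hours=24), timedelta(hours=1)))  # False
print(meets_rpo(timedelta(hours=1), timedelta(hours=1)))   # True

# 10 min to detect + 3 h to restore + 30 min to validate fits a 4-hour RTO.
print(meets_rto(timedelta(minutes=10), timedelta(hours=3),
                timedelta(minutes=30), timedelta(hours=4)))  # True
```

The point of the arithmetic is that an RPO is only met if the gap between recovery points never exceeds it, and an RTO has to cover the whole outage, not just the restore step.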

DR strategy tiers

DR Tiers
  • Backup and Restore: RPO hours to days; RTO hours to days; cost lowest; method scheduled backups
  • Pilot Light: RPO minutes to hours; RTO minutes to hours; cost low; method minimal always-on standby
  • Warm Standby: RPO minutes; RTO under 1 hour; cost medium; method full but smaller standby system
  • Multi-Site Active/Active: RPO near zero; RTO near zero; cost highest; method live traffic split across sites

Backup and Restore is the simplest tier: take backups on a schedule (hourly, daily), store them durably, restore from backup when needed. Simple to implement, but restoration takes time — the RTO is measured in hours. Suitable for systems where hours of downtime and some data loss are acceptable costs.

Pilot Light keeps a minimal version of the system permanently running in a secondary region (typically just the database layer, replicated and ready). When the primary fails, the secondary environment is scaled up to full capacity. Because the data is already in place, recovery time is bounded by how long the scale-up takes: minutes to hours rather than hours to days.

Warm Standby runs a complete, smaller-scale version of the production system in a secondary location at all times. Traffic fails over to it quickly; it may need to scale up to handle full production load. RPO measured in minutes, RTO under an hour.

Multi-Site Active/Active runs full production capacity in multiple locations simultaneously, with live traffic distributed across all of them. Failure of one site results in traffic shifting to the others with near-zero disruption. This is the most resilient and most expensive architecture — appropriate for systems where any downtime is unacceptable.
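One way to use the four tiers is to pick the cheapest one that can still meet the system's targets. The sketch below does that with a small lookup table. The concrete "best achievable" figures per tier are illustrative assumptions chosen to fit the ranges described above; real values depend on the platform and should come from measured drills.

```python
from datetime import timedelta
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    best_rpo: timedelta  # assumed best achievable RPO for this tier
    best_rto: timedelta  # assumed best achievable RTO for this tier


# Ordered cheapest first. All figures are illustrative assumptions.
TIERS = [
    Tier("Backup and Restore", timedelta(hours=1), timedelta(hours=4)),
    Tier("Pilot Light", timedelta(minutes=15), timedelta(hours=1)),
    Tier("Warm Standby", timedelta(minutes=5), timedelta(minutes=15)),
    Tier("Multi-Site Active/Active", timedelta(0), timedelta(0)),
]


def cheapest_tier(rpo: timedelta, rto: timedelta) -> str:
    """Return the first (cheapest) tier whose assumed limits meet both targets."""
    if rpo < timedelta(0) or rto < timedelta(0):
        raise ValueError("RPO and RTO must be non-negative")
    for tier in TIERS:
        if tier.best_rpo <= rpo and tier.best_rto <= rto:
            return tier.name
    raise AssertionError("unreachable: the last tier accepts any target")


print(cheapest_tier(timedelta(days=1), timedelta(days=1)))         # Backup and Restore
print(cheapest_tier(timedelta(minutes=30), timedelta(hours=2)))    # Pilot Light
print(cheapest_tier(timedelta(minutes=5), timedelta(minutes=30)))  # Warm Standby
print(cheapest_tier(timedelta(0), timedelta(0)))                   # Multi-Site Active/Active
```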

What to test in DR

Backup restoration — the most commonly skipped test, and the most important. Backups that exist but cannot be restored are not backups. Test that restoration works by actually restoring to a test environment and verifying data integrity. Many organisations discover their backups are corrupt or incomplete only when they urgently need them.
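A minimal sketch of a restoration check, using SQLite from the Python standard library so that it runs anywhere. The idea is to record a fingerprint (row count plus checksum) of each table when the backup is taken, then compare the restored copy against it. The helper names and the `orders` table are hypothetical, and `Connection.backup` stands in here for whatever backup and restore tooling the real system uses.

```python
import hashlib
import sqlite3


def table_fingerprint(conn: sqlite3.Connection, table: str):
    """Return (row_count, SHA-256 of all rows ordered by the first column)."""
    digest = hashlib.sha256()
    count = 0
    # Table names come from the test's own config, never from user input.
    for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY 1'):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def build_manifest(conn, tables):
    """Fingerprint every table at the moment the backup is taken."""
    return {t: table_fingerprint(conn, t) for t in tables}


def verify_restore(restored, manifest):
    """Return the tables whose restored contents differ from the manifest."""
    return [t for t, expected in manifest.items()
            if table_fingerprint(restored, t) != expected]


# Demo: a tiny "production" database, backed up and restored elsewhere.
production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
production.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 25.0)])
production.commit()
manifest = build_manifest(production, ["orders"])

restored = sqlite3.connect(":memory:")
production.backup(restored)  # stand-in for restoring into a test environment

print(verify_restore(restored, manifest))  # []

# Simulate an incomplete restore: the check now flags the table.
restored.execute("DELETE FROM orders WHERE id = 2")
print(verify_restore(restored, manifest))  # ['orders']
```

Comparing against a manifest captured at backup time, rather than against live production, matters because production keeps changing after the backup is taken.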

Failover correctness — verify that failover to the secondary system actually occurs when the primary fails. Terminate the primary database node and observe that the replica is promoted and accepting traffic within the defined RTO. Kill the primary application server and verify the load balancer routes traffic to healthy instances.
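The observation step can be automated with a simple polling loop: after the primary is terminated, probe the service until it answers again and record how long that took. The sketch below assumes a caller-supplied `is_healthy` probe (hypothetical; in practice an HTTP health check or a test query) and uses a fake probe so the example runs without real infrastructure.

```python
import time


def measure_failover(is_healthy, rto_seconds, poll_interval=0.5):
    """Poll until the service is healthy again.

    Returns the elapsed seconds, or None if the RTO passes first.
    """
    start = time.monotonic()
    while time.monotonic() - start <= rto_seconds:
        if is_healthy():
            return time.monotonic() - start
        time.sleep(poll_interval)
    return None


def make_fake_probe(failures_before_recovery):
    """Hypothetical probe: fails a fixed number of times, then succeeds."""
    calls = {"n": 0}

    def probe():
        calls["n"] += 1
        return calls["n"] > failures_before_recovery

    return probe


# Replica "comes up" after three failed probes: failover within the RTO.
recovered_in = measure_failover(make_fake_probe(3), rto_seconds=1.0,
                                poll_interval=0.01)
print(recovered_in is not None)  # True

# Service never recovers: the measurement reports an RTO breach.
never = measure_failover(lambda: False, rto_seconds=0.05, poll_interval=0.01)
print(never)  # None
```

The measured time is only as precise as the polling interval, so the interval should be small relative to the RTO being verified.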

Data integrity post-failover — after failover, check that the data in the secondary system is complete and consistent. Verify that in-flight transactions at the time of failure were either completed or correctly rolled back — not silently dropped.
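One hedged way to check for silently dropped writes is to have the test client log every transaction ID the primary acknowledged, then reconcile that log against what the secondary holds after failover. The function and the ID format below are assumptions for illustration; how the two lists are collected depends on the system under test.

```python
from collections import Counter


def reconcile(acknowledged_ids, secondary_ids):
    """Compare client-acknowledged transactions with rows on the secondary.

    missing:    acknowledged to the client but absent after failover
    duplicated: present more than once after failover
    """
    counts = Counter(secondary_ids)
    missing = sorted(set(acknowledged_ids) - set(counts))
    duplicated = sorted(txn for txn, n in counts.items() if n > 1)
    return {"missing": missing, "duplicated": duplicated}


acknowledged = ["t1", "t2", "t3"]
on_secondary = ["t1", "t3", "t3"]

print(reconcile(acknowledged, on_secondary))
# {'missing': ['t2'], 'duplicated': ['t3']}
```

IDs found on the secondary that were never acknowledged are deliberately not treated as failures here: an in-flight transaction may legitimately have committed just before the client lost its connection.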

Failback — restoring the primary system after recovery and switching traffic back. Failback is often harder than failover and equally undertested.

Recovery time measurement — during every DR test, measure actual RTO and RPO against their defined targets. A failover that takes 45 minutes when the RTO is 30 minutes is a test failure that requires remediation.
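Recording each drill's measurements in a structured form makes the pass or fail decision mechanical. The sketch below is one possible shape for that record; the `DrillResult` class and its field names are hypothetical. The demo uses the 45-minute failover against a 30-minute RTO from the paragraph above, with an RPO figure added purely for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class DrillResult:
    name: str
    measured_rto: timedelta  # outage start until service restored
    measured_rpo: timedelta  # age of the newest data that survived


def evaluate_drill(result: DrillResult, target_rto: timedelta,
                   target_rpo: timedelta) -> List[str]:
    """Return a list of target breaches; an empty list means the drill passed."""
    failures = []
    if result.measured_rto > target_rto:
        failures.append(
            f"{result.name}: RTO {result.measured_rto} exceeds target {target_rto}")
    if result.measured_rpo > target_rpo:
        failures.append(
            f"{result.name}: RPO {result.measured_rpo} exceeds target {target_rpo}")
    return failures


drill = DrillResult("db-failover", measured_rto=timedelta(minutes=45),
                    measured_rpo=timedelta(minutes=2))
for failure in evaluate_drill(drill, target_rto=timedelta(minutes=30),
                              target_rpo=timedelta(minutes=5)):
    print(failure)
# db-failover: RTO 0:45:00 exceeds target 0:30:00
```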

DR drills and game days

A DR plan that has never been exercised is untested theory. Teams that run regular DR drills — deliberately failing systems, working through the recovery procedures, measuring the results — find two things consistently:

  1. The procedures do not work exactly as written.
  2. The people who need to execute them have never done it before under pressure.

Both findings are much better discovered in a planned drill than during an actual incident. Game days — structured exercises where the team simulates a production failure — are the practice that ensures DR plans are executable, not just documented.

The classic example: a major bank conducted its first DR drill in years and discovered that the failover systems worked correctly — but customers could not log in after failover because the authentication service depended on a configuration file that existed only in the primary data centre. The failover worked; the recovery did not. The drill found the gap.

QA's role in DR testing

In most organisations, DR is primarily SRE and DevOps territory. QA still contributes in specific ways:

  • Verifying user-visible behaviour during failures — do users see appropriate, clear error messages when services are unavailable, or do they see raw exceptions and stack traces?
  • Testing data integrity after recovery — verifying that records are complete, consistent, and not duplicated after a failover event.
  • Validating monitoring and alerting — confirming that failures are detected and reported within the defined time, that alerts go to the right people, and that dashboards show accurate system state during recovery.
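The first of these checks is easy to automate during a failover test: capture the response bodies users receive while the service is degraded and scan them for signs of leaked internals. The patterns below are a heuristic starting point and an assumption of this example, not an exhaustive list; they would need tuning for the languages and frameworks actually in use.

```python
import re

# Heuristic markers of raw exceptions in a user-facing response (assumed list).
LEAK_PATTERNS = [
    re.compile(r"Traceback \(most recent call last\)"),   # Python traceback header
    re.compile(r"^\s+at [\w.$<>]+\(.*\)", re.MULTILINE),  # Java/JS-style stack frame
    re.compile(r"\b\w+(Exception|Error):"),               # e.g. "SQLException:"
]


def leaks_internals(body: str) -> bool:
    """Return True if the response body looks like a raw exception or stack trace."""
    return any(pattern.search(body) for pattern in LEAK_PATTERNS)


friendly = "We're having trouble right now. Please try again in a few minutes."
raw_python = ('Traceback (most recent call last):\n'
              '  File "app.py", line 10, in handler\n'
              'ConnectionError: db unreachable')
raw_java = ("java.sql.SQLException: connection refused\n"
            "    at com.example.Db.connect(Db.java:42)")

print(leaks_internals(friendly))    # False
print(leaks_internals(raw_python))  # True
print(leaks_internals(raw_java))    # True
```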

⚠️ Common mistakes

  • Never testing backup restoration. The most important and most neglected DR test. A backup that has never been restored is unverified. Schedule regular restoration tests to a clean environment.
  • Running DR drills in isolation from production-equivalent load. A failover that works cleanly with zero traffic may fail under production load — the secondary system may not have enough capacity. Run failover tests under representative load.
  • Treating DR as a one-time setup, not an ongoing practice. Architecture changes, new services, and software updates can break DR configurations without anyone noticing. Quarterly drills catch these regressions before they become incidents.

🎯 Practice task

Design a DR test for a system you are responsible for or familiar with.

  1. Define RPO and RTO for the system. If they are not documented, propose values based on the business impact of downtime and data loss.
  2. Identify which DR tier the current architecture falls into. Does the architecture match the RPO/RTO requirements?
  3. Design one restoration test: what backup would you restore, to what environment, and how would you verify that the restored data is complete and correct?
  4. Design one failover test: what failure would you simulate, how would you measure RTO, and what user-visible behaviour would you validate during and after the failover?

This plan can be presented to the SRE or DevOps team as a QA contribution to DR readiness — QA perspectives on user experience during recovery and data integrity after recovery are often missing from purely infrastructure-focused DR plans.
