Disaster Recovery Checklist for Cloud Infrastructure
Hope is not a strategy. Every cloud deployment needs a tested disaster recovery plan. This checklist covers the essential elements of DR planning and execution.
RTO/RPO Definition
Key Concepts
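- RTO (Recovery Time Objective): How long can you be down? The maximum acceptable time between the start of an outage and restored service.
- RPO (Recovery Point Objective): How much data can you lose? The maximum acceptable window of lost data, measured as time before the failure.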
Define Per-System Requirements
| System | RTO | RPO | DR Strategy |
|---|---|---|---|
| Payment processing | 15 minutes | 0 (no data loss) | Active-active |
| Customer-facing app | 1 hour | 5 minutes | Hot standby |
| Internal tools | 4 hours | 1 hour | Warm standby |
| Analytics | 24 hours | 24 hours | Backup/restore |
See our active-active vs active-passive guide for implementation details.
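Per-system targets are easier to enforce when they live as data that a drill report can be checked against. The following is a minimal sketch, assuming the values from the table above; the `RecoveryTarget` class and `check_drill` function are hypothetical names used for illustration, not part of any library.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    system: str
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data
    strategy: str


# Values copied from the per-system requirements table.
TARGETS = {
    t.system: t
    for t in [
        RecoveryTarget("Payment processing", timedelta(minutes=15), timedelta(0), "Active-active"),
        RecoveryTarget("Customer-facing app", timedelta(hours=1), timedelta(minutes=5), "Hot standby"),
        RecoveryTarget("Internal tools", timedelta(hours=4), timedelta(hours=1), "Warm standby"),
        RecoveryTarget("Analytics", timedelta(hours=24), timedelta(hours=24), "Backup/restore"),
    ]
}


def check_drill(target, measured_downtime, measured_data_loss):
    """Return a list of objective violations for one drill result."""
    violations = []
    if measured_downtime > target.rto:
        violations.append(
            f"{target.system}: downtime {measured_downtime} exceeds RTO {target.rto}"
        )
    if measured_data_loss > target.rpo:
        violations.append(
            f"{target.system}: data loss {measured_data_loss} exceeds RPO {target.rpo}"
        )
    return violations


if __name__ == "__main__":
    # Hypothetical drill result: 45 minutes down, 10 minutes of data lost.
    result = check_drill(
        TARGETS["Customer-facing app"],
        measured_downtime=timedelta(minutes=45),
        measured_data_loss=timedelta(minutes=10),
    )
    for line in result:
        print(line)
```

Keeping the targets in one structure means the same numbers drive both the planning table and the pass/fail judgment after each test.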
Backup Strategy Checklist
☐ Database Backups
- ☐ Automated daily backups enabled
- ☐ Point-in-time recovery configured
- ☐ Cross-region backup replication
- ☐ Backup retention meets compliance requirements
- ☐ Backup encryption enabled
☐ Application Data
- ☐ S3 cross-region replication for critical buckets
- ☐ EBS snapshots with appropriate frequency
- ☐ EFS backup with AWS Backup
- ☐ Configuration data backed up (not just in repo)
☐ Infrastructure as Code
- ☐ All infrastructure defined in Terraform/CloudFormation
- ☐ IaC stored in version control
- ☐ Ability to deploy to another region
- ☐ Secrets management in place (not in code)
Network DR Checklist
☐ DNS Configuration
- ☐ Low TTL for critical records (5 minutes or less)
- ☐ Health checks configured for failover records
- ☐ Secondary DNS provider for zone redundancy
- ☐ DNS failover tested and documented
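The TTL and health check settings above together bound how quickly DNS failover can take effect. As a rough estimate, assuming clients and resolvers honor the TTL, the worst case is the time to detect the failure (check interval times the number of consecutive failures required) plus one full TTL for cached records to expire. The function and parameter names below are hypothetical, and the example values are illustrative rather than recommendations.

```python
def worst_case_dns_failover_seconds(ttl_s, check_interval_s, failure_threshold):
    """Estimate the upper bound on DNS failover time in seconds.

    Assumes resolvers honor the record TTL and ignores any delay in
    propagating the health check result inside the DNS provider.
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + ttl_s


if __name__ == "__main__":
    # A 5-minute TTL with checks every 30 seconds and 3 failures required.
    print(worst_case_dns_failover_seconds(ttl_s=300, check_interval_s=30, failure_threshold=3))
    # The same checks with a 60-second TTL.
    print(worst_case_dns_failover_seconds(ttl_s=60, check_interval_s=30, failure_threshold=3))
```

Comparing this estimate against each system's RTO shows whether DNS failover alone can meet it, or whether a lower TTL or faster health checks are needed.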
☐ Load Balancer Failover
- ☐ Health check thresholds appropriate
- ☐ Target deregistration delay configured
- ☐ Cross-zone load balancing enabled
- ☐ Multi-region strategy defined
☐ VPN/Direct Connect
- ☐ Redundant VPN connections
- ☐ VPN backup for Direct Connect
- ☐ BGP failover tested
- ☐ On-premises routing to DR region configured
Compute DR Checklist
☐ Container/Kubernetes
- ☐ Container images in multiple regions
- ☐ Kubernetes manifests in version control
- ☐ DR cluster pre-provisioned or deployable
- ☐ Secrets synced to DR region
☐ EC2/VMs
- ☐ AMIs copied to DR region
- ☐ Launch templates ready in DR region
- ☐ Auto Scaling groups configured
- ☐ User data scripts tested in DR region
☐ Serverless
- ☐ Lambda functions deployed to DR region
- ☐ API Gateway configured in DR region
- ☐ Environment variables and secrets available
Data Replication Checklist
☐ Database Replication
- ☐ Cross-region read replica configured
- ☐ Replication lag monitored and alerted
- ☐ Promotion procedure documented
- ☐ Connection string switchover planned
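Replication lag matters because, with asynchronous replication, the lag at the moment of failure is roughly the data you lose on promotion. One way to alert on it is to compare the lag against the system's RPO. This is a minimal sketch under that assumption; `lag_status` and the `warn_fraction` default are hypothetical, and the lag value would come from whatever metric your database exposes.

```python
def lag_status(lag_seconds, rpo_seconds, warn_fraction=0.5):
    """Classify replication lag relative to the RPO.

    Returns "critical" when lag already exceeds the RPO, "warn" when it
    has consumed more than warn_fraction of the RPO, and "ok" otherwise.
    """
    if lag_seconds > rpo_seconds:
        return "critical"
    if lag_seconds > warn_fraction * rpo_seconds:
        return "warn"
    return "ok"


if __name__ == "__main__":
    rpo = 5 * 60  # 5-minute RPO, as for the customer-facing app
    for lag in (30, 200, 301):
        print(lag, lag_status(lag, rpo))
```

Note that with an RPO of zero any measurable lag is classified as critical, which reflects that an asynchronous replica alone cannot guarantee zero data loss.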
☐ Cache/Session Data
- ☐ Redis/ElastiCache global datastore (if needed)
- ☐ Session data in database (not local cache)
- ☐ Cart/checkout data replicated
☐ File Storage
- ☐ S3 cross-region replication enabled
- ☐ Replication rules cover all critical buckets
- ☐ Replication metrics monitored
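Checking that replication rules cover every critical bucket comes down to a set difference between the buckets you consider critical and the buckets that actually have replication configured. The sketch below assumes both lists are already available as plain names (for example, exported from your inventory or IaC state); the function and the bucket names are hypothetical.

```python
def unreplicated_buckets(critical, replicated):
    """Return critical buckets that have no replication configured, sorted by name."""
    return sorted(set(critical) - set(replicated))


if __name__ == "__main__":
    # Hypothetical bucket names for illustration only.
    critical = ["orders-data", "user-uploads", "invoices"]
    replicated = ["orders-data", "invoices", "build-cache"]
    print(unreplicated_buckets(critical, replicated))
```

Running a check like this on a schedule catches new critical buckets that were created without a replication rule.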
Runbook Checklist
☐ Failover Runbook Includes:
- ☐ Decision criteria (when to fail over)
- ☐ Authorization process (who approves)
- ☐ Step-by-step failover procedure
- ☐ Verification steps (how to confirm success)
- ☐ Communication template (customer notification)
- ☐ Rollback procedure
☐ Contact Information
- ☐ On-call rotation current
- ☐ Escalation path defined
- ☐ Vendor support contacts (AWS, etc.)
- ☐ Executive notification list
Testing Checklist
☐ Regular Testing
- ☐ Quarterly failover drills scheduled
- ☐ Backup restore tested monthly
- ☐ Runbooks reviewed after each test
- ☐ Test results documented
☐ DR Test Scenarios
- ☐ Single AZ failure
- ☐ Full region failure
- ☐ Database corruption (point-in-time restore)
- ☐ DNS provider failure
- ☐ Third-party service outage
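The testing cadence above (quarterly failover drills, monthly restore tests) is easy to let slip, so it helps to check it automatically. The following is a minimal sketch, assuming you record the date each test last ran; the `CADENCE` mapping, the test names, and the day counts used to approximate a quarter and a month are all assumptions for illustration.

```python
from datetime import date, timedelta

# Maximum allowed age of the most recent run, per test type.
# 92 and 31 days approximate "quarterly" and "monthly".
CADENCE = {
    "failover_drill": timedelta(days=92),
    "backup_restore": timedelta(days=31),
}


def overdue_tests(last_run, today):
    """Return the names of tests that have never run or are past their cadence."""
    overdue = []
    for name, max_age in CADENCE.items():
        last = last_run.get(name)
        if last is None or today - last > max_age:
            overdue.append(name)
    return overdue


if __name__ == "__main__":
    # Hypothetical last-run dates.
    last_run = {
        "failover_drill": date(2024, 1, 10),
        "backup_restore": date(2024, 3, 1),
    }
    print(overdue_tests(last_run, today=date(2024, 5, 1)))
```

A test that has never been recorded counts as overdue, which matches the point that an untested plan should not be trusted.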
Recovery Procedure Template
## DR Event: [Event Type]
## Severity: [Critical/High/Medium]
## Date: [Date/Time UTC]
### 1. Detection
- [ ] Alert received: [Time]
- [ ] Initial assessment completed
- [ ] Severity confirmed
### 2. Decision
- [ ] Failover decision made by: [Name]
- [ ] Time of decision: [Time]
- [ ] Reason: [Brief description]
### 3. Failover Execution
- [ ] DNS failover initiated
- [ ] Database promotion started
- [ ] Application traffic redirected
- [ ] Verification tests passed
### 4. Communication
- [ ] Status page updated
- [ ] Customer notification sent
- [ ] Executive team notified
### 5. Post-Recovery
- [ ] Primary region status monitored
- [ ] Failback plan prepared
- [ ] Post-incident review scheduled
Key Takeaways
- Define RTO/RPO for each system rather than applying one target to everything
- Test backups by actually restoring them
- Document runbooks and keep them updated
- Test failover regularly, not just once
- Include third-party dependencies in DR planning