Disaster Recovery Checklist for Cloud Infrastructure

Hope is not a strategy. Every cloud deployment needs a tested disaster recovery (DR) plan. This checklist covers the essential elements of DR planning and execution.

RTO/RPO Definition

Key Concepts

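RTO (Recovery Time Objective): How long can you be down? This is the maximum acceptable time between the start of an outage and restored service.
RPO (Recovery Point Objective): How much data can you lose? This is measured as a window of time: an RPO of 5 minutes means losing at most the last 5 minutes of writes.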

Define Per-System Requirements

System	RTO	RPO	DR Strategy
Payment processing	15 minutes	0 (no data loss)	Active-active
Customer-facing app	1 hour	5 minutes	Hot standby
Internal tools	4 hours	1 hour	Warm standby
Analytics	24 hours	24 hours	Backup/restore
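
The targets in the table can also be kept as data, so that drill results are checked against them automatically. The following is a minimal sketch under that assumption: the `DrTarget` class, the `meets_targets` function, and the system keys are hypothetical names chosen for illustration, and only the numbers come from the table.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrTarget:
    """Recovery targets for one system, mirroring one row of the table."""
    rto: timedelta   # maximum acceptable downtime
    rpo: timedelta   # maximum acceptable window of lost data
    strategy: str


# Values copied from the table; the dictionary keys are illustrative.
TARGETS = {
    "payment-processing": DrTarget(timedelta(minutes=15), timedelta(0), "active-active"),
    "customer-facing-app": DrTarget(timedelta(hours=1), timedelta(minutes=5), "hot standby"),
    "internal-tools": DrTarget(timedelta(hours=4), timedelta(hours=1), "warm standby"),
    "analytics": DrTarget(timedelta(hours=24), timedelta(hours=24), "backup/restore"),
}


def meets_targets(system: str, downtime: timedelta, data_loss: timedelta) -> bool:
    """Return True if a measured drill result is within the system's RTO and RPO."""
    target = TARGETS[system]
    return downtime <= target.rto and data_loss <= target.rpo
```

A drill that restored internal tools in 3 hours with 30 minutes of lost data would pass. Any measurable data loss fails payment processing, because its RPO is zero.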

See our active-active vs active-passive guide for implementation details.

Backup Strategy Checklist

☐ Database Backups

☐ Automated daily backups enabled
☐ Point-in-time recovery configured
☐ Cross-region backup replication
☐ Backup retention meets compliance requirements
☐ Backup encryption enabled
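
Backup frequency and RPO are linked: if a backup is the only recovery source, the potential data loss equals the age of the newest backup. The sketch below illustrates that arithmetic. The function name and the timestamps are hypothetical, and it deliberately ignores point-in-time recovery, which shortens the loss window.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_within_rpo(last_backup_at: datetime, rpo: timedelta,
                      now: Optional[datetime] = None) -> bool:
    """Return True if restoring the latest backup would lose no more than `rpo` of data.

    Assumes the backup is the only recovery source (no point-in-time recovery),
    so the potential data loss equals the age of the newest backup.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_backup_at <= rpo


# Hypothetical timestamps: the nightly backup ran 10 hours before "now".
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
nightly = datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc)

print(backup_within_rpo(nightly, timedelta(hours=24), now))  # True: fits a 24-hour RPO
print(backup_within_rpo(nightly, timedelta(hours=1), now))   # False: too old for a 1-hour RPO
```

Under these assumptions, daily backups alone satisfy only the 24-hour RPO tier. Tighter targets need point-in-time recovery or replication.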

☐ Application Data

☐ S3 cross-region replication for critical buckets
☐ EBS snapshots taken often enough to meet each volume's RPO
☐ EFS backup with AWS Backup
☐ Configuration data backed up (not just in repo)

☐ Infrastructure as Code

☐ All infrastructure defined in Terraform/CloudFormation
☐ IaC stored in version control
☐ Ability to deploy to another region
☐ Secrets management in place (not in code)

Network DR Checklist

☐ DNS Configuration

☐ Low TTL for critical records (5 minutes or less)
☐ Health checks configured for failover records
☐ Secondary DNS provider for zone redundancy
☐ DNS failover tested and documented
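
TTL and health check settings together bound how quickly DNS failover takes effect. As a rough sketch, the worst case is the time to detect the failure (check interval multiplied by the number of consecutive failures required) plus one full TTL for cached answers to expire. The function name and the example settings below are assumptions for illustration, and resolvers that ignore TTLs are not modeled.

```python
def worst_case_dns_failover_seconds(check_interval_s: int,
                                    failure_threshold: int,
                                    ttl_s: int) -> int:
    """Upper bound on the time before clients resolve to the standby.

    detection: consecutive failed checks needed to mark the endpoint unhealthy
    propagation: time for cached answers to expire (one full TTL at worst)
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + ttl_s


# Hypothetical settings: 30-second checks, 3 failures to trip, 5-minute TTL.
print(worst_case_dns_failover_seconds(30, 3, 300))  # 390 seconds, i.e. 6.5 minutes

# The same health checks with a 60-second TTL.
print(worst_case_dns_failover_seconds(30, 3, 60))   # 150 seconds
```

With a 5-minute TTL, DNS alone can consume almost half of a 15-minute RTO. That is an argument for shorter TTLs on the most critical records.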

☐ Load Balancer Failover

☐ Health check interval and failure thresholds tuned to detect failures within the RTO
☐ Target deregistration delay configured
☐ Cross-zone load balancing enabled
☐ Multi-region strategy defined

☐ VPN/Direct Connect

☐ Redundant VPN connections
☐ VPN backup for Direct Connect
☐ BGP failover tested
☐ On-premises routing to DR region configured

Compute DR Checklist

☐ Container/Kubernetes

☐ Container images in multiple regions
☐ Kubernetes manifests in version control
☐ DR cluster pre-provisioned or deployable
☐ Secrets synced to DR region

☐ EC2/VMs

☐ AMIs copied to DR region
☐ Launch templates ready in DR region
☐ Auto Scaling groups configured in DR region
☐ User data scripts tested in DR region

☐ Serverless

☐ Lambda functions deployed to DR region
☐ API Gateway configured in DR region
☐ Environment variables and secrets available

Data Replication Checklist

☐ Database Replication

☐ Cross-region read replica configured
☐ Replication lag monitored and alerted
☐ Promotion procedure documented
☐ Connection string switchover planned
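
Replication lag is the data you would lose if you promoted the replica right now, so it makes sense to alert on it relative to the RPO. The sketch below shows one possible classification. The function name, the status labels, and the 50% warning threshold are assumptions, not values from any particular monitoring tool.

```python
def lag_status(lag_seconds: float, rpo_seconds: float,
               warn_fraction: float = 0.5) -> str:
    """Classify replication lag against the RPO.

    "breach": promoting the replica now would lose more data than the RPO allows
    "warn":   lag has used more than `warn_fraction` of the RPO budget
    "ok":     lag is comfortably inside the RPO
    """
    if lag_seconds > rpo_seconds:
        return "breach"
    if lag_seconds > rpo_seconds * warn_fraction:
        return "warn"
    return "ok"


# A 5-minute RPO (300 seconds) with different observed lags.
print(lag_status(100, 300))  # ok
print(lag_status(200, 300))  # warn
print(lag_status(301, 300))  # breach

# With an RPO of zero, any lag at all is a breach.
print(lag_status(0.1, 0))    # breach
```

The last case shows why a zero RPO usually rules out asynchronous replication: any nonzero lag already exceeds the target.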

☐ Cache/Session Data

☐ Redis/ElastiCache global datastore (if needed)
☐ Session data in database (not local cache)
☐ Cart/checkout data replicated

☐ File Storage

☐ S3 cross-region replication enabled
☐ Replication rules cover all critical buckets
☐ Replication metrics monitored
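
Checking that replication rules cover all critical buckets can be scripted. The sketch below assumes a boto3-style S3 client passed in by the caller and uses its `get_bucket_replication` call. The function name is hypothetical, and the error code for a bucket with no replication configuration is an assumption that should be verified against the S3 API reference.

```python
def buckets_missing_replication(s3_client, critical_buckets):
    """Return the critical buckets that have no enabled replication rule.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    missing = []
    for bucket in critical_buckets:
        try:
            config = s3_client.get_bucket_replication(Bucket=bucket)
        except Exception as exc:
            # botocore's ClientError carries the error code in `exc.response`.
            # Assumption: this code means "no replication configuration exists".
            code = getattr(exc, "response", {}).get("Error", {}).get("Code")
            if code != "ReplicationConfigurationNotFoundError":
                raise
            missing.append(bucket)
            continue
        rules = config["ReplicationConfiguration"]["Rules"]
        # A configuration whose rules are all disabled replicates nothing.
        if not any(rule.get("Status") == "Enabled" for rule in rules):
            missing.append(bucket)
    return missing
```

With boto3 installed, a call would look like `buckets_missing_replication(boto3.client("s3"), ["orders-data", "user-uploads"])`, where the bucket names are placeholders. The check confirms only that a rule is enabled, not that replication is keeping up, so the replication metrics still need monitoring.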

Runbook Checklist

☐ Failover Runbook Includes:

☐ Decision criteria (when to fail over)
☐ Authorization process (who approves)
☐ Step-by-step failover procedure
☐ Verification steps (how to confirm success)
☐ Communication template (customer notification)
☐ Rollback procedure
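
Decision criteria are easier to apply under pressure when they are written down as explicit conditions. The sketch below shows one way to encode them. The rule itself (primary unhealthy for too long, standby healthy, replication lag within RPO) and all names are hypothetical examples, not a recommended policy. The output is a recommendation that still goes through the authorization process.

```python
from dataclasses import dataclass


@dataclass
class RegionState:
    primary_unhealthy_minutes: float  # how long primary health checks have been failing
    standby_healthy: bool             # standby passed its own health checks
    replication_lag_seconds: float    # data the standby is currently missing


def should_fail_over(state: RegionState,
                     max_outage_minutes: float,
                     rpo_seconds: float) -> bool:
    """Hypothetical rule: recommend failover only if waiting is no longer acceptable
    and the standby can take traffic without exceeding the RPO."""
    waited_too_long = state.primary_unhealthy_minutes >= max_outage_minutes
    standby_ready = (state.standby_healthy
                     and state.replication_lag_seconds <= rpo_seconds)
    return waited_too_long and standby_ready
```

For example, `should_fail_over(RegionState(12, True, 30), max_outage_minutes=10, rpo_seconds=300)` returns `True`. The same call with an unhealthy standby returns `False`, which the runbook should treat as an escalation rather than a reason to keep waiting.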

☐ Contact Information

☐ On-call rotation current
☐ Escalation path defined
☐ Vendor support contacts (AWS, etc.)
☐ Executive notification list

Testing Checklist

☐ Regular Testing

☐ Quarterly failover drills scheduled
☐ Backup restore tested monthly
☐ Runbooks reviewed after each test
☐ Test results documented
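
A testing cadence only helps if someone notices when a test is skipped. The sketch below flags overdue tests from a record of last run dates. The test names and the day counts used for "quarterly" and "monthly" are assumptions.

```python
from datetime import date, timedelta
from typing import Dict, List

# Cadences from the checklist, approximated in days.
MAX_AGE = {
    "failover_drill": timedelta(days=92),  # quarterly
    "backup_restore": timedelta(days=31),  # monthly
}


def overdue_tests(last_run: Dict[str, date], today: date) -> List[str]:
    """Return the names of tests that are overdue or have never been run."""
    overdue = []
    for name, max_age in MAX_AGE.items():
        ran_on = last_run.get(name)
        if ran_on is None or today - ran_on > max_age:
            overdue.append(name)
    return overdue


# Hypothetical history: drill 61 days ago, restore test 47 days ago.
history = {"failover_drill": date(2024, 4, 1), "backup_restore": date(2024, 4, 15)}
print(overdue_tests(history, date(2024, 6, 1)))  # ['backup_restore']
```

A test with no recorded run counts as overdue, so a newly added system shows up immediately rather than silently passing.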

☐ DR Test Scenarios

☐ Single AZ failure
☐ Full region failure
☐ Database corruption (point-in-time restore)
☐ DNS provider failure
☐ Third-party service outage

Recovery Procedure Template

## DR Event: [Event Type]
## Severity: [Critical/High/Medium]
## Date: [Date/Time UTC]

### 1. Detection
- [ ] Alert received: [Time]
- [ ] Initial assessment completed
- [ ] Severity confirmed

### 2. Decision
- [ ] Failover decision made by: [Name]
- [ ] Time of decision: [Time]
- [ ] Reason: [Brief description]

### 3. Failover Execution
- [ ] DNS failover initiated
- [ ] Database promotion started
- [ ] Application traffic redirected
- [ ] Verification tests passed

### 4. Communication
- [ ] Status page updated
- [ ] Customer notification sent
- [ ] Executive team notified

### 5. Post-Recovery
- [ ] Primary region status monitored
- [ ] Failback plan prepared
- [ ] Post-incident review scheduled
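
The timestamps recorded in this template are enough to compute the recovery time actually achieved, which can then be compared with the RTO in the post-incident review. A minimal sketch follows. The function name, the dictionary keys, and the example times are hypothetical, and the durations are measured from the alert, so they understate downtime if detection was slow.

```python
from datetime import datetime, timedelta
from typing import Dict


def timeline_summary(alert_received: datetime,
                     decision_made: datetime,
                     verification_passed: datetime) -> Dict[str, timedelta]:
    """Derive durations from the timestamps recorded in the template."""
    return {
        "time_to_decide": decision_made - alert_received,
        "time_to_execute": verification_passed - decision_made,
        "time_to_recover": verification_passed - alert_received,
    }


# Hypothetical event: alert at 14:02, decision at 14:10, verified at 14:31 (UTC).
summary = timeline_summary(
    datetime(2024, 3, 5, 14, 2),
    datetime(2024, 3, 5, 14, 10),
    datetime(2024, 3, 5, 14, 31),
)
print(summary["time_to_recover"])                           # 0:29:00
print(summary["time_to_recover"] <= timedelta(minutes=15))  # False: misses a 15-minute RTO
```

Splitting the total into decision time and execution time shows whether the next improvement belongs in the approval process or in the failover procedure itself.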

Key Takeaways

Define RTO/RPO per system instead of applying one target to everything
Test backups by actually restoring them
Document runbooks and keep them updated
Test failover regularly, not just once
Include third-party dependencies in DR planning
