CONCEPT Cited by 2 sources
Disaster recovery tiers (backup / pilot light / warm standby / active-active)¶
Definition¶
The canonical AWS-lineage disaster-recovery ladder: four tiers ordered by cost, complexity, and recovery time. Picking a tier is a choice of where to trade ongoing cost for lower RTO/RPO.
| Tier | Secondary state | Cost | RTO/RPO | Cross-partition fit |
|---|---|---|---|---|
| Backup and restore | Nothing running, periodic backup copies | Lowest | Hours–days | Second-partition backup bucket is feasible with manual copy tooling |
| Pilot light | Data tier replicated, compute tier stopped; built up only when needed | Low | Minutes–hours | Strong fit — duplicate infra cost dominates cross-partition budget |
| Warm standby | Full-stack running, smaller scale | Higher | Seconds–minutes | Needs more complex cross-partition network + data sync |
| Multi-site active-active | Full parallel production | Highest | Effectively zero | Most complex network synchronization across partitions |
(Source: sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty)
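The cost-for-RTO trade in the table can be read as a lookup: walk the ladder cheapest-first and stop at the first tier whose worst-case recovery time fits the target. A minimal sketch, assuming illustrative RTO thresholds (the table's bands are ranges, not exact numbers):

```python
from datetime import timedelta

# Tiers ordered cheapest-first, each paired with an assumed worst-case RTO
# drawn from the table's bands (hours-days, minutes-hours, seconds-minutes,
# effectively zero). The exact thresholds are illustrative, not AWS-published.
TIERS = [
    ("backup-and-restore", timedelta(days=2)),
    ("pilot-light", timedelta(hours=4)),
    ("warm-standby", timedelta(minutes=5)),
    ("multi-site-active-active", timedelta(seconds=0)),
]

def cheapest_tier(target_rto: timedelta) -> str:
    """Return the lowest-cost tier whose worst-case RTO meets the target."""
    for name, worst_case_rto in TIERS:
        if worst_case_rto <= target_rto:
            return name
    return "multi-site-active-active"
```

For example, a 30-minute RTO target skips backup-and-restore and pilot light and lands on warm standby.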
Why the ladder matters for cross-partition design¶
The AWS Sovereign Failover post is explicit that the same ladder applies across partition boundaries as across regions — only the mechanics change: "Standard cloud resilience models range from simple backups to multi-site setups, and can be implemented across multiple Availability Zones as well as multiple Regions. The same concept equally applies across multiple partitions."
What makes the partition-axis version more expensive at each tier:
- Backups need external tooling (S3 Cross-Region Replication doesn't work across partitions).
- Pilot light needs separate identities, PKI, and custom data synchronization into the replicated data tier.
- Warm standby adds the same network synchronization problem continuously instead of at failover time.
- Active-active needs cross-partition traffic shaping, which has to be built (no Route 53 cross-partition health checks, no Global Accelerator across partitions).
"Finally, warm standby or multi-site active-active setups mainly differ in the need for more complex network synchronization across partitions." (Source: sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty)
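The first bullet's "external tooling" has a concrete core: because S3 Cross-Region Replication does not cross partition boundaries, a cross-partition copy job must itself decide which objects are missing or stale in the second partition. A minimal sketch of that diff, with key→ETag maps standing in for real S3 listing responses (names are illustrative):

```python
# Given listings of the source and destination buckets as key -> ETag maps,
# return the keys the copy job still has to transfer: keys absent from the
# destination, or present with a different ETag (i.e. stale content).
def keys_to_copy(source: dict[str, str], dest: dict[str, str]) -> set[str]:
    return {key for key, etag in source.items() if dest.get(key) != etag}

# In a real copy job this diff would drive per-object GET-then-PUT calls
# through two separately authenticated SDK sessions, one per partition,
# since credentials are not shared across a partition boundary.
```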
Pilot light as the sweet spot for cross-partition¶
The post gives pilot light a specific endorsement for cross-partition use: "We can run an application pilot light in another partition. This greatly reduces the cost of the infrastructure required in the second partition because it will only be built up when needed."
Reasons it's the cross-partition default:
- Second partition's steady-state spend is just the replicated data tier, not compute.
- Duplicate IaC is tested periodically (via DR drills) rather than continuously, so the "infrastructure drift" risk is on a weekly / monthly cadence rather than a real-time one.
- Matches the typical cross-partition demand curve — rare, discrete failover events driven by digital-sovereignty shifts, not minute-scale AZ failures.
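The first bullet's cost argument can be made concrete with a toy steady-state cost model: pilot light pays only for the replicated data tier until failover, warm standby also keeps a scaled-down compute tier running, active-active pays for everything. All dollar figures and the 30% warm-standby scale factor below are made-up inputs, not AWS pricing:

```python
# Illustrative second-partition steady-state spend per month, by tier.
# data_tier / compute_tier are the full-scale monthly costs; warm_scale
# is an assumed fraction for the reduced-scale warm-standby compute.
def monthly_standby_cost(data_tier: float, compute_tier: float,
                         tier: str, warm_scale: float = 0.3) -> float:
    if tier == "pilot-light":
        return data_tier                      # compute built up only on failover
    if tier == "warm-standby":
        return data_tier + compute_tier * warm_scale
    if tier == "active-active":
        return data_tier + compute_tier       # full parallel production
    raise ValueError(f"unknown tier: {tier}")
```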
Disaster taxonomy → tier selection¶
The post names three disaster classes, each pushing you toward a different answer:
- Natural — mitigated by choosing regions in different geographic zones or with different geographic features; any tier can stay in-partition.
- Technical — mitigated by choosing regions on independent parts of the global technical infrastructure (power grids, networks); any tier can stay in-partition.
- Human-driven — political, socioeconomic, legal; this is the class that pushes you across the partition boundary, and the one paired with the sovereignty framing.
"The choice of Regions depends on the type of disaster you want to mitigate." (Source: sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty)
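Read as a lookup, the taxonomy is small: natural and technical disasters are handled by region choice within one partition, and only the human-driven class forces the isolation boundary out to the partition level. A trivial encoding of that mapping:

```python
# Disaster class -> smallest isolation boundary that mitigates it,
# per the taxonomy above. Labels are this page's wording, not an AWS API.
ISOLATION_BOUNDARY = {
    "natural": "cross-region (in-partition)",
    "technical": "cross-region (in-partition)",
    "human-driven": "cross-partition",
}

def boundary_for(disaster_class: str) -> str:
    return ISOLATION_BOUNDARY[disaster_class]
```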
Relationship to this wiki's existing reliability patterns¶
- patterns/multi-cluster-active-active-redundancy — Figma's three-EKS-cluster topology is the cluster-level active-active instantiation, one level below the partition-level instantiation. Same shape at different scales.
- concepts/active-multi-cluster-blast-radius — same blast-radius reasoning applied to clusters instead of partitions.
The pattern is self-similar: pick an isolation boundary (AZ / region / cluster / partition), pick a DR tier, pay the cost.
Seen in¶
- sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty — names the full ladder; endorses pilot-light for cross-partition by default; names the incremental cross-partition cost at each tier.
- sources/2026-03-31-aws-streamlining-access-to-dr-capabilities — quantifies the ladder with AWS-native primitives + RPO/RTO numbers: backup-and-restore via AWS Backup (hours-to-days RPO/RTO); pilot light + warm standby via AWS DRS (seconds RPO, 5–20-min RTO with continuous block-level replication); introduces the cross-Region vs cross-account orthogonal isolation axes on top of the ladder.
Native-AWS-primitive mapping (within a single partition)¶
| Tier | Data | Compute | Native primitive |
|---|---|---|---|
| Backup-and-restore | Snapshots / backups | Not running | AWS Backup + EventBridge + Lambda automation |
| Pilot light | Continuously replicated | Stopped / minimal | AWS DRS (staging); compute instantiated on failover |
| Warm standby | Continuously replicated | Running at reduced scale | AWS DRS + launched instances |
| Multi-site active-active | Bidirectional live | Full parallel production | Aurora Global Database / S3 CRR / Route 53 traffic-shift — not covered by a single AWS-Backup/DRS primitive |
Full-workload recovery across these tiers (networking, IAM, config translation) is typically packaged by AWS Resilience Competency Partners (e.g. Arpio).
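For the backup-and-restore row, the scheduled automation ultimately reduces to one check: is the newest recovery point still younger than the RPO? A hypothetical sketch of that check, which a scheduled rule could run and alert on (the 24-hour RPO default is an example value, not an AWS constant):

```python
from datetime import timedelta

# Returns True while the newest recovery point is young enough to satisfy
# the recovery point objective; False signals an RPO breach worth alerting on.
def rpo_satisfied(newest_backup_age: timedelta,
                  rpo: timedelta = timedelta(hours=24)) -> bool:
    return newest_backup_age <= rpo
```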
Related¶
- patterns/cross-partition-failover — the containing pattern
- patterns/pilot-light-deployment, patterns/warm-standby-deployment — two specific tiers as pattern pages
- patterns/multi-cluster-active-active-redundancy — same shape, cluster-level instantiation
- concepts/aws-partition — the isolation boundary this ladder is applied across
- concepts/digital-sovereignty — the demand