CONCEPT

Disaster recovery tiers (backup / pilot light / warm standby / active-active)

Definition

The canonical AWS-lineage disaster-recovery ladder: four tiers ordered by cost, complexity, and recovery time. Picking a tier is choosing how much ongoing cost to trade for lower RTO/RPO.

| Tier | Secondary state | Cost | RTO/RPO | Cross-partition fit |
| --- | --- | --- | --- | --- |
| Backup and restore | Nothing running; periodic backup copies | Lowest | Hours–days | Second-partition backup bucket is feasible with manual copy tooling |
| Pilot light | Data tier replicated; compute tier stopped, built up only when needed | Low | Minutes–hours | Strong fit — duplicate infrastructure cost dominates the cross-partition budget |
| Warm standby | Full stack running at smaller scale | Higher | Seconds–minutes | Needs more complex cross-partition network and data sync |
| Multi-site active-active | Full parallel production | Highest | Effectively zero | Most complex network synchronization across partitions |

(Source: sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty)

Why the ladder matters for cross-partition design

The AWS Sovereign Failover post is explicit that the same ladder applies across partition boundaries as across regions — only the mechanics change: "Standard cloud resilience models range from simple backups to multi-site setups, and can be implemented across multiple Availability Zones as well as multiple Regions. The same concept equally applies across multiple partitions."

What makes the partition-axis version more expensive at each tier:

  • Backups need external tooling (S3 Cross-Region Replication doesn't work across partitions).
  • Pilot light needs separate identities, PKI, and custom data-synchronization to the replicated data tier.
  • Warm standby adds the same network synchronization problem continuously instead of at failover time.
  • Active-active needs cross-partition traffic shaping, which has to be built (no Route 53 cross-partition health checks, no Global Accelerator across partitions).
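The first bullet can be made concrete. Since S3 Cross-Region Replication does not span partitions, backup copies must transit a host that holds credentials for both sides. A minimal sketch, assuming hypothetical bucket names and S3 clients built from two separate partition-specific sessions (e.g. `boto3.Session(profile_name=...).client("s3")`):

```python
# Sketch of cross-partition backup copy. S3 CRR does not work across
# partitions, so objects are pulled from one partition and pushed to the
# other. Bucket names and client wiring are assumptions for illustration.

def objects_to_copy(src_keys, dst_keys):
    """Pure planning step: keys present in the source bucket that the
    second partition's backup bucket does not yet have."""
    return sorted(set(src_keys) - set(dst_keys))

def copy_backup(src_s3, dst_s3, src_bucket, dst_bucket):
    """src_s3 / dst_s3 are S3 clients carrying partition-specific
    credentials -- each partition needs its own identity."""
    def list_keys(s3, bucket):
        return [
            obj["Key"]
            for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket)
            for obj in page.get("Contents", [])
        ]
    for key in objects_to_copy(list_keys(src_s3, src_bucket),
                               list_keys(dst_s3, dst_bucket)):
        # No server-side copy exists across partitions: the bytes must
        # transit the copy host itself.
        body = src_s3.get_object(Bucket=src_bucket, Key=key)["Body"].read()
        dst_s3.put_object(Bucket=dst_bucket, Key=key, Body=body)
```

The planning step is kept pure so it can be dry-run and tested without touching either partition.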

"Finally, warm standby or multi-site active-active setups mainly differ in the need for more complex network synchronization across partitions." (Source: sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty)

Pilot light as the sweet spot for cross-partition

The post gives pilot light a specific endorsement for cross-partition use: "We can run an application pilot light in another partition. This greatly reduces the cost of the infrastructure required in the second partition because it will only be built up when needed."

Reasons it's the cross-partition default:

  • Second partition's steady-state spend is just the replicated data tier, not compute.
  • Duplicate IaC is tested periodically (via DR drills) rather than continuously, so the "infrastructure drift" risk is on a weekly / monthly cadence rather than a real-time one.
  • Matches the typical cross-partition demand curve — rare, discrete failover events driven by digital-sovereignty shifts, not minute-scale AZ failures.
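The build-up-on-failover sequence the post describes can be sketched as an ordered runbook: the data tier is promoted first (it is the only part already running), compute is instantiated from the periodically tested IaC, then traffic shifts. Step names here are illustrative assumptions, not AWS APIs:

```python
# Sketch of a pilot-light failover runbook. The data tier is already
# replicated in the second partition; everything else is built up only
# when needed. Step names are hypothetical labels, not service calls.
FAILOVER_STEPS = [
    "promote-replicated-database",        # data tier: replica -> primary
    "apply-iac-compute-stack",            # build compute from tested IaC
    "scale-out-to-production-capacity",   # pilot light -> full size
    "shift-traffic-to-second-partition",  # DNS / client-side routing
]

def next_step(completed):
    """Return the next runbook step, or None once failover is complete."""
    for step in FAILOVER_STEPS:
        if step not in completed:
            return step
    return None
```

Encoding the runbook as data rather than ad-hoc scripts is what makes the periodic DR drills mentioned above repeatable.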

Disaster taxonomy → tier selection

The post names three disaster classes, each pushing you toward a different answer:

  • Natural — mitigated by regions in different geographic zones and features; any tier can stay in-partition.
  • Technical — mitigated by independent parts of the global technical infrastructure (power grids, networks); any tier can stay in-partition.
  • Human-driven — political, socioeconomic, legal; this is the class that pushes you across the partition boundary, and it pairs with the digital-sovereignty framing.

"The choice of Regions depends on the type of disaster you want to mitigate." (Source: sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty)
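The taxonomy reduces to a small decision function: only the human-driven class forces the partition boundary, and the tier choice is then a separate RTO/RPO trade-off. A sketch, with the class labels taken from the list above:

```python
# Sketch: mapping the post's disaster taxonomy to an isolation boundary.
# Natural and technical disasters can be handled with multi-region
# designs inside one partition; human-driven (political, socioeconomic,
# legal) disasters are the class that requires a second partition.
def isolation_boundary(disaster_class):
    if disaster_class in ("natural", "technical"):
        return "multi-region, same partition"
    if disaster_class == "human-driven":
        return "second partition"
    raise ValueError(f"unknown disaster class: {disaster_class}")
```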

Relationship to this wiki's existing reliability patterns

The pattern is self-similar: pick an isolation boundary (AZ / region / cluster / partition), pick a DR tier, pay the cost.


Native-AWS-primitive mapping (within a single partition)

| Tier | Data | Compute | Native primitive |
| --- | --- | --- | --- |
| Backup and restore | Snapshots / backups | Not running | AWS Backup + EventBridge + Lambda automation |
| Pilot light | Continuously replicated | Stopped / minimal | AWS Elastic Disaster Recovery (DRS) staging; compute instantiated on failover |
| Warm standby | Continuously replicated | Running at reduced scale | AWS DRS + launched instances |
| Multi-site active-active | Bidirectional live replication | Full parallel production | Aurora Global Database / S3 CRR / Route 53 traffic shift — not covered by a single AWS Backup / DRS primitive |

Full-workload recovery across these tiers (networking, IAM, config translation) is typically packaged by AWS Resilience Competency Partners (e.g. Arpio).
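One piece that always has to be built, whichever tier is chosen, is the failover trigger itself: Route 53 health checks and Global Accelerator do not span partitions. A minimal sketch of a custom health poller with a consecutive-failure threshold (endpoint URLs and the threshold value are assumptions):

```python
# Sketch: custom cross-partition failover detection, since no native
# health-check primitive spans partitions. URLs and thresholds are
# illustrative assumptions.
from urllib.request import urlopen

def endpoint_healthy(url, timeout=3):
    """Probe an endpoint; any network error or timeout counts as down."""
    try:
        return 200 <= urlopen(url, timeout=timeout).status < 300
    except OSError:
        return False

def choose_active(primary_healthy, failures, threshold=3):
    """Pure decision step: fail over only after `threshold` consecutive
    failures, to avoid flapping on transient cross-partition network
    issues. Returns (active_side, updated_failure_count)."""
    failures = 0 if primary_healthy else failures + 1
    side = "secondary" if failures >= threshold else "primary"
    return side, failures
```

Keeping the decision pure separates the noisy probing from the failover policy, so the threshold logic can be tested without live endpoints in either partition.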
