CONCEPT Cited by 6 sources

RPO / RTO (recovery point / time objectives)

Definition

The two canonical Disaster Recovery budget dimensions:

  • RPO (Recovery Point Objective): the maximum acceptable data loss, measured as the time between the last recoverable point and the disaster. RPO = "how much work am I willing to lose?"
  • RTO (Recovery Time Objective): the maximum acceptable downtime before the workload is operational again in the recovery environment. RTO = "how long am I willing to be down?"

Both are business-driven budgets — not engineering specifications. They are chosen first, then the DR tier is chosen to meet them.

Order-of-magnitude mapping to DR tiers

The DR ladder is essentially an RPO/RTO-vs-cost trade curve:

Tier | RPO | RTO | Cost
Backup-and-restore | Hours (snapshot interval) | Hours–days | Lowest
Pilot light | Seconds–minutes (with continuous replication) | Minutes–hours (compute cold-start) | Low
Warm standby | Seconds | Seconds–minutes | Higher
Multi-site active-active | ~0 (continuous dual-write) | ~0 (already serving) | Highest

Canonical AWS-primitive RPO/RTO

Primitive | RPO | RTO | Canonical wiki source
systems/aws-backup | Hours (schedule-based) | Hours (restore time) | sources/2026-03-31-aws-streamlining-access-to-dr-capabilities
EBS snapshots / AMIs | Snapshot interval (hours) | Minutes–hours | sources/2026-03-31-aws-streamlining-access-to-dr-capabilities
AWS DRS | Seconds (crash-consistent, continuous) | 5–20 minutes typical | sources/2026-03-31-aws-streamlining-access-to-dr-capabilities

Why both are needed

RPO and RTO can have opposite cost drivers:

  • Cheap RPO, expensive RTO: continuous replication to cold staging. Data loss is near zero, but recovery is slow because compute must be started from cold.
  • Cheap RTO, expensive RPO: warm standby of stateless compute with infrequent snapshots. Recovery is fast, but more data is lost.
  • Both small: multi-site active-active. You pay for both continuous replication and a continuously live secondary, which makes this the most expensive tier.

DR sizing almost always starts with "our business can tolerate ≤ X minutes of data loss and ≤ Y minutes of downtime" and works backwards to pick the tier.
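
The "work backwards from the budget" step can be sketched as a lookup over the tier ladder. This is only an illustration: the `Tier` type, the `cheapest_tier` helper, and every number in `TIERS` are assumptions chosen to approximate the order-of-magnitude table above, not figures from any source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    rpo_s: float  # assumed worst-case data-loss window, in seconds
    rto_s: float  # assumed worst-case downtime, in seconds


# Ordered cheapest first. All numbers are illustrative assumptions that
# roughly follow the order-of-magnitude tier table; they are not vendor data.
TIERS = [
    Tier("backup-and-restore", rpo_s=24 * 3600, rto_s=48 * 3600),
    Tier("pilot light", rpo_s=60, rto_s=4 * 3600),
    Tier("warm standby", rpo_s=30, rto_s=10 * 60),
    Tier("multi-site active-active", rpo_s=0, rto_s=0),
]


def cheapest_tier(max_rpo_s: float, max_rto_s: float) -> Optional[Tier]:
    """Return the first (cheapest) tier whose assumed RPO and RTO both fit the budget."""
    for tier in TIERS:
        if tier.rpo_s <= max_rpo_s and tier.rto_s <= max_rto_s:
            return tier
    return None


# "We can tolerate <= 15 minutes of data loss and <= 60 minutes of downtime."
choice = cheapest_tier(max_rpo_s=15 * 60, max_rto_s=60 * 60)
print(choice.name if choice else "no tier fits")
```

In practice the per-tier numbers would come from measured restore drills rather than a static table, but the selection logic keeps the same shape: the business budget is the input, and the tier is the output.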

RPO/RTO in the cross-partition axis

sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty applies the same RPO/RTO framing to the cross-partition axis: the tiers are the same, but each is more expensive because there is no cross-partition equivalent of S3 Cross-Region Replication, Transit Gateway, or Route 53 cross-Region health checks. The RPO/RTO numbers are also the basis for picking pilot light as the cross-partition default: its RPO/RTO is acceptable for the discrete, sovereignty-driven failover demand profile.

Seconds-RPO / seconds-RTO on a streaming cluster via hot-standby clone (Redpanda 25.3)

Redpanda Shadowing (2025-11-06 preview) adds a third point on Redpanda's cross-region DR axis: seconds-RPO + seconds-RTO via a hot-standby clone in a second region, without the per-write cross-region RTT cost of a stretch cluster and without the connector overhead of MM2.

Canonical verbatim from the 25.3 launch post (sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more):

"Recovery Point Objectives (RPOs) measured in a few seconds and similar Recovery Time Objectives (RTOs), limited only by timeout settings for producers and consumers."

The three-point Redpanda DR axis on RPO/RTO:

Shape | RPO | RTO | Per-write cost
Stretch cluster | 0 | Seconds (Raft re-election) | Cross-region RTT on every acks=all
Shadowing (25.3) | Seconds | Seconds (client timeout-bound) | None (async)
MM2 | Seconds–minutes (lag-dependent) | Seconds + offset-translation-map lookup | None (async)

The Shadowing RTO is strictly shorter than MM2's because offset preservation removes the translation-map step from consumer failover.

SLA-RPO vs measured-case-RPO (Shadow Linking scale-test)

The 2026-04-21 Shadow Linking deep-dive reports a scale-tested RPO two to three orders of magnitude better than the 25.3 SLA (milliseconds rather than "a few seconds"):

"I recently scale-tested shadowing, driving the source cluster at 2.5 GiB/s. During that experiment, I was able to replicate with a total lag (across all topics) that was consistently lower than 10,000 messages — on a workload producing 2.5 million messages per second — giving us an effective RPO of around 4 milliseconds on average."

Quantity | Value
25.3 SLA-RPO | "measured in a few seconds"
2026-04-21 measured-case RPO | ~4 ms average
Source throughput at test | 2.5 GiB/s
Message rate at test | 2.5 M msg/s
Total-cluster lag at test | <10,000 messages

The measured RPO is derived from message-count lag ÷ message rate: 10,000 messages ÷ 2.5 M msg/s = 4 ms. The 4 ms number is the broker's internal lag converted to wall-clock time at the test workload's production rate.
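
The lag-to-time conversion can be sketched in a few lines. The function name `effective_rpo_seconds` is hypothetical; the inputs are the figures quoted from the scale test, and the average message size is simply derived from the two quoted rates.

```python
def effective_rpo_seconds(lag_messages: float, messages_per_second: float) -> float:
    """Convert a replication lag in messages into seconds of potential data loss."""
    if messages_per_second <= 0:
        raise ValueError("messages_per_second must be positive")
    return lag_messages / messages_per_second


# Figures quoted from the 2026-04-21 scale test.
LAG_MESSAGES = 10_000          # upper bound on total lag across all topics
MESSAGE_RATE = 2_500_000       # messages per second
THROUGHPUT_BYTES = 2.5 * 2**30  # 2.5 GiB/s

rpo_s = effective_rpo_seconds(LAG_MESSAGES, MESSAGE_RATE)
avg_message_bytes = THROUGHPUT_BYTES / MESSAGE_RATE  # implied average message size

print(f"effective RPO: {rpo_s * 1000:.1f} ms")
print(f"implied average message size: {avg_message_bytes:.0f} bytes")
```

The same lag in messages translates to a larger RPO on a slower workload, which is one reason a measured figure from one benchmark should not be read as a guarantee.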

SLA-RPO and measured-RPO are different quantities: the SLA is what the vendor commits to regardless of workload, while the measured case is what the vendor achieved on one specific workload. Operators sizing DR should treat the SLA number as the planning ceiling, since their own workload may replicate more slowly than a benchmark, but can reasonably expect better than the SLA in the common case.

RPO=0 on a multi-region streaming cluster

The canonical RPO=0 shape in streaming/messaging is the multi-region stretch cluster — a single cluster whose per-partition Raft groups span regions. Every acks=all write is acknowledged only after a cross-region quorum has persisted it, so no acknowledged write is lost on region failure. Leader re-election from a surviving in-sync replica is automatic via Raft, yielding low RTO as well.

From Redpanda's 2025-02-11 stretch-clusters post:

"Unlike in asynchronous replication, where you have two separate clusters with MirrorMaker2 replication between them and a non-zero RPO, multi-region clusters have RPO=0 and very low RTO when there is a region-level outage. This is because new leaders are automatically elected — as part of the Raft protocol — in the surviving regions when any region goes down. The replication factor on the cluster or topics tells you how many region failures can be tolerated for the cluster to continue to serve the application layer."

The trade-off is per-write cross-region RTT on every acks=all commit (30-80 ms regional, 150+ ms transoceanic). This is the canonical latency-cost-of-RPO=0 on a streaming workload. The alternative shape — MirrorMaker2 async between two independent clusters — pays zero cross-region wait per write but concedes non-zero RPO (= replication lag at outage).
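
To give the latency cost a rough scale, the sketch below computes the write-rate ceiling for a single producer that waits for each acks=all acknowledgement before sending the next record. It is a simplified model under stated assumptions: the helper name `serial_acks_all_ceiling` is hypothetical, only the cross-region RTT is counted, and batching or multiple in-flight requests (which real producers normally use) would raise the ceiling considerably.

```python
def serial_acks_all_ceiling(rtt_ms: float) -> float:
    """Upper bound on writes per second for one producer with a single
    in-flight acks=all request, counting only the cross-region round trip."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return 1000.0 / rtt_ms


# RTT values taken from the ranges quoted in the text above.
for label, rtt_ms in [
    ("regional, low end", 30),
    ("regional, high end", 80),
    ("transoceanic", 150),
]:
    ceiling = serial_acks_all_ceiling(rtt_ms)
    print(f"{label}: {rtt_ms} ms RTT -> about {ceiling:.1f} serial writes/s")
```

The asynchronous shapes avoid this per-write wait entirely, and pay for it with an RPO equal to the replication lag at the moment of the outage.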
