CONCEPT Cited by 6 sources
RPO / RTO (recovery point / time objectives)¶
Definition¶
The two canonical Disaster Recovery budget dimensions:
- RPO — Recovery Point Objective — the maximum acceptable data loss, measured as the time between the last recoverable point and the disaster. RPO = "how much work am I willing to lose?"
- RTO — Recovery Time Objective — the maximum acceptable duration of downtime before the workload is operational again in the recovery environment. RTO = "how long am I willing to be down?"
Both are business-driven budgets — not engineering specifications. They are chosen first, then the DR tier is chosen to meet them.
Order-of-magnitude mapping to DR tiers¶
The DR ladder is essentially an RPO/RTO-vs-cost trade curve:
| Tier | RPO | RTO | Cost |
|---|---|---|---|
| Backup-and-restore | Hours (snapshot interval) | Hours–days | Lowest |
| Pilot light | Seconds–minutes (with continuous replication) | Minutes–hours (compute cold-start) | Low |
| Warm standby | Seconds | Seconds–minutes | Higher |
| Multi-site active-active | ~0 (continuous dual-write) | ~0 (already serving) | Highest |
Canonical AWS-primitive RPO/RTO¶
| Primitive | RPO | RTO | Canonical wiki source |
|---|---|---|---|
| systems/aws-backup | Hours (schedule-based) | Hours (restore time) | sources/2026-03-31-aws-streamlining-access-to-dr-capabilities |
| EBS snapshots / AMIs | Snapshot interval (hours) | Minutes–hours | sources/2026-03-31-aws-streamlining-access-to-dr-capabilities |
| AWS DRS | Seconds (crash-consistent, continuous) | 5–20 minutes typical | sources/2026-03-31-aws-streamlining-access-to-dr-capabilities |
Why both are needed¶
RPO and RTO can have opposite cost drivers:
- Cheap RPO, expensive RTO: continuous replication to cold staging gives near-zero data loss but a long recovery time.
- Cheap RTO, expensive RPO: warm standby of stateless compute with infrequent snapshots — recover fast, lose more data.
- Both small: multi-site active-active — pay for both continuous replication and continuously-live secondary — the most expensive tier.
DR sizing almost always starts with "our business can tolerate ≤ X minutes of data loss and ≤ Y minutes of downtime" and works backwards to pick the tier.
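The "work backwards from the budget" step can be sketched as a lookup over the tier ladder. This is a minimal illustration, not a sizing tool: the `Tier` class, the `cheapest_tier` helper, and every numeric bound in `TIERS` are hypothetical placeholders chosen to sit loosely within the order-of-magnitude table above, and real values would come from measuring your own environment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    rpo_s: float  # assumed worst-case data loss, in seconds
    rto_s: float  # assumed worst-case downtime, in seconds


# Ordered cheapest first. The bounds are illustrative assumptions only,
# not vendor guarantees and not figures from the cited sources.
TIERS = [
    Tier("backup-and-restore", rpo_s=24 * 3600, rto_s=48 * 3600),
    Tier("pilot light", rpo_s=5 * 60, rto_s=4 * 3600),
    Tier("warm standby", rpo_s=10, rto_s=10 * 60),
    Tier("multi-site active-active", rpo_s=1, rto_s=1),
]


def cheapest_tier(rpo_budget_s: float, rto_budget_s: float) -> Optional[Tier]:
    """Return the cheapest tier whose assumed RPO and RTO both fit the budget."""
    for tier in TIERS:
        if tier.rpo_s <= rpo_budget_s and tier.rto_s <= rto_budget_s:
            return tier
    return None  # no tier in this table meets the budget


# Business tolerates 15 minutes of data loss and 8 hours of downtime.
print(cheapest_tier(15 * 60, 8 * 3600).name)
```

Because the list is ordered by cost, the first match is the cheapest tier that satisfies both budgets; a `None` result signals that the budget is tighter than anything the table models.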
RPO/RTO in the cross-partition axis¶
sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty applies the same RPO/RTO framing to the cross-partition axis: the tiers are the same, but each one is more expensive because there is no cross-partition equivalent of S3 Cross-Region Replication, Transit Gateway, or Route 53 cross-Region health checks. The RPO/RTO numbers are also the basis for picking pilot light as the cross-partition default, since its RPO/RTO is acceptable for the discrete, sovereignty-driven failover demand profile.
Seconds-RPO / seconds-RTO on a streaming cluster via hot-standby clone (Redpanda 25.3)¶
Redpanda Shadowing (2025-11-06 preview) adds a third point on Redpanda's cross-region DR axis: seconds-RPO + seconds-RTO via a hot-standby clone in a second region, without the per-write cross-region RTT cost of a stretch cluster and without the connector overhead of MM2.
Verbatim from the 25.3 launch post (sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more):
"Recovery Point Objectives (RPOs) measured in a few seconds and similar Recovery Time Objectives (RTOs), limited only by timeout settings for producers and consumers."
The three-point Redpanda DR axis on RPO/RTO:
| Shape | RPO | RTO | Per-write cost |
|---|---|---|---|
| Stretch cluster | 0 | Seconds (Raft re-election) | Cross-region RTT on every acks=all |
| Shadowing (25.3) | Seconds | Seconds (client timeout-bound) | None (async) |
| MM2 | Seconds–minutes (lag-dependent) | Seconds + offset-translation-map lookup | None (async) |
The Shadowing RTO is strictly shorter than MM2's because offset preservation removes the translation-map step from consumer failover.
SLA-RPO vs measured-case-RPO (Shadow Linking scale-test)¶
The 2026-04-21 Shadow Linking deep-dive reports a scale-tested RPO two orders of magnitude better than the 25.3 SLA:
"I recently scale-tested shadowing, driving the source cluster at 2.5 GiB/s. During that experiment, I was able to replicate with a total lag (across all topics) that was consistently lower than 10,000 messages — on a workload producing 2.5 million messages per second — giving us an effective RPO of around 4 milliseconds on average."
| Quantity | Value |
|---|---|
| 25.3 SLA-RPO | "measured in a few seconds" |
| 2026-04-21 measured-case RPO | ~4 ms average |
| Source throughput at test | 2.5 GiB/s |
| Message rate at test | 2.5 M msg/s |
| Total-cluster lag at test | <10,000 messages |
The measured RPO is derived from message-count lag ÷ throughput. The 4 ms number is the broker's internal lag converted to wall-clock at the test workload's production rate.
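The conversion from message-count lag to wall-clock RPO is a single division, sketched below with the figures quoted from the scale test. The function name `rpo_from_lag` is a hypothetical helper for illustration, not part of any Redpanda API.

```python
def rpo_from_lag(lag_messages: float, produce_rate_msgs_per_s: float) -> float:
    """Return the wall-clock RPO in seconds implied by a message-count lag."""
    if produce_rate_msgs_per_s <= 0:
        raise ValueError("produce rate must be positive")
    return lag_messages / produce_rate_msgs_per_s


# Figures from the quoted scale test: 10,000 messages of lag at 2.5 M msg/s.
rpo_s = rpo_from_lag(10_000, 2_500_000)
print(f"{rpo_s * 1000:.1f} ms")  # 4.0 ms
```

The same lag at a lower production rate yields a larger wall-clock RPO, which is why the result only holds for the workload it was measured on.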
SLA-RPO and measured-RPO are different quantities: the SLA is what the vendor guarantees regardless of workload, while the measured case is what the vendor achieved on one specific workload. Operators sizing DR should treat the SLA number as the planning ceiling, because their own workload may replicate more slowly than the benchmark, while still reasonably expecting better-than-SLA RPO in the common case.
RPO=0 on a multi-region streaming cluster¶
The canonical RPO=0 shape in streaming/messaging is the multi-region stretch cluster: a single cluster whose per-partition Raft groups span regions. Every acks=all write is acknowledged only after a cross-region quorum has persisted it, so no acknowledged write is lost on region failure. Leader re-election from a surviving in-sync replica is automatic via Raft, yielding low RTO as well.
From Redpanda's 2025-02-11 stretch-clusters post:
"Unlike in asynchronous replication, where you have two separate clusters with MirrorMaker2 replication between them and a non-zero RPO, multi-region clusters have RPO=0 and very low RTO when there is a region-level outage. This is because new leaders are automatically elected — as part of the Raft protocol — in the surviving regions when any region goes down. The replication factor on the cluster or topics tells you how many region failures can be tolerated for the cluster to continue to serve the application layer."
The trade-off is per-write cross-region RTT on every acks=all commit (30–80 ms regional, 150+ ms transoceanic). This is the canonical latency-cost-of-RPO=0 on a streaming workload. The alternative shape, MirrorMaker2 async between two independent clusters, pays zero cross-region wait per write but concedes non-zero RPO (= replication lag at outage).
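The remark in the quote above, that replication factor determines how many region failures can be tolerated, follows from Raft's majority rule: a partition stays available only while more than half of its replicas survive. The sketch below is a simplified model under that rule. `region_failures_tolerated` and its replica-placement input are hypothetical, and it does not reflect Redpanda's actual placement logic.

```python
from typing import Sequence


def region_failures_tolerated(replicas_per_region: Sequence[int]) -> int:
    """Worst-case number of whole-region failures a Raft group can survive.

    replicas_per_region[i] is the number of replicas of one partition placed
    in region i (an assumed, fixed placement).
    """
    total = sum(replicas_per_region)
    quorum = total // 2 + 1  # Raft needs a strict majority of all replicas
    surviving = total
    failures = 0
    # Worst case: the regions holding the most replicas fail first.
    for lost in sorted(replicas_per_region, reverse=True):
        if surviving - lost < quorum:
            break
        surviving -= lost
        failures += 1
    return failures


print(region_failures_tolerated([1, 1, 1]))        # RF=3 over 3 regions -> 1
print(region_failures_tolerated([1, 1, 1, 1, 1]))  # RF=5 over 5 regions -> 2
print(region_failures_tolerated([2, 1]))           # RF=3 over 2 regions -> 0
```

The last case shows why a two-region layout cannot deliver region-level fault tolerance under a majority quorum: losing the region that holds two of the three replicas also loses the quorum.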
Seen in¶
- sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more — canonical wiki source for seconds-RPO + seconds-RTO via a hot-standby clone on a streaming substrate. Redpanda 25.3 Shadowing is the three-point axis's third distinct shape (alongside RPO=0 stretch cluster and non-zero-RPO MM2). Offset-preservation is the load-bearing property that bounds RTO to client-timeout settings rather than translation-map lag.
- sources/2026-04-21-redpanda-me-and-my-shadow-link-disaster-recovery-replication-made-easy — mechanism + performance deep-dive on Shadow Linking. Canonicalises measured-case RPO ~4 ms at 2.5 GiB/s / 2.5 M msg/s / <10k msg total-cluster lag — the first per-feature scale-tested RPO number for Shadowing, two orders of magnitude better than the 25.3 "few seconds" SLA. Also canonicalises message-count lag as the broker-native RPO measurement dimension (with wall-clock RPO derived as lag ÷ throughput) and per-topic vs whole-link failover granularity (see concepts/per-topic-granularity-failover) as the sub-link RPO/RTO granularity primitive matching app-level outage scope.
- sources/2025-04-23-redpanda-need-for-speed-9-tips-to-supercharge-redpanda — canonicalises consumer commit frequency as the RPO dial for the consume side of a streaming pipeline. Verbatim: "In a Disaster Recovery (DR) context, be aware of your Recovery Point Objectives (RPOs) and use those to help define your minimum commit frequency." A consumer committing every C seconds can lose up to C seconds of processing progress on restart, so commit frequency is the RPO for the consume side. See concepts/offset-commit-cost for the full framing.
- sources/2025-02-11-redpanda-high-availability-deployment-multi-region-stretch-clusters — canonical wiki source for RPO=0 on a multi-region streaming cluster via per-partition Raft quorum across regions. Frames RPO=0 + low RTO as the load-bearing property of the stretch-cluster shape, contrasted against MirrorMaker2 async replication's non-zero RPO. Calibrates region-failure tolerance with replication factor ("The replication factor on the cluster or topics tells you how many region failures can be tolerated for the cluster to continue to serve the application layer").
- sources/2026-03-31-aws-streamlining-access-to-dr-capabilities — canonical wiki reference; quantifies DRS's seconds-RPO / 5–20-min-RTO; frames the per-tier RPO/RTO tradeoff.
- sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty — applies the same ladder to cross-partition failover; argues pilot-light as the cross-partition RPO/RTO sweet spot.
Related¶
- concepts/disaster-recovery-tiers — the ladder ordered by RPO/RTO.
- concepts/crash-consistent-replication — the consistency model that makes seconds-RPO feasible without app cooperation.
- concepts/multi-region-stretch-cluster — the canonical RPO=0 shape on a streaming/messaging cluster.
- concepts/mirrormaker2-async-replication — the non-zero-RPO async alternative shape.
- concepts/offset-preserving-replication — the structural property that bounds Shadowing's RTO to client-timeout.
- concepts/broker-internal-cross-cluster-replication — the architectural distinction that enables offset preservation.
- systems/aws-backup, systems/aws-elastic-disaster-recovery — the two AWS-native primitives spanning the tier space.
- systems/redpanda, systems/redpanda-shadowing, systems/kafka — the streaming substrate where RPO=0 via Raft quorum and seconds-RPO via broker-native hot-standby clone are canonicalised on the wiki.
- patterns/pilot-light-deployment, patterns/warm-standby-deployment — specific tier patterns.
- patterns/hot-standby-cluster-for-dr — the DR shape Shadowing instantiates.
- patterns/offset-preserving-async-cross-region-replication — Shadowing's specific composition.
- patterns/multi-region-raft-quorum — the streaming-substrate pattern that produces RPO=0.
- patterns/async-replication-for-cross-region — the pattern family RPO-non-zero cross-region replication falls under.