Hot-standby cluster for DR¶
Pattern¶
Run a continuously-up secondary cluster in a different region / failure domain, receiving async replication from the primary. The secondary is functional — "a fully functional, hot-standby clone" — ready to accept reads and writes within seconds of failover.
Hot-standby is the high-availability / low-RPO / low-RTO end of the DR tier ladder. It sits between:
- Warm standby — secondary is running but scaled down; requires scale-up before full traffic can fail over.
- Active-active / stretch cluster — two datacenters serving the same workload with sync replication; RPO=0 at the cost of per-write cross-region RTT.
A hot-standby is async-replicated and full-scale — it can take over immediately (within client timeouts), with a bounded RPO (= replication lag).
Canonical instance¶
Redpanda Shadowing (25.3, 2025-11-06) is the first wiki instance of the pattern on the streaming-broker substrate.
Canonical verbatim from the sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more|25.3 launch post:
"Shadowing creates a fully functional, hot-standby clone of your entire Redpanda cluster — topics, configs, consumer group offsets, ACLs, schemas — the works!"
"When disaster strikes, you're not restoring from a day-old backup. You're failing over to a clone that's seconds behind production."
Shadowing composes hot-standby with offset preservation and broker-internal replication to deliver seconds-RPO / seconds-RTO without a Kafka Connect operational layer.
Why hot-standby over other DR shapes¶
Hot-standby is the right answer when all three hold:
- RPO budget is seconds, not minutes. Backup/restore and pilot-light fail this test.
- RTO budget is seconds, not minutes. Warm standby with scale-up fails this test (scale-up takes minutes).
- Latency-critical writes preclude sync replication. Stretch clusters fail this test when cross-region RTT exceeds the write SLA.
Hot-standby pays roughly 2× cluster cost (a full-scale secondary running alongside the primary but serving no production traffic) plus replication bandwidth, in exchange for continuous readiness. It's the most expensive DR shape short of active-active.
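The three-criteria checklist above can be sketched as a decision function. This is illustrative only: the 60-second threshold standing in for "seconds, not minutes" and the tier names are assumptions for the sketch, not Redpanda guidance.

```python
def choose_dr_tier(rpo_budget_s, rto_budget_s, cross_region_rtt_ms, write_sla_ms):
    """Pick a DR shape from the budgets discussed above.

    Illustrative sketch; the 60 s cutoff for "seconds, not minutes"
    is an assumption, not a product recommendation.
    """
    if cross_region_rtt_ms <= write_sla_ms:
        # Sync replication fits inside the write SLA: RPO=0 is affordable,
        # so a stretch / active-active shape is on the table.
        return "active-active / stretch cluster"
    if rpo_budget_s < 60 and rto_budget_s < 60:
        # Seconds-level RPO and RTO with latency-critical writes:
        # the async-replicated, full-scale hot standby.
        return "hot standby"
    if rto_budget_s < 3600:
        # Minutes of scale-up time are acceptable: warm standby.
        return "warm standby"
    return "pilot light / backup-restore"

# Example: 80 ms cross-region RTT against a 10 ms write SLA rules out
# sync replication; seconds-level budgets then select hot standby.
print(choose_dr_tier(rpo_budget_s=30, rto_budget_s=30,
                     cross_region_rtt_ms=80, write_sla_ms=10))
```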
Critical mechanics¶
A hot-standby that's merely running is not enough. The pattern works when the standby is:
- Fully replicated. All data, all configs, all ACLs, all schema registrations, all consumer-group offsets — not a data-only clone. At failover you don't want to wait for config recreation.
- Functional. The standby must be able to accept traffic immediately — same bootstrap-URL shape, same auth flow, same client API surface. No "promote the standby" intermediate step that adds minutes.
- Lag-monitored. Because the replication is async, the operator must continuously watch replication lag to know the current RPO. "Monitoring a shadow cluster in Redpanda Console" is how Shadowing surfaces this.
- DR-drill-capable. Test failover regularly so the switchover procedure is known good. Verbatim from the launch post: "create a shadow link, monitor lag and throughput on your shadow cluster, and run a DR drill."
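The lag-monitored mechanic amounts to: current RPO ≈ the age of the oldest record the standby has not yet applied. A minimal sketch, assuming per-partition watermarks and append timestamps are available from broker metrics; all function and variable names here are hypothetical, not the Redpanda Console API.

```python
import time

def estimate_rpo(primary_watermarks, shadow_watermarks, append_times, now=None):
    """Estimate the current RPO of an async-replicated hot standby.

    primary_watermarks / shadow_watermarks: {partition: next offset to write}.
    append_times: {partition: epoch seconds when the oldest not-yet-replicated
    primary record was appended}. Names are illustrative; a real deployment
    would read these from broker metrics (e.g. via Redpanda Console).
    """
    now = time.time() if now is None else now
    worst_lag_s = 0.0
    for part, hi in primary_watermarks.items():
        behind = hi - shadow_watermarks.get(part, 0)
        if behind > 0:
            # RPO on this partition = age of its oldest unreplicated record.
            worst_lag_s = max(worst_lag_s, now - append_times[part])
    return worst_lag_s

# Partition 1 is 500 offsets behind, with its oldest unreplicated record
# appended ~3 s ago, so the estimated RPO is about 3 seconds.
rpo = estimate_rpo({0: 1000, 1: 2000}, {0: 1000, 1: 1500},
                   {1: time.time() - 3.0})
```

The operational point is that lag is a continuously moving number, so the standby's effective RPO must be watched, not assumed.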
Anti-patterns¶
- Hot standby without config replication — you have the data but not the ACLs / schemas / consumer-group offsets. Failover means hours of reconfiguration. Shadowing specifically names "topics, configs, consumer group offsets, ACLs, schemas — the works!" as a full hot-standby.
- Hot standby without offset preservation — consumers can't resume at the same offsets; the failover mechanics add an offset-translation step. See concepts/offset-preserving-replication.
- Hot standby without regular drills — the standby drifts from the primary configuration over time; the assumed seconds-RTO silently becomes an hours-RTO.
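The offset-preservation anti-pattern can be made concrete with a toy model, assuming nothing beyond a lookup of replicated consumer-group commits; the group, topic, and offset values are invented for illustration.

```python
def resume_position(committed_offsets, group, topic, partition):
    """Where a consumer group restarts after failover, given whatever
    committed offsets exist on the standby (toy model; names illustrative)."""
    return committed_offsets.get((group, topic, partition))

# Hot standby WITH offset preservation (the Shadowing shape): the group's
# commits were replicated, so the consumer resumes exactly where it left off.
shadow_offsets = {("billing", "orders", 0): 4210}
assert resume_position(shadow_offsets, "billing", "orders", 0) == 4210

# WITHOUT offset preservation (e.g. a connector-based data-only copy):
# the standby has the records but no committed position, so failover
# needs an extra offset-translation step before consumers can resume.
bare_offsets = {}
assert resume_position(bare_offsets, "billing", "orders", 0) is None
```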
Seen in¶
- sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more — canonical wiki source on the streaming-broker substrate. Redpanda Shadowing provides the first wiki instance of a full hot-standby streaming cluster feature (topics, configs, offsets, ACLs, schemas) with broker-native offset preservation.
Related¶
- systems/redpanda-shadowing — canonical instance.
- systems/redpanda — the broker.
- systems/kafka — the upstream project that does not ship an equivalent hot-standby feature; MM2 is connector-based replication but not a full hot-standby clone.
- concepts/rpo-rto — the DR budget dimension hot-standby targets.
- concepts/asynchronous-replication — the replication mode that makes the shape affordable.
- concepts/offset-preserving-replication — the consumer-simplification property Redpanda adds on top.
- concepts/broker-internal-cross-cluster-replication — the architecture Redpanda uses to implement it.
- concepts/mirrormaker2-async-replication — the connector-based shape that's a partial hot-standby (data but not offsets).
- concepts/multi-region-stretch-cluster — the sync-replication alternative.
- concepts/disaster-recovery-tiers — the ladder hot-standby sits at the low-RPO / high-cost end of.
- patterns/offset-preserving-async-cross-region-replication — Redpanda Shadowing's specific composition.
- patterns/warm-standby-deployment — the lower-cost, scale-up-required neighbour.
- patterns/pilot-light-deployment — the even-lower-cost, cold-start neighbour.
- patterns/async-replication-for-cross-region — the broader replication-mode pattern family.