PATTERN Cited by 2 sources
Hot-standby cluster for DR¶
Pattern¶
Run a continuously-up secondary cluster in a different region / failure domain, receiving async replication from the primary. The secondary is functional — "a fully functional, hot-standby clone" — ready to accept reads and writes within seconds of failover.
Hot-standby is the high-availability / low-RPO / low-RTO end of the DR tier ladder. It sits between:
- Warm standby — secondary is running but scaled down; requires scale-up before full traffic can fail over.
- Active-active / stretch cluster — two datacenters serving the same workload with sync replication; RPO=0 at the cost of per-write cross-region RTT.
A hot-standby is async-replicated and full-scale — it can take over immediately (within client timeouts), with a bounded RPO (= replication lag).
Canonical instance¶
Redpanda Shadowing (25.3, 2025-11-06) is the first wiki instance of the pattern on the streaming-broker substrate.
Verbatim from the 25.3 launch post (sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more):
"Shadowing creates a fully functional, hot-standby clone of your entire Redpanda cluster — topics, configs, consumer group offsets, ACLs, schemas — the works!"
"When disaster strikes, you're not restoring from a day-old backup. You're failing over to a clone that's seconds behind production."
Shadowing composes hot-standby with offset preservation and broker-internal replication to deliver seconds-RPO / seconds-RTO without a Kafka Connect operational layer.
Why hot-standby over other DR shapes¶
Hot-standby is the right answer when all three hold:
- RPO budget is seconds, not minutes. Backup/restore and pilot-light fail this test.
- RTO budget is seconds, not minutes. Warm standby with scale-up fails this test (scale-up takes minutes).
- Latency-critical writes preclude sync replication. Stretch clusters fail this test when cross-region RTT exceeds the write SLA.
Hot-standby pays 2× cluster cost (idle secondary + primary) plus replication bandwidth, in exchange for continuous readiness. It's the most expensive DR shape short of active-active.
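The three conditions above can be read as a small decision procedure. The sketch below is one possible encoding, not a rule taken from the sources: the 60-second cutoff for "seconds, not minutes", the field names, and the returned labels are all illustrative assumptions.

```python
from dataclasses import dataclass

# Assumption: "seconds, not minutes" is read as "under one minute".
SECONDS_THRESHOLD_S = 60.0


@dataclass(frozen=True)
class DrRequirements:
    """Hypothetical inputs for picking a DR shape."""
    rpo_budget_s: float         # maximum tolerable data loss, in seconds
    rto_budget_s: float         # maximum tolerable downtime, in seconds
    write_sla_ms: float         # latency budget for a single write
    cross_region_rtt_ms: float  # measured round trip between the two regions


def choose_dr_shape(req: DrRequirements) -> str:
    """Return the cheapest shape that passes the three tests in the text."""
    if req.rpo_budget_s >= SECONDS_THRESHOLD_S:
        # A minutes-level RPO does not need continuous replication.
        return "backup-restore-or-pilot-light"
    if req.rto_budget_s >= SECONDS_THRESHOLD_S:
        # A minutes-level RTO leaves time to scale the secondary up.
        return "warm-standby"
    if req.cross_region_rtt_ms <= req.write_sla_ms:
        # Sync replication fits inside the write SLA, so RPO=0 is on the table.
        return "stretch-cluster"
    # Seconds RPO, seconds RTO, and sync replication too slow for writes.
    return "hot-standby"
```

For example, `choose_dr_shape(DrRequirements(5, 10, 20, 70))` returns `"hot-standby"`: both budgets are in seconds and a 70 ms round trip does not fit a 20 ms write SLA. A real decision would also weigh the 2× cluster cost, which this sketch ignores.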
Critical mechanics¶
A hot-standby that's merely running is not enough. The pattern works when the standby is:
- Fully replicated. All data, all configs, all ACLs, all schema registrations, all consumer-group offsets — not a data-only clone. At failover you don't want to wait for config recreation.
- Functional. The standby must be able to accept traffic immediately — same bootstrap-URL shape, same auth flow, same client API surface. No "promote the standby" intermediate step that adds minutes.
- Lag-monitored. Because the replication is async, the operator must continuously watch replication lag to know the current RPO. "Monitoring a shadow cluster in Redpanda Console" is how Shadowing surfaces this.
- DR-drill-capable. Test failover regularly so the switchover procedure is known good. Verbatim from the launch post: "create a shadow link, monitor lag and throughput on your shadow cluster, and run a DR drill."
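The lag-monitoring requirement can be illustrated with a minimal sketch that scans lag samples and flags the moment the current RPO has exceeded its budget for several samples in a row. How the samples are collected is left out on purpose; `LagSample`, `rpo_breaches`, and the `min_consecutive` debounce are hypothetical names and choices, not part of Redpanda Console or any Redpanda API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LagSample:
    """One hypothetical observation of how far the standby trails the primary."""
    timestamp_s: float
    lag_s: float


def rpo_breaches(
    samples: Iterable[LagSample],
    rpo_budget_s: float,
    min_consecutive: int = 3,
) -> List[float]:
    """Return timestamps where lag has exceeded the budget for
    `min_consecutive` samples in a row (one alert per sustained breach)."""
    alerts: List[float] = []
    streak = 0
    for sample in samples:
        if sample.lag_s > rpo_budget_s:
            streak += 1
            # Fire once, at the moment the streak becomes "sustained".
            if streak == min_consecutive:
                alerts.append(sample.timestamp_s)
        else:
            streak = 0
    return alerts
```

Requiring a run of consecutive breaches is a design choice that keeps a single slow replication batch from paging anyone; a stricter setup could set `min_consecutive=1`.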
Anti-patterns¶
- Hot standby without config replication — you have the data but not the ACLs / schemas / consumer-group offsets. Failover means hours of reconfiguration. Shadowing specifically names "topics, configs, consumer group offsets, ACLs, schemas — the works!" as a full hot-standby.
- Hot standby without offset preservation — consumers can't resume at the same offsets; the failover mechanics add an offset-translation step. See concepts/offset-preserving-replication.
- Hot standby without regular drills — the standby drifts from the primary configuration over time; the assumed seconds-RTO silently becomes an hours-RTO.
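One way to catch the first and third anti-patterns during a drill is to diff an inventory of the primary against the standby. The sketch below assumes both inventories have already been exported into nested dictionaries keyed by category (for instance "topics", "acls", "schemas"); the export step, the category names, and `find_drift` itself are hypothetical and not tied to any real tool.

```python
from typing import Any, Dict, List

Inventory = Dict[str, Dict[str, Any]]  # category -> item name -> item settings


def find_drift(primary: Inventory, standby: Inventory) -> Dict[str, Dict[str, List[str]]]:
    """Report items that are missing on, extra on, or different on the standby."""
    drift: Dict[str, Dict[str, List[str]]] = {}
    for category in sorted(set(primary) | set(standby)):
        p_items = primary.get(category, {})
        s_items = standby.get(category, {})
        missing = sorted(set(p_items) - set(s_items))
        extra = sorted(set(s_items) - set(p_items))
        changed = sorted(
            name for name in set(p_items) & set(s_items)
            if p_items[name] != s_items[name]
        )
        if missing or extra or changed:
            drift[category] = {"missing": missing, "extra": extra, "changed": changed}
    return drift


# Illustrative data only.
primary = {
    "topics": {"orders": {"partitions": 12}, "payments": {"partitions": 6}},
    "acls": {"svc-orders": "read,write"},
}
standby = {
    "topics": {"orders": {"partitions": 8}},
    "acls": {"svc-orders": "read,write"},
}
print(find_drift(primary, standby))
```

An empty result means the two inventories match for every category that was exported; it says nothing about categories that were never collected.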
Seen in¶
- sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more — canonical wiki source on the streaming-broker substrate. Redpanda Shadowing provides the first wiki instance of a full hot-standby streaming cluster feature (topics, configs, offsets, ACLs, schemas) with broker-native offset preservation.
- sources/2026-04-21-redpanda-me-and-my-shadow-link-disaster-recovery-replication-made-easy — Shadow Linking mechanism + performance deep-dive. Scale-validates the pattern at 2.5 GiB/s / 2.5 M msg/s / <10k msg lag / ~4 ms RPO. Also introduces two refinements: (1) per-topic failover granularity — the DR primitive is not only failover(link) but also failover(topic, link), matching app-level outages with app-level failover scope; and (2) reciprocal active-passive via two parallel shadow links, turning the normally-idle secondary into a productive cluster for its own topic family. Per-topic failover composes with always-be-failing-over drill discipline to make DR exercises operationally feasible at much higher cadence than whole-link drills.
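The scale figures quoted in the second source are consistent with each other if message-count lag is converted to time by dividing by the produce rate. The helper below is a minimal sketch of that arithmetic, assuming a steady produce rate; the function name is illustrative.

```python
def lag_to_rpo_ms(lag_messages: int, produce_rate_msg_per_s: float) -> float:
    """Convert a message-count lag into an approximate time-based RPO.

    Assumes the produce rate is steady over the lag window.
    """
    if produce_rate_msg_per_s <= 0:
        raise ValueError("produce rate must be positive")
    return lag_messages * 1000 / produce_rate_msg_per_s


# 10k messages of lag at 2.5 M msg/s.
print(lag_to_rpo_ms(10_000, 2_500_000))  # → 4.0
```

Since the source reports lag below 10k messages, 4 ms is an upper bound under this model, which matches the quoted "~4 ms RPO".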
Related¶
- systems/redpanda-shadowing — canonical instance.
- systems/redpanda — the broker.
- systems/kafka — the upstream project that does not ship an equivalent hot-standby feature; MirrorMaker 2 (MM2) offers connector-based replication but not a full hot-standby clone.
- concepts/rpo-rto — the DR budget dimension hot-standby targets.
- concepts/asynchronous-replication — the replication mode that makes the shape affordable.
- concepts/offset-preserving-replication — the consumer-simplification property Redpanda adds on top.
- concepts/broker-internal-cross-cluster-replication — the architecture Redpanda uses to implement it.
- concepts/mirrormaker2-async-replication — the connector-based shape that's a partial hot-standby (data but not offsets).
- concepts/multi-region-stretch-cluster — the sync-replication alternative.
- concepts/disaster-recovery-tiers — the ladder hot-standby sits at the low-RPO / high-cost end of.
- patterns/offset-preserving-async-cross-region-replication — Redpanda Shadowing's specific composition.
- patterns/warm-standby-deployment — the lower-cost, scale-up-required neighbour.
- patterns/pilot-light-deployment — the even-lower-cost, cold-start neighbour.
- patterns/async-replication-for-cross-region — the broader replication-mode pattern family.
- patterns/topic-level-granular-dr-failover — the refinement that exposes per-topic failover granularity on top of the hot-standby shape.
- patterns/reciprocal-active-passive-via-parallel-shadow-links — the refinement that gets both clusters doing real work.
- patterns/always-be-failing-over-drill — the DR-discipline pattern per-topic failover enables at higher cadence.
- concepts/per-topic-granularity-failover — the underlying primitive.
- concepts/replication-lag-message-count — the native-to-broker lag measurement dimension.
- concepts/reciprocal-active-passive-clusters — the two-cluster architecture.