
Hot-standby cluster for DR

Pattern

Run a continuously running secondary cluster in a different region or failure domain, receiving async replication from the primary. The secondary is "a fully functional, hot-standby clone," ready to accept reads and writes within seconds of failover.

Hot-standby is the high-availability / low-RPO / low-RTO end of the DR tier ladder. It sits between:

  • Warm standby — secondary is running but scaled down; requires scale-up before full traffic can fail over.
  • Active-active / stretch cluster — two datacenters serving the same workload with sync replication; RPO=0 at the cost of per-write cross-region RTT.

A hot-standby is async-replicated and full-scale — it can take over immediately (within client timeouts), with a bounded RPO (= replication lag).
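The RPO bound above can be made concrete in a short sketch. This is illustrative code, not any vendor's API: with async replication, whatever hasn't reached the standby at failover time is lost, so worst-case RPO equals the current replication lag; sync replication acknowledges on both sides, so RPO is zero.

```python
from dataclasses import dataclass

@dataclass
class DrTier:
    name: str
    replication: str      # "async" or "sync"
    standby_full_scale: bool

def worst_case_rpo_seconds(tier: DrTier, replication_lag_s: float) -> float:
    """Async tiers can lose up to one replication-lag's worth of writes
    on failover; sync tiers acknowledge on both sides, so RPO is 0."""
    return 0.0 if tier.replication == "sync" else replication_lag_s

hot_standby = DrTier("hot-standby", "async", standby_full_scale=True)
stretch = DrTier("active-active", "sync", standby_full_scale=True)

assert worst_case_rpo_seconds(hot_standby, 3.5) == 3.5
assert worst_case_rpo_seconds(stretch, 3.5) == 0.0
```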

Canonical instance

Redpanda Shadowing (25.3, 2025-11-06) is the first wiki instance of the pattern on the streaming-broker substrate.

Verbatim from the sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more|25.3 launch post:

"Shadowing creates a fully functional, hot-standby clone of your entire Redpanda cluster — topics, configs, consumer group offsets, ACLs, schemas — the works!"

"When disaster strikes, you're not restoring from a day-old backup. You're failing over to a clone that's seconds behind production."

Shadowing composes hot-standby with offset preservation and broker-internal replication to deliver seconds-RPO / seconds-RTO without a Kafka Connect operational layer.

Why hot-standby over other DR shapes

Hot-standby is the right answer when all three hold:

  1. RPO budget is seconds, not minutes. Backup/restore and pilot-light fail this test.
  2. RTO budget is seconds, not minutes. Warm standby with scale-up fails this test (scale-up takes minutes).
  3. Latency-critical writes preclude sync replication. Stretch clusters fail this test when cross-region RTT exceeds the write SLA.

Hot-standby pays roughly 2× cluster cost (a full-scale secondary running alongside the primary) plus replication bandwidth, in exchange for continuous readiness. It's the most expensive DR shape short of active-active.
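The three tests above can be sketched as a decision function. All thresholds and names here are illustrative assumptions, not prescriptions: "seconds" is modeled as under a minute, and sync replication is assumed viable only when cross-region RTT fits inside the per-write SLA.

```python
def choose_dr_shape(rpo_budget_s: float, rto_budget_s: float,
                    write_sla_ms: float, cross_region_rtt_ms: float) -> str:
    """Illustrative DR-shape selection following the three tests:
    RPO budget, RTO budget, and whether sync replication fits the write SLA."""
    sync_fits_sla = cross_region_rtt_ms <= write_sla_ms
    if sync_fits_sla and rpo_budget_s == 0:
        return "active-active"   # RPO=0, paying cross-region RTT per write
    if rpo_budget_s < 60 and rto_budget_s < 60:
        return "hot-standby"     # seconds RPO/RTO: async, full-scale standby
    if rto_budget_s < 3600:
        return "warm-standby"    # minutes RTO: scale up at failover time
    return "backup-restore"      # hours RPO/RTO: cheapest tier

# Seconds-level budgets with an RTT that blows the write SLA -> hot-standby.
assert choose_dr_shape(5, 5, write_sla_ms=10, cross_region_rtt_ms=60) == "hot-standby"
```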

Critical mechanics

A hot-standby that's merely running is not enough. The pattern works when the standby is:

  • Fully replicated. All data, all configs, all ACLs, all schema registrations, all consumer-group offsets — not a data-only clone. At failover you don't want to wait for config recreation.
  • Functional. The standby must be able to accept traffic immediately — same bootstrap-URL shape, same auth flow, same client API surface. No "promote the standby" intermediate step that adds minutes.
  • Lag-monitored. Because the replication is async, the operator must continuously watch replication lag to know the current RPO. "Monitoring a shadow cluster in Redpanda Console" is how Shadowing surfaces this.
  • DR-drill-capable. Test failover regularly so the switchover procedure is known good. Verbatim from the launch post: "create a shadow link, monitor lag and throughput on your shadow cluster, and run a DR drill."
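The lag-monitored requirement reduces to a simple invariant: the current RPO is the age of the newest record the shadow has applied, and an alert should fire when that age eats the whole RPO budget. A minimal sketch, with illustrative function names (this is not the Redpanda Console API):

```python
import time
from typing import Optional

def current_rpo_seconds(last_replicated_ts: float,
                        now: Optional[float] = None) -> float:
    """Async replication: current RPO = age of the newest record the
    standby has applied. This is what continuous lag monitoring surfaces."""
    now = time.time() if now is None else now
    return max(0.0, now - last_replicated_ts)

def lag_alert(last_replicated_ts: float, rpo_budget_s: float,
              now: float) -> bool:
    """True when replication lag has consumed the entire RPO budget."""
    return current_rpo_seconds(last_replicated_ts, now) > rpo_budget_s

assert current_rpo_seconds(1000.0, now=1005.0) == 5.0      # 5 s behind
assert lag_alert(1000.0, rpo_budget_s=30.0, now=1045.0)    # 45 s > 30 s budget
```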

Anti-patterns

  • Hot standby without config replication — you have the data but not the ACLs / schemas / consumer-group offsets. Failover means hours of reconfiguration. Shadowing specifically names "topics, configs, consumer group offsets, ACLs, schemas — the works!" as a full hot-standby.
  • Hot standby without offset preservation — consumers can't resume at the same offsets; the failover mechanics add an offset-translation step. See concepts/offset-preserving-replication.
  • Hot standby without regular drills — the standby drifts from the primary configuration over time; the assumed seconds-RTO silently becomes an hours-RTO.
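The offset-preservation anti-pattern can be shown with toy in-memory logs (plain dicts, no broker involved). When the standby assigns its own offsets, a consumer's committed offset from the primary points at the wrong record and must pass through a translation table; with preserved offsets, the consumer resumes verbatim.

```python
# offset -> record; the consumer has committed through offset 2 on the primary.
primary_log = {0: "a", 1: "b", 2: "c", 3: "d"}
committed = 2

# Offset-preserving replica: identical offsets, so resume directly.
preserving_replica = dict(primary_log)
assert preserving_replica[committed] == primary_log[committed]

# Non-preserving replica: same records, re-numbered on the standby side.
non_preserving_replica = {10: "a", 11: "b", 12: "c", 13: "d"}

# Without preservation, failover needs an extra offset-translation step.
translation = dict(zip(primary_log, non_preserving_replica))
assert non_preserving_replica[translation[committed]] == primary_log[committed]
```

The second half is exactly the step that offset-preserving replication removes from the failover path.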

Seen in
