PATTERN
Topic-level granular DR failover¶
Pattern¶
Expose DR failover at both link granularity and per-topic granularity, so operators can match the failover tool's blast radius to the outage's blast radius:
```
failover(link)          # whole-link: region-level outage
failover(topic, link)   # per-topic:  app-level outage
```
The pattern is a specialisation of hot-standby cluster for DR that refines the failover primitive's granularity. Instead of one tool with one blast radius, the DR substrate offers a hierarchy of failover scopes matching the outage taxonomy.
Canonicalised by the 2026-04-21 Redpanda Shadow Linking deep-dive:
"When you failover a link, either by topic or entirely, the replication flows stop and the linked topics will become writable to regular producers."
"Keep in mind that if you have an app-level outage, you don't need to failover the whole link — just failover individual topics as needed."
Problem¶
Outages rarely happen at cluster scope. Common shapes:
- App-level outage — one service's topic family is broken (poisoned message, schema incompatibility, producer crash); the rest of the cluster is fine.
- Topic-family operational issue — one topic's configuration or data is corrupted; surrounding topics are unaffected.
- Region-level outage — whole source cluster unreachable.
A DR mechanism whose only failover primitive is whole-link forces every outage through the same tool:
- A small app-level outage triggers a whole-cluster failover (over-reaction).
- The blast radius of the failover is cluster-wide — every producer and consumer has to reconfigure.
- Recovery back to the primary region after the app fix is another whole-cluster operation.
The mismatch between outage scope (app-level) and failover scope (cluster-level) turns small incidents into large ones.
Solution¶
Expose failover as a granular operation whose scope matches the unit of work application teams reason about:
- Per-topic failover for app-level and topic-specific outages.
- Whole-link failover for region-level outages.
Each primitive fails over exactly the affected scope and leaves everything else running. Producers and consumers for unaffected topics notice nothing.
Mechanics¶
Per-topic failover works because:
- Each topic on a shadow link has independent replication state (its own offsets being replicated, its own lag).
- Failing over one topic means stopping only that topic's replication flow and promoting only that topic to writable on the destination.
- Other topics on the same shadow link continue replicating normally.
The underlying requirement is that the shadow-link mechanism supports per-topic state transitions, not only link-scoped transitions. Redpanda Shadow Linking does; some other replication shapes (e.g. a single batched replication stream with no per-topic demux) might not.
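That requirement can be made concrete with a minimal state-model sketch, assuming a hypothetical `ShadowLink` type carrying one `FlowState` per topic (the names are illustrative, not Redpanda's API):

```python
from enum import Enum, auto

class FlowState(Enum):
    REPLICATING = auto()   # read-only on the destination, copying from source
    FAILED_OVER = auto()   # replication stopped; writable on the destination

class ShadowLink:
    """Illustrative model: each topic on the link carries its own flow state."""

    def __init__(self, topics):
        self.flows = {t: FlowState.REPLICATING for t in topics}

    def failover_topic(self, topic):
        # Per-topic transition: stop one flow, promote one topic to writable.
        # Every other topic's flow keeps replicating untouched.
        self.flows[topic] = FlowState.FAILED_OVER

    def failover_link(self):
        # Link-scoped transition: the whole-link primitive is just the
        # per-topic primitive applied to every remaining flow.
        for topic in self.flows:
            self.flows[topic] = FlowState.FAILED_OVER

link = ShadowLink(["orders", "payments", "audit"])
link.failover_topic("orders")   # app-level outage: orders only
assert link.flows["payments"] is FlowState.REPLICATING
```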
When to use per-topic vs whole-link failover¶
| Outage indicator | Tool | Rationale |
|---|---|---|
| One topic's consumers are stuck, producers unaffected | `failover(topic)` | Consumer-side problem isolated to one topic; try the shadow for consumers |
| Schema incompatibility on one topic | `failover(topic)` | Schema issues are topic-scoped |
| Poisoned message in one topic preventing consumer progress | `failover(topic)` | Isolated blast radius |
| Source cluster shows elevated error rates across many topics | `failover(link)` | Cluster-wide degradation |
| Source region is unreachable | `failover(link)` | Whole region gone |
| Planned source-cluster upgrade / migration | `failover(link)` | Planned whole-cluster cutover |
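The table reduces to a small decision rule. A sketch, where the outage-shape labels are illustrative stand-ins for the indicators above, not a real incident schema:

```python
# Illustrative labels for the table's outage indicators.
TOPIC_SCOPED = {"stuck-consumers", "schema-incompatibility", "poisoned-message"}
LINK_SCOPED = {"cluster-wide-errors", "region-unreachable", "planned-cutover"}

def failover_scope(outage: str) -> str:
    """Match the failover primitive's blast radius to the outage's."""
    if outage in TOPIC_SCOPED:
        return "failover(topic)"
    if outage in LINK_SCOPED:
        return "failover(link)"
    raise ValueError(f"unclassified outage shape: {outage}")

assert failover_scope("poisoned-message") == "failover(topic)"
assert failover_scope("region-unreachable") == "failover(link)"
```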
DR-drill composition¶
Per-topic failover makes DR drills substantially cheaper to run:
- Whole-link drill: requires coordination with every team using any topic on the cluster, a sanctioned downtime window, and a rollback procedure for each team's consumers.
- Per-topic drill: requires coordination with one team for one topic family; the rest of the cluster is unaffected.
This composes with the always-be-failing-over drill discipline: regular small per-topic drills build confidence in the shadow-link + consumer-reconfiguration path for each topic family, accumulating into high confidence in the whole-link failover path without ever needing a big-blast-radius drill.
The 2026-04-21 post connects this explicitly:
"This simplicity means that failover isn't something to fear, but something that can become routine. By practicing failover, teams can provide verifiable evidence of their disaster recovery readiness."
Per-topic granularity is what makes practicing operationally feasible at a cadence higher than "once a year if we're lucky".
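As a sketch of what "routine" can mean in practice: a rotation that drills one topic family at a time, so each drill coordinates with exactly one owning team. The helper names and families here are hypothetical placeholders, not Redpanda tooling:

```python
TOPIC_FAMILIES = ["orders", "payments", "audit", "inventory"]

# Placeholder drill steps: in a real drill these would invoke the DR
# tooling and the owning team's verification checks.
def fail_over_topic(family): print(f"failover({family}, link)")
def verify_consumers(family): print(f"verify {family} consumers progress on the shadow")
def fail_back(family): print(f"restore replication for {family}")

def run_drill(family):
    fail_over_topic(family)    # blast radius: one topic family
    verify_consumers(family)   # one owning team confirms the path works
    fail_back(family)          # the rest of the cluster never noticed

for family in TOPIC_FAMILIES:  # one small drill per slot, cycling families
    run_drill(family)
```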
Consumer-side consequence¶
When a single topic fails over, the client fleet ends up temporarily split across two clusters: the failed-over topic's clients point at the destination, everything else points at the source. This is fine for topic-independent workloads but creates:
- Extra connection overhead — clients that use multiple topics may need connections to both clusters.
- Credential rotation overhead — if the two clusters use different auth materials, clients of the failed-over topic need to rotate.
- Observability overhead — operators need to watch both clusters for failed-over-topic activity.
The overhead is proportional to the number of topics failed over independently. At the limit (every topic failed over individually), the overhead equals whole-link failover's overhead — but that scenario is exactly the one you'd use whole-link for.
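A sketch of the resulting split from a client's point of view (the addresses, topic names, and routing map are assumptions for illustration):

```python
# Hypothetical cluster endpoints after failing over only "orders".
PRIMARY = {"bootstrap.servers": "primary.example.com:9092"}
SHADOW = {"bootstrap.servers": "shadow.example.com:9092"}

FAILED_OVER = {"orders"}

def cluster_for(topic: str) -> dict:
    """Route a topic's clients to whichever cluster currently owns writes."""
    return SHADOW if topic in FAILED_OVER else PRIMARY

# A client that uses both topics now holds connections (and credentials,
# and dashboards) for both clusters: the overhead listed above, growing
# with the number of independently failed-over topics.
for topic in ["orders", "payments"]:
    print(topic, "->", cluster_for(topic)["bootstrap.servers"])
```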
Link-deletion safety composes with per-topic failover¶
The 2026-04-21 post pairs per-topic failover with a link-deletion guardrail:
"You can only delete a shadow link once all of the flows are failed over and there are no active replication flows."
The composition: the operator can fail over topics one at a time; only once every flow is failed-over or inactive can the link be deleted. This prevents the race where link cleanup interferes with in-flight replication of still-active topics.
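A sketch of the guardrail's check, again using an illustrative flow-state model rather than Redpanda's API:

```python
from enum import Enum, auto

class FlowState(Enum):
    REPLICATING = auto()
    FAILED_OVER = auto()
    INACTIVE = auto()

def delete_link(flows: dict[str, FlowState]) -> str:
    # Deletion is legal only once no flow is still actively replicating,
    # so link cleanup cannot race in-flight replication of active topics.
    active = [t for t, s in flows.items() if s is FlowState.REPLICATING]
    if active:
        raise RuntimeError(f"link still has active replication flows: {active}")
    return "link deleted"

print(delete_link({"orders": FlowState.FAILED_OVER, "audit": FlowState.INACTIVE}))
```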
Generalisations¶
Topic-level granular DR failover is a specific instance of the broader sub-cluster DR granularity pattern:
- Per-shard failover in a sharded DB.
- Per-tenant failover in a multi-tenant SaaS.
- Per-namespace failover in a service mesh.
The common property: the DR primitive's granularity matches the operational team's granularity of ownership. Teams own topic families (or shards, or tenants, or namespaces); DR tools that match that granularity keep small incidents small.
Seen in¶
- sources/2026-04-21-redpanda-me-and-my-shadow-link-disaster-recovery-replication-made-easy — canonical wiki source. Introduces per-topic failover alongside whole-link failover in Redpanda Shadow Linking; names the app-level-outage-scope framing that motivates the sub-link granularity; links per-topic failover to the routine-DR-drill discipline and to the link-deletion guardrail.
Related¶
- systems/redpanda-shadowing — the canonical wiki instance.
- systems/redpanda — the broker.
- systems/kafka — the wire protocol.
- concepts/per-topic-granularity-failover — the canonical concept this pattern instantiates.
- concepts/blast-radius — the property per-topic failover minimises.
- concepts/rpo-rto — the DR budget this pattern refines the granularity of.
- concepts/offset-preserving-replication — the underlying property that makes per-topic failover seamless for consumers.
- patterns/hot-standby-cluster-for-dr — the parent pattern this refines.
- patterns/always-be-failing-over-drill — the DR-discipline pattern per-topic granularity enables at scale.
- patterns/offset-preserving-async-cross-region-replication — the underlying replication pattern.
- companies/redpanda — the company.