
PATTERN Cited by 1 source

Topic-level granular DR failover

Pattern

Expose DR failover at both link granularity and per-topic granularity, so operators can match the failover tool's blast radius to the outage's blast radius:

failover(link)              # whole-link: region-level outage
failover(topic, link)       # per-topic: app-level outage

The pattern is a specialisation of hot-standby cluster for DR that refines the failover primitive's granularity. Instead of one tool with one blast radius, the DR substrate offers a hierarchy of failover scopes matching the outage taxonomy.

Canonicalised by the 2026-04-21 Redpanda Shadow Linking deep-dive:

"When you failover a link, either by topic or entirely, the replication flows stop and the linked topics will become writable to regular producers."

"Keep in mind that if you have an app-level outage, you don't need to failover the whole link — just failover individual topics as needed."

Problem

Outages rarely happen at cluster scope. Common shapes:

  • App-level outage — one service's topic family is broken (poisoned message, schema incompatibility, producer crash); the rest of the cluster is fine.
  • Topic-family operational issue — one topic's configuration or data is corrupted; surrounding topics are unaffected.
  • Region-level outage — whole source cluster unreachable.

A DR mechanism whose only failover primitive is whole-link forces every outage through the same tool:

  1. A small app-level outage triggers a whole-cluster failover (over-reaction).
  2. The blast radius of the failover is cluster-wide — every producer and consumer has to reconfigure.
  3. Recovery back to the primary region after the app fix is another whole-cluster operation.

The mismatch between outage scope (app-level) and failover scope (cluster-level) turns small incidents into large ones.

Solution

Expose failover as a granular operation whose scope matches the unit of work the application teams reason about:

  • Per-topic failover for app-level and topic-specific outages.
  • Whole-link failover for region-level outages.

Each primitive fails over exactly the affected scope and leaves everything else running. Producers and consumers for unaffected topics notice nothing.
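
A minimal sketch of what the two entry points could look like, assuming a hypothetical ShadowLinkAdmin wrapper over the destination cluster's admin API (the class and method names here are illustrative, not Redpanda's actual API):

    class ShadowLinkAdmin:
        """Hypothetical wrapper over the destination cluster's admin API."""

        def failover_link(self, link: str) -> None:
            # Region-level outage: fail over every topic replicated on the link.
            for topic in self.list_shadow_topics(link):
                self.failover_topic(link, topic)

        def failover_topic(self, link: str, topic: str) -> None:
            # App-level outage: stop only this topic's replication flow and
            # promote only this topic to writable; other flows keep running.
            self.stop_replication_flow(link, topic)
            self.promote_to_writable(topic)

        def list_shadow_topics(self, link: str) -> list[str]:
            ...  # query the destination cluster for the link's shadow topics

        def stop_replication_flow(self, link: str, topic: str) -> None:
            ...  # admin call: stop fetching this topic from the source

        def promote_to_writable(self, topic: str) -> None:
            ...  # admin call: let regular producers write to the topic

In this framing, whole-link failover is just per-topic failover applied to every flow on the link; the two scopes differ only in blast radius.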

Mechanics

Per-topic failover works because:

  1. Each topic on a shadow link has independent replication state (its own offsets being replicated, its own lag).
  2. Failing over one topic means stopping only that topic's replication flow and promoting only that topic to writable on the destination.
  3. Other topics on the same shadow link continue replicating normally.

The underlying requirement is that the shadow-link mechanism supports per-topic state transitions, not only link-scoped transitions. Redpanda Shadow Linking does; some other replication shapes (e.g. a single batched replication stream with no per-topic demux) might not.
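
A sketch of the per-topic state this requires, using a toy LinkFlows structure (illustrative only, not Redpanda's internal data model):

    from enum import Enum, auto

    class FlowState(Enum):
        ACTIVE = auto()        # still replicating from the source topic
        FAILED_OVER = auto()   # flow stopped; topic promoted to writable here

    class LinkFlows:
        """Per-topic replication state on one shadow link."""

        def __init__(self, topics: list[str]) -> None:
            self.state = {t: FlowState.ACTIVE for t in topics}
            self.replicated_offset = {t: 0 for t in topics}   # independent per-topic lag

        def failover_topic(self, topic: str) -> None:
            # Touches exactly one entry; every other topic's flow keeps replicating.
            self.state[topic] = FlowState.FAILED_OVER

    flows = LinkFlows(["orders.events", "payments.events", "inventory.events"])
    flows.failover_topic("orders.events")
    # orders.events is FAILED_OVER; payments.events and inventory.events stay ACTIVE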

Outage indicator                                               Tool              Rationale
One topic's consumers are stuck, producers unaffected          failover(topic)   Consumer-side problem isolated to one topic; try the shadow for consumers
Schema incompatibility on one topic                            failover(topic)   Schema issues are topic-scoped
Poisoned message in one topic preventing consumer progress    failover(topic)   Isolated blast radius
Source cluster shows elevated error rates across many topics  failover(link)    Cluster-wide degradation
Source region is unreachable                                   failover(link)    Whole region gone
Planned source-cluster upgrade / migration                     failover(link)    Planned whole-cluster cutover
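
As a rough sketch of the table's decision rule (the outage category names and the helper function are made up for illustration, not part of any real tooling):

    TOPIC_SCOPED = {"stuck-consumers", "schema-incompatibility", "poisoned-message"}
    LINK_SCOPED = {"cluster-wide-errors", "region-unreachable", "planned-cutover"}

    def failover_scope(outage: str) -> str:
        if outage in TOPIC_SCOPED:
            return "failover(topic)"   # keep the blast radius to one topic family
        if outage in LINK_SCOPED:
            return "failover(link)"    # the whole source cluster is the problem
        raise ValueError(f"unclassified outage: {outage!r}")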

DR-drill composition

Per-topic failover makes DR drills substantially cheaper to run:

  • Whole-link drill: requires coordination with every team using any topic on the cluster, a sanctioned downtime window, a rollback procedure for each team's consumers.
  • Per-topic drill: requires coordination with one team for one topic family; the rest of the cluster is unaffected.

This composes with the always-be-failing-over drill discipline: regular small per-topic drills build confidence in the shadow-link + consumer-reconfiguration path per topic family, accumulating into high confidence in the whole-link failover path without ever needing a big-blast-radius drill.

The 2026-04-21 post connects this explicitly:

"This simplicity means that failover isn't something to fear, but something that can become routine. By practicing failover, teams can provide verifiable evidence of their disaster recovery readiness."

Per-topic granularity is what makes practicing operationally feasible at a cadence higher than "once a year if we're lucky".
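
A per-topic drill could be scripted along these lines; admin.failover_topic is the hypothetical primitive sketched earlier, and the confluent_kafka consumer is just one possible way to collect the evidence:

    import time
    from confluent_kafka import Consumer

    def per_topic_drill(admin, link: str, topic: str, destination_bootstrap: str) -> bool:
        # Fail over a single topic family, then collect verifiable evidence that a
        # consumer makes progress against the destination (shadow) cluster.
        admin.failover_topic(link, topic)

        consumer = Consumer({
            "bootstrap.servers": destination_bootstrap,
            "group.id": f"dr-drill-{topic}",
            "auto.offset.reset": "earliest",
        })
        consumer.subscribe([topic])

        deadline = time.time() + 60     # short evidence window for the drill report
        consumed = 0
        while time.time() < deadline:
            msg = consumer.poll(timeout=1.0)
            if msg is not None and msg.error() is None:
                consumed += 1
        consumer.close()
        return consumed > 0             # evidence: consumers progress on the shadow topic

Only the team owning the drilled topic family needs to be in the loop; nothing else on the cluster is touched.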

Consumer-side consequence

When a single topic fails over, the client fleet ends up temporarily split across two clusters: the failed-over topic's clients point at the destination, everything else points at the source. This is fine for topic-independent workloads but creates:

  • Extra connection overhead — clients that use multiple topics may need connections to both clusters.
  • Credential rotation overhead — if the two clusters use different auth materials, clients of the failed-over topic need to rotate.
  • Observability overhead — operators need to watch both clusters for failed-over-topic activity.

The overhead is proportional to the number of topics failed over independently. At the limit (every topic failed over individually), the overhead equals whole-link failover's overhead — but that scenario is exactly the one you'd use whole-link for.
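
For illustration, a client that reads two topics after one of them has failed over ends up holding a connection (and auth material) per cluster; the hostnames below are placeholders:

    from confluent_kafka import Consumer

    # orders.events has been failed over to the DR cluster; payments.events has not.
    TOPIC_BOOTSTRAP = {
        "orders.events":   "dr-cluster.example.com:9092",   # destination cluster
        "payments.events": "primary.example.com:9092",      # source cluster, untouched
    }

    consumers = {
        topic: Consumer({
            "bootstrap.servers": bootstrap,
            "group.id": "billing-service",
            # per-cluster credentials (sasl.username, ssl.ca.location, ...) would differ too
        })
        for topic, bootstrap in TOPIC_BOOTSTRAP.items()
    }
    for topic, consumer in consumers.items():
        consumer.subscribe([topic])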

The 2026-04-21 post pairs per-topic failover with a link-deletion guardrail:

"You can only delete a shadow link once all of the flows are failed over and there are no active replication flows."

The composition: the operator can fail over topics one at a time; only once every flow is failed-over or inactive can the link be deleted. This prevents the race where link cleanup interferes with in-flight replication of still-active topics.
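
A sketch of that guardrail as a precondition check (the flow-state strings and the helper are illustrative, not an actual API):

    def can_delete_link(flow_states: dict[str, str]) -> bool:
        # The link may only be torn down once no per-topic flow is still replicating.
        return all(state != "ACTIVE" for state in flow_states.values())

    flows = {"orders.events": "FAILED_OVER", "payments.events": "ACTIVE"}
    assert not can_delete_link(flows)    # cleanup blocked while a flow is live

    flows["payments.events"] = "FAILED_OVER"
    assert can_delete_link(flows)        # every flow failed over: safe to delete the link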

Generalisations

Topic-level granular DR failover is a specific instance of the broader sub-cluster DR granularity pattern:

  • Per-shard failover in a sharded DB.
  • Per-tenant failover in a multi-tenant SaaS.
  • Per-namespace failover in a service mesh.

The common property: the DR primitive's granularity matches the operational team's granularity of ownership. Teams own topic families (or shards, or tenants, or namespaces); DR tools that match that granularity keep small incidents small.

Seen in
