# Per-topic granularity failover
## Definition
Per-topic granularity failover is the property of a cross-cluster disaster-recovery mechanism where failover can be invoked for an individual topic (promoting its shadow to writable) without failing over the rest of the link. The DR primitive exposes both:
```
failover(link)         # whole-link failover — region-level DR
failover(topic, link)  # per-topic failover — app-level DR
```
The operator picks the granularity matching the outage scope: a whole-link failover for a region-level outage, a per-topic failover for an app-level outage affecting one topic family.
## Canonical wiki source
Introduced by the 2026-04-21 Redpanda Shadow Linking deep-dive:
"When you failover a link, either by topic or entirely, the replication flows stop and the linked topics will become writable to regular producers."
"Keep in mind that if you have an app-level outage, you don't need to failover the whole link — just failover individual topics as needed."
## Why per-topic granularity matters
### Outage scope rarely matches cluster scope
Real outages are rarely whole-region. Common shapes:
- App-level outage — one service's topic family is broken (producer crashed, schema incompatibility, poisoned message) while every other service on the cluster is fine. The source cluster is operational; the application feeding a specific topic family is not.
- Topic-family-specific operational issue — one set of topics is unavailable due to corruption, a cleanup-policy misconfiguration, or an access-control change on the source.
- Region-level outage — the whole source cluster is unreachable.
A DR primitive that only supports whole-link failover forces every outage through the biggest-blast-radius tool. An app-level outage then fails over the whole cluster even though 99% of topics are fine, inflating the blast radius from one topic family to every topic on the link and turning an app-level incident into a regional one.
### Two outage shapes → two failover tools
Per-topic granularity failover lets the operator match tool to problem:
| Outage type | Failover tool | Blast radius |
|---|---|---|
| Region-level (source cluster unreachable) | `failover(link)` | All topics on the link |
| App-level (one topic family broken) | `failover(topic, link)` | One topic's producers and consumers |
| Corrupted-schema / poisoned-topic | `failover(topic, link)` | One topic |
The whole-link failover is the escalation, not the default. The per-topic primitive keeps small incidents small.
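A minimal sketch of that tool-matching logic, in Python; `failover_link` and `failover_topic` are hypothetical stand-ins for the platform's actual DR calls, not a real Redpanda API:

```python
from enum import Enum, auto

class OutageScope(Enum):
    REGION = auto()  # source cluster unreachable
    APP = auto()     # one topic family broken; the rest of the cluster is healthy

# Placeholders for whatever the platform's DR calls actually are.
def failover_link(link: str) -> None: ...
def failover_topic(topic: str, link: str) -> None: ...

def run_failover(scope: OutageScope, link: str, affected_topics: list[str]) -> None:
    """Pick the smallest tool that covers the outage."""
    if scope is OutageScope.REGION:
        # Escalation path: promote every shadow topic on the link at once.
        failover_link(link)
    else:
        # Default path: promote only the affected topic family; every other
        # replication flow on the link keeps running untouched.
        for topic in affected_topics:
            failover_topic(topic, link)
```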
### DR drills become routine, not scary
Testing whole-link failover requires a sanctioned downtime window and co-ordination across every team writing to or reading from any topic on the cluster. Testing per-topic failover requires only co-ordination with the team owning that topic family. This shrinks the scheduling and approval overhead for DR drills by an order of magnitude and composes with the always-be-failing-over discipline: regular small per-topic drills accumulate into high confidence in the whole-link failover path without ever needing a regional-scale drill.
The 2026-04-21 post connects this explicitly to the "failover isn't something to fear, but something that can become routine" framing that follows the Shadow Linking simplicity argument.
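A sketch of what such a routine drill loop might look like, assuming hypothetical `failover_topic`, `fail_back_topic`, and `run_smoke_checks` helpers, and assuming the platform can re-establish a replication flow after a drill:

```python
import random

# Hypothetical stand-ins for the platform's DR and verification calls.
def failover_topic(topic: str, link: str) -> None: ...
def fail_back_topic(topic: str, link: str) -> None: ...  # assumes flows can be re-established
def run_smoke_checks(topic: str) -> None: ...            # owning team's producer/consumer checks

DRILL_CANDIDATES = ["orders-events", "payments-events", "inventory-events"]

def per_topic_drill(link: str) -> None:
    """One routine drill: fail over a single topic, verify clients, fail back.

    Coordination cost is one owning team, not every team on the cluster,
    so this can run on a regular schedule rather than in a sanctioned window.
    """
    topic = random.choice(DRILL_CANDIDATES)
    failover_topic(topic, link)
    run_smoke_checks(topic)
    fail_back_topic(topic, link)
```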
## Consumer-side semantics
When a single topic fails over:
- Its shadow copy on the destination cluster becomes writable.
- Producers for that topic reconfigure to write to the destination cluster.
- Consumers for that topic reconfigure to read from the destination cluster (offset-preserved, so they resume at the same committed offset — see concepts/offset-preserving-replication).
- All other topics on the link stay put — their producers and consumers continue to use the source cluster unchanged.
This means the client fleet ends up temporarily split across two clusters: the failed-over topic's clients point at the destination, everything else points at the source. This is fine for topic-independent workloads but creates a connection-count and credential-rotation overhead proportional to the number of topics failed over independently.
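An illustration of the consumer-side repoint, using the confluent-kafka Python client; the cluster addresses, topic name, and group id are made up for the example:

```python
from confluent_kafka import Consumer  # assumes the confluent-kafka client

SOURCE_BOOTSTRAP = "source-cluster.example:9092"  # assumed addresses
DEST_BOOTSTRAP = "dest-cluster.example:9092"

def make_consumer(bootstrap_servers: str) -> Consumer:
    # The same group.id is used against both clusters: with offset-preserving
    # replication the destination already holds this group's committed offsets,
    # so the repointed consumer resumes where it left off.
    return Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": "orders-processor",
        # Fail loudly instead of silently rewinding if offsets were not preserved.
        "auto.offset.reset": "error",
    })

# Normal operation: this topic's consumer reads from the source cluster.
consumer = make_consumer(SOURCE_BOOTSTRAP)
consumer.subscribe(["orders-events"])

# After a per-topic failover of orders-events: repoint only this consumer;
# consumers of every other topic on the link keep their source connections.
consumer.close()
consumer = make_consumer(DEST_BOOTSTRAP)
consumer.subscribe(["orders-events"])  # resumes at the replicated committed offset
```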
## Relationship to link-deletion safety
The 2026-04-21 post pairs per-topic failover with a link-deletion guardrail:
"You can only delete a shadow link once all of the flows are failed over and there are no active replication flows. This is A Good Thing™."
The guardrail composes with per-topic failover: the operator can fail over topics one at a time, leaving others replicating normally; only once every topic has been failed over (or the remaining flows are explicitly inactive) can the link be deleted. This prevents the common operator-error shape where the cleanup of a DR setup races with in-flight replication.
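A toy sketch of the guardrail's check, with illustrative flow-state names rather than the actual Shadow Linking state machine:

```python
# Illustrative flow states; not the actual Shadow Linking state machine.
ACTIVE_STATES = {"replicating", "paused"}

def can_delete_link(flow_states: dict[str, str]) -> bool:
    """A link is deletable only when no per-topic flow is still live."""
    return all(state not in ACTIVE_STATES for state in flow_states.values())

# One topic still replicating blocks deletion of the whole link:
assert can_delete_link({"orders": "failed_over", "payments": "failed_over"})
assert not can_delete_link({"orders": "failed_over", "payments": "replicating"})
```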
## Generalisation
Per-topic failover granularity is a specific instance of the broader sub-cluster DR granularity property. Other shapes that admit the same refinement:
- Per-shard failover in a sharded database — fail over one shard while the others stay on the primary region.
- Per-tenant failover in a multi-tenant SaaS — fail over one tenant's data without affecting others.
- Per-namespace failover in a Kubernetes/Envoy-style service mesh — move one namespace's traffic to the DR region.
The common property is that the DR primitive exposes the same unit the outage taxonomy uses. Topic is the natural unit for streaming systems because application teams own topic families and app-level outages are topic-scoped.
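That property can be written down as an interface shape; a sketch, with hypothetical method names:

```python
from typing import Protocol

class GranularDR(Protocol):
    """The DR primitive exposes the same unit the outage taxonomy uses:
    topic for streaming systems, shard for sharded databases, tenant for
    multi-tenant SaaS, namespace for a service mesh."""

    def failover_all(self) -> None:
        """Whole-scope failover: the region-level escalation."""
        ...

    def failover_unit(self, unit: str) -> None:
        """Single-unit failover: the app-level default."""
        ...
```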
## Seen in
- sources/2026-04-21-redpanda-me-and-my-shadow-link-disaster-recovery-replication-made-easy — canonical wiki source. Introduces per-topic failover as a first-class primitive alongside whole-link failover in Redpanda Shadow Linking, with the "app-level outage → fail topic only" framing and the link-deletion guardrail that composes with it.
## Related
- systems/redpanda-shadowing — the canonical wiki instance.
- systems/redpanda — the broker.
- systems/kafka — the wire protocol.
- concepts/rpo-rto — the DR budget per-topic failover contributes to (shrinking the mean blast radius of a failover event).
- concepts/blast-radius — the property per-topic failover minimises for app-level outages.
- concepts/offset-preserving-replication — the property that makes per-topic failover operationally seamless for consumers.
- patterns/topic-level-granular-dr-failover — the canonical pattern.
- patterns/hot-standby-cluster-for-dr — the parent DR-pattern family.
- patterns/always-be-failing-over-drill — the DR-discipline pattern per-topic granularity composes with.
- patterns/offset-preserving-async-cross-region-replication — the underlying DR pattern.
- companies/redpanda — the company.