
Per-topic granularity failover

Definition

Per-topic granularity failover is the property of a cross-cluster disaster-recovery mechanism where failover can be invoked for an individual topic (promoting its shadow to writable) without failing over the rest of the link. The DR primitive exposes both:

failover(link)             # whole-link failover — region-level DR
failover(topic, link)      # per-topic failover — app-level DR

The operator picks the granularity matching the outage scope: a whole-link failover for a region-level outage, a per-topic failover for an app-level outage affecting one topic family.

Canonical wiki source

Introduced by the 2026-04-21 Redpanda Shadow Linking deep-dive:

"When you failover a link, either by topic or entirely, the replication flows stop and the linked topics will become writable to regular producers."

"Keep in mind that if you have an app-level outage, you don't need to failover the whole link — just failover individual topics as needed."

Why per-topic granularity matters

Outage scope rarely matches cluster scope

Real outages are rarely whole-region. Common shapes:

  • App-level outage — one service's topic family is broken (producer crashed, schema incompatibility, poisoned message) while every other service on the cluster is fine. The source cluster is operational; the application feeding a specific topic family is not.
  • Topic-family-specific operational issue — one set of topics is unavailable due to a corruption, cleanup-policy misconfiguration, or access-control change on the source.
  • Region-level outage — the whole source cluster is unreachable.

A DR primitive that only supports whole-link failover forces every outage through the biggest-blast-radius tool. An app-level outage then fails over the whole cluster — even though the vast majority of topics are fine — needlessly widening the blast radius and escalating an app-level incident into something that feels like a regional one.

Two outage shapes → two failover tools

Per-topic granularity failover lets the operator match tool to problem:

| Outage type | Failover tool | Blast radius |
|---|---|---|
| Region-level (source cluster unreachable) | failover(link) | All topics on the link |
| App-level (one topic family broken) | failover(topic, link) | One topic's producers and consumers |
| Corrupted-schema / poisoned-topic | failover(topic, link) | One topic |

The whole-link failover is the escalation, not the default. The per-topic primitive keeps small incidents small.
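That tool-to-problem dispatch can be sketched as a small routine. This is illustrative only: the admin-call names (failover_link, failover_topic) and the RecordingAdmin test double are assumptions, not the real Shadow Linking API.

```python
# Hypothetical sketch: dispatch an outage to the matching failover
# granularity. Admin API names are illustrative, not Redpanda's.

def handle_outage(admin, link, scope, topics=()):
    """Pick the smallest failover that covers the outage scope."""
    if scope == "region":
        # Source cluster unreachable: fail over every flow on the link.
        admin.failover_link(link)
        return f"whole-link failover of {link}"
    if scope == "app":
        # One topic family is broken: keep every other flow replicating.
        for topic in topics:
            admin.failover_topic(topic, link)
        return f"per-topic failover of {sorted(topics)} on {link}"
    raise ValueError(f"unknown outage scope: {scope!r}")


class RecordingAdmin:
    """Test double that records which failovers were invoked."""
    def __init__(self):
        self.calls = []
    def failover_link(self, link):
        self.calls.append(("link", link))
    def failover_topic(self, topic, link):
        self.calls.append(("topic", topic, link))


admin = RecordingAdmin()
handle_outage(admin, "dr-link", "app", topics=["orders.events"])
print(admin.calls)  # only the broken topic family failed over
```

The default path is the per-topic branch; the whole-link branch is reached only when the outage scope is genuinely regional.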

DR drills become routine, not scary

Testing whole-link failover requires a sanctioned downtime window and co-ordination across every team writing to or reading from any topic on the cluster. Testing per-topic failover requires only co-ordination with the team owning that topic family. This shrinks the scheduling and approval overhead for DR drills by an order of magnitude and composes with the always-be-failing-over discipline: regular small per-topic drills accumulate into high confidence in the whole-link failover path without ever needing a regional-scale drill.

The 2026-04-21 post connects this explicitly to the "failover isn't something to fear, but something that can become routine" framing that follows the Shadow Linking simplicity argument.

Consumer-side semantics

When a single topic fails over:

  1. Its shadow copy on the destination cluster becomes writable.
  2. Producers for that topic reconfigure to write to the destination cluster.
  3. Consumers for that topic reconfigure to read from the destination cluster (offset-preserved, so they resume at the same committed offset — see concepts/offset-preserving-replication).
  4. All other topics on the link stay put — their producers and consumers continue to use the source cluster unchanged.

This means the client fleet ends up temporarily split across two clusters: the failed-over topic's clients point at the destination, everything else points at the source. This is fine for topic-independent workloads but creates a connection-count and credential-rotation overhead proportional to the number of topics failed over independently.
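The resulting fleet split can be modelled as a per-topic routing table. A toy sketch under stated assumptions — real clients reload bootstrap configuration rather than consult a dict, and the cluster addresses are invented:

```python
# Toy model of steps 1-4: after a per-topic failover, only that
# topic's producers/consumers repoint at the destination cluster.

SOURCE = "source.cluster:9092"   # illustrative address
DEST = "dest.cluster:9092"       # illustrative address

def build_routing(all_topics, failed_over):
    """Map each topic to the cluster its clients should use."""
    return {t: (DEST if t in failed_over else SOURCE) for t in all_topics}

topics = ["orders", "payments", "inventory", "audit"]
routing = build_routing(topics, failed_over={"payments"})

# Step 4: every other topic stays on the source cluster unchanged.
assert routing["payments"] == DEST
assert all(routing[t] == SOURCE for t in topics if t != "payments")

# The fleet is now split across two clusters; each extra cluster a
# client talks to adds connection and credential overhead.
print(sorted(set(routing.values())))
```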

The 2026-04-21 post pairs per-topic failover with a link-deletion guardrail:

"You can only delete a shadow link once all of the flows are failed over and there are no active replication flows. This is A Good Thing™."

The guardrail composes with per-topic failover: the operator can fail over topics one at a time, leaving others replicating normally; only once every topic has been failed over (or the remaining flows are explicitly inactive) can the link be deleted. This prevents the common operator-error shape where the cleanup of a DR setup races with in-flight replication.
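A minimal sketch of how that guardrail composes with per-topic failover, assuming each topic carries one replication flow that is "active" until its topic is failed over (class and state names are invented for illustration):

```python
# Sketch of the link-deletion guardrail: deletion is refused while
# any replication flow on the link is still active.

class ShadowLink:
    def __init__(self, topics):
        # Each shadowed topic starts with an active replication flow.
        self.flows = {t: "active" for t in topics}

    def failover_topic(self, topic):
        self.flows[topic] = "failed_over"

    def failover_link(self):
        for t in self.flows:
            self.flows[t] = "failed_over"

    def delete(self):
        # Guardrail: cleanup must not race with in-flight replication.
        active = [t for t, s in self.flows.items() if s == "active"]
        if active:
            raise RuntimeError(f"active flows remain: {active}")
        return "deleted"


link = ShadowLink(["orders", "payments"])
link.failover_topic("orders")
try:
    link.delete()               # payments is still replicating
except RuntimeError as e:
    print(e)
link.failover_topic("payments")
print(link.delete())            # all flows failed over, so deletion succeeds
```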

Generalisation

Per-topic failover granularity is a specific instance of the broader sub-cluster DR granularity property. Other shapes that admit the same refinement:

  • Per-shard failover in a sharded database — fail over one shard while the others stay on the primary region.
  • Per-tenant failover in a multi-tenant SaaS — fail over one tenant's data without affecting others.
  • Per-namespace failover in a Kubernetes/Envoy-style service mesh — move one namespace's traffic to the DR region.

The common property is: the DR primitive exposes the same unit the outage taxonomy uses. Topic is the natural unit for streaming systems because application teams own topic families and app-level outages are topic-scoped.
