
Per-topic granularity failover

Definition

Per-topic granularity failover is the property of a cross-cluster disaster-recovery mechanism where failover can be invoked for an individual topic (promoting its shadow to writable) without failing over the rest of the link. The DR primitive exposes both:

failover(link)             # whole-link failover — region-level DR
failover(topic, link)      # per-topic failover — app-level DR

The operator picks the granularity matching the outage scope: a whole-link failover for a region-level outage, a per-topic failover for an app-level outage affecting one topic family.

Canonical wiki source

Introduced by the 2026-04-21 Redpanda Shadow Linking deep-dive:

"When you failover a link, either by topic or entirely, the replication flows stop and the linked topics will become writable to regular producers."

"Keep in mind that if you have an app-level outage, you don't need to failover the whole link — just failover individual topics as needed."

Why per-topic granularity matters

Outage scope rarely matches cluster scope

Real outages are rarely whole-region. Common shapes:

  • App-level outage — one service's topic family is broken (producer crashed, schema incompatibility, poisoned message) while every other service on the cluster is fine. The source cluster is operational; the application feeding a specific topic family is not.
  • Topic-family-specific operational issue — one set of topics is unavailable due to a corruption, cleanup-policy misconfiguration, or access-control change on the source.
  • Region-level outage — the whole source cluster is unreachable.

A DR primitive that only supports whole-link failover forces every outage through the biggest-blast-radius tool. An app-level outage then fails over the whole cluster — even though the vast majority of topics are fine — needlessly widening the blast radius and escalating an app-level incident into something that feels like a regional one.

Two outage shapes → two failover tools

Per-topic granularity failover lets the operator match tool to problem:

| Outage type | Failover tool | Blast radius |
|---|---|---|
| Region-level (source cluster unreachable) | failover(link) | All topics on the link |
| App-level (one topic family broken) | failover(topic, link) | One topic's producers and consumers |
| Corrupted-schema / poisoned-topic | failover(topic, link) | One topic |

The whole-link failover is the escalation, not the default. The per-topic primitive keeps small incidents small.
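That tool-to-problem dispatch can be sketched as a small routine. This is illustrative only: the admin-call names (failover_link, failover_topic) and the RecordingAdmin test double are assumptions, not the real Shadow Linking API.

```python
# Hypothetical sketch: dispatch an outage to the matching failover
# granularity. Admin API names are illustrative, not Redpanda's.

def handle_outage(admin, link, scope, topics=()):
    """Pick the smallest failover that covers the outage scope."""
    if scope == "region":
        # Source cluster unreachable: fail over every flow on the link.
        admin.failover_link(link)
        return f"whole-link failover of {link}"
    if scope == "app":
        # One topic family is broken: keep every other flow replicating.
        for topic in topics:
            admin.failover_topic(topic, link)
        return f"per-topic failover of {sorted(topics)} on {link}"
    raise ValueError(f"unknown outage scope: {scope!r}")


class RecordingAdmin:
    """Test double that records which failovers were invoked."""
    def __init__(self):
        self.calls = []
    def failover_link(self, link):
        self.calls.append(("link", link))
    def failover_topic(self, topic, link):
        self.calls.append(("topic", topic, link))


admin = RecordingAdmin()
handle_outage(admin, "dr-link", "app", topics=["orders.events"])
print(admin.calls)  # only the broken topic family failed over
```

The default path is the per-topic branch; the whole-link branch is reached only when the outage scope is genuinely regional.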

DR drills become routine, not scary

Testing whole-link failover requires a sanctioned downtime window and co-ordination across every team writing to or reading from any topic on the cluster. Testing per-topic failover requires only co-ordination with the team owning that topic family. This shrinks the scheduling and approval overhead for DR drills by an order of magnitude and composes with the always-be-failing-over discipline: regular small per-topic drills accumulate into high confidence in the whole-link failover path without ever needing a regional-scale drill.

The 2026-04-21 post connects this explicitly to the "failover isn't something to fear, but something that can become routine" framing that follows the Shadow Linking simplicity argument.

Consumer-side semantics

When a single topic fails over:

  1. Its shadow copy on the destination cluster becomes writable.
  2. Producers for that topic reconfigure to write to the destination cluster.
  3. Consumers for that topic reconfigure to read from the destination cluster (offset-preserved, so they resume at the same committed offset — see concepts/offset-preserving-replication).
  4. All other topics on the link stay put — their producers and consumers continue to use the source cluster unchanged.

This means the client fleet ends up temporarily split across two clusters: the failed-over topic's clients point at the destination, everything else points at the source. This is fine for topic-independent workloads but creates a connection-count and credential-rotation overhead proportional to the number of topics failed over independently.
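The resulting fleet split can be modelled as a per-topic routing table. A toy sketch under stated assumptions — real clients reload bootstrap configuration rather than consult a dict, and the cluster addresses are invented:

```python
# Toy model of steps 1-4: after a per-topic failover, only that
# topic's producers/consumers repoint at the destination cluster.

SOURCE = "source.cluster:9092"   # illustrative address
DEST = "dest.cluster:9092"       # illustrative address

def build_routing(all_topics, failed_over):
    """Map each topic to the cluster its clients should use."""
    return {t: (DEST if t in failed_over else SOURCE) for t in all_topics}

topics = ["orders", "payments", "inventory", "audit"]
routing = build_routing(topics, failed_over={"payments"})

# Step 4: every other topic stays on the source cluster unchanged.
assert routing["payments"] == DEST
assert all(routing[t] == SOURCE for t in topics if t != "payments")

# The fleet is now split across two clusters; each extra cluster a
# client talks to adds connection and credential overhead.
print(sorted(set(routing.values())))
```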

The 2026-04-21 post pairs per-topic failover with a link-deletion guardrail:

"You can only delete a shadow link once all of the flows are failed over and there are no active replication flows. This is A Good Thing™."

The guardrail composes with per-topic failover: the operator can fail over topics one at a time, leaving others replicating normally; only once every topic has been failed over (or the remaining flows are explicitly inactive) can the link be deleted. This prevents the common operator-error shape where the cleanup of a DR setup races with in-flight replication.
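A minimal sketch of how that guardrail composes with per-topic failover, assuming each topic carries one replication flow that is "active" until its topic is failed over (class and state names are invented for illustration):

```python
# Sketch of the link-deletion guardrail: deletion is refused while
# any replication flow on the link is still active.

class ShadowLink:
    def __init__(self, topics):
        # Each shadowed topic starts with an active replication flow.
        self.flows = {t: "active" for t in topics}

    def failover_topic(self, topic):
        self.flows[topic] = "failed_over"

    def failover_link(self):
        for t in self.flows:
            self.flows[t] = "failed_over"

    def delete(self):
        # Guardrail: cleanup must not race with in-flight replication.
        active = [t for t, s in self.flows.items() if s == "active"]
        if active:
            raise RuntimeError(f"active flows remain: {active}")
        return "deleted"


link = ShadowLink(["orders", "payments"])
link.failover_topic("orders")
try:
    link.delete()               # payments is still replicating
except RuntimeError as e:
    print(e)
link.failover_topic("payments")
print(link.delete())            # all flows failed over, so deletion succeeds
```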

Generalisation

Per-topic failover granularity is a specific instance of the broader sub-cluster DR granularity property. Other shapes that admit the same refinement:

  • Per-shard failover in a sharded database — fail over one shard while the others stay on the primary region.
  • Per-tenant failover in a multi-tenant SaaS — fail over one tenant's data without affecting others.
  • Per-namespace failover in a Kubernetes/Envoy-style service mesh — move one namespace's traffic to the DR region.

The common property is: the DR primitive exposes the same unit the outage taxonomy uses. Topic is the natural unit for streaming systems because application teams own topic families and app-level outages are topic-scoped.
