Redpanda Shadowing

Shadowing is a Redpanda 25.3 feature (2025-11-06 preview) that creates a byte-for-byte, offset-preserving, hot-standby clone of a source Redpanda cluster in a different region. Shadowing is positioned as Redpanda's first-party answer to cross-region disaster recovery for the streaming-broker substrate, replacing the connector-mediated shape of MirrorMaker2 and the prior Redpanda Migrator.

Canonical verbatim from the 25.3 launch post (sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more):

"Shadowing creates a fully functional, hot-standby clone of your entire Redpanda cluster — topics, configs, consumer group offsets, ACLs, schemas — the works!"

"Shadowing is built into the Redpanda broker itself and uses the standard Kafka API to link clusters. No MirrorMaker 2 or Redpanda Migrator connectors are used under the hood."

Load-bearing properties

  1. Byte-for-byte, offset-preserving replication. The shadow cluster holds the same per-partition offsets as the source. Consumers that fail over to the shadow resume at the same offsets they held on the source, with no offset-translation bookkeeping. Contrast MirrorMaker2, which maintains a per-consumer-group offset-translation map because the two clusters number their messages independently.

  2. Broker-internal. The replication mechanism lives inside the Redpanda broker and uses the Kafka wire protocol to link clusters, not a separate Kafka Connect cluster with MM2 connectors. See concepts/broker-internal-cross-cluster-replication.

  3. Asynchronous. "Shadowing combines an asynchronous replication mechanism with offset preservation." Writes on the source are not blocked on shadow-cluster acknowledgement; the shadow catches up on its own schedule.

  4. Full-cluster clone, not a topic-level mirror. The clone covers "topics, configs, consumer group offsets, ACLs, schemas — the works". This goes beyond MM2, which mirrors at topic granularity and needs separate schema-registry replication and separate ACL replication; Shadowing covers all of them in one feature.

  5. Seconds-range RPO/RTO. "Recovery Point Objectives (RPOs) measured in a few seconds and similar Recovery Time Objectives (RTOs), limited only by timeout settings for producers and consumers." This positions Shadowing between a stretch cluster (RPO = 0) and MM2 (non-zero RPO that depends on replication lag).
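
To make the contrast in property 1 concrete, here is a minimal sketch of the two failover paths a consumer might take. The function names and the checkpoint list are hypothetical and simplified for illustration; they are not Redpanda's or MirrorMaker2's actual API.

```python
from bisect import bisect_right


def resume_offset_preserved(committed_offset: int) -> int:
    """Offset-preserving replica: the source offset is valid as-is."""
    return committed_offset


def resume_offset_translated(
    committed_offset: int, checkpoints: list[tuple[int, int]]
) -> int:
    """Translation-based mirror (simplified, assumed model).

    checkpoints holds (source_offset, destination_offset) pairs sorted by
    source offset. We pick the latest checkpoint at or below the committed
    offset, which is conservative: the consumer may re-read some records.
    """
    source_offsets = [src for src, _ in checkpoints]
    idx = bisect_right(source_offsets, committed_offset)
    if idx == 0:
        # Assumption: with no usable checkpoint, restart from offset 0.
        return 0
    return checkpoints[idx - 1][1]
```

With preserved offsets there is nothing to look up; with translation, the resume point is only as fresh as the most recent checkpoint.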

Setup surface

"Any cluster can create a shadow link to a source cluster. With a few lines of configuration or using the interactive process in Redpanda Console, you can enable a shadow link in seconds to start replicating your most critical workflows."

The Redpanda Console exposes a monitoring view for lag + throughput on the shadow cluster (verbatim "Monitoring a shadow cluster in Redpanda Console" from the post).

Position in the Redpanda DR axis

Redpanda's cross-region DR / HA axis now has three shapes:

| Shape | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Stretch cluster | 0 | Very low (Raft re-election) | Per-write cross-region RTT + egress | Single cluster |
| Shadowing | Seconds | Seconds (client timeout-bound) | Async bandwidth + 2× cluster cost | Two clusters, single feature |
| MirrorMaker2 | Non-zero (lag) | Connector restart + offset-translation | 2× cluster + MM2 infra | Highest (Kafka Connect cluster) |

Shadowing occupies the "two-cluster DR without the connector overhead" slot — offset preservation removes the client-side failover friction that's MM2's load-bearing operational cost; broker-internal implementation removes the Kafka-Connect operational surface.

Contrast with shadow-cluster (release-validation sense)

Not to be confused with shadow cluster in the release-validation sense (Meta Presto's shadow cluster for catching post-compilation regressions on long-running queries). Redpanda's Shadowing is a DR mechanism; the Presto-style shadow cluster is a pre-promotion validation mechanism. Same word, different pattern family.

Target workload

"Designed for workloads with high-throughput, low-latency requirements and can operate with a low, but nonzero, RPO and RTO." — i.e. the same write profile stretch cluster targets but with independent per-region clusters instead of one stretched cluster.

Mechanism (2026-04-21 deep-dive)

The 2026-04-21 Shadow Linking deep-dive post walks the replication mechanism at broker-internal altitude:

"A shadow link is defined within the shadow cluster and creates tasks internal to the broker that read data from the source cluster and write it locally. These tasks read data from the source using the standard Kafka API. Once the link is established, topics will be created and configured automatically, ACLs will be applied, commits will be replicated, and (of course!) messages will be mirrored, all on a continuous basis."

"Each broker in the shadow cluster runs replication tasks that read directly from the brokers in the source cluster, enabling massively parallel data transfer. This fully distributed approach provides excellent throughput and allows you to scale replication capacity simply by adding more brokers, up to the limit of your network."

The canonical wiki concept for this mechanism is concepts/parallel-broker-replication-tasks — the shared-nothing property that makes replication throughput scale linearly with broker count until network saturates.

Three properties fall out:

  • Source cluster is unaware of the link. "A shadow link is configured only on the destination cluster only. The source cluster is completely unaware of the link, aside from the additional read workload it sees." The source cluster needs no config change and no operator action to participate in a shadow link; the shadow cluster unilaterally stands up the replication.
  • Two-axis scaling. "Shadow linking also scales naturally with the cluster, both vertically and horizontally. If you use bigger nodes with more cores, Redpanda's internal shared-nothing architecture can use that to its fullest. If you scale out the cluster and add more nodes, we will use them to increase the shadowing parallelism, all without you needing to tune anything out of the box."
  • Shadow topics are read-only until failover. "While the client of a shadow link is writing to a topic, that topic is read-only to all other producers, ensuring that the topic stays in sync with the source and doesn't diverge in contents. It will only become writable once failed over." Enforced at the broker, not at the client — prevents accidental split-brain writes on the shadow side.
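
The scaling claim above (throughput grows with broker count until the network saturates) can be written as a one-line capacity model. This is a back-of-the-envelope sketch only: the per-broker rate and the network cap are hypothetical inputs, and real scaling will not be perfectly linear.

```python
def replication_capacity(
    brokers: int, per_broker_mib_s: float, network_cap_mib_s: float
) -> float:
    """Idealised aggregate replication throughput in MiB/s.

    Assumes every shadow broker contributes the same rate and that the
    inter-cluster network is the only shared bottleneck.
    """
    if brokers < 0:
        raise ValueError("brokers must be non-negative")
    return min(brokers * per_broker_mib_s, network_cap_mib_s)
```

For example, with an assumed 500 MiB/s per broker and a 4,000 MiB/s link, three brokers are broker-bound while ten brokers are network-bound.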

Performance (scale-tested)

From the 2026-04-21 deep-dive:

"As an illustration of the performance, I recently scale-tested shadowing, driving the source cluster at 2.5 GiB/s. During that experiment, I was able to replicate with a total lag (across all topics) that was consistently lower than 10,000 messages — on a workload producing 2.5 million messages per second — giving us an effective RPO of around 4 milliseconds on average."

Canonical performance disclosure for Shadow Linking:

| Quantity | Value |
|---|---|
| Source throughput | 2.5 GiB/s |
| Message rate | 2.5 million msg/s |
| Total-cluster replication lag | < 10,000 messages |
| Effective RPO | ~4 ms average |

The measured ~4 ms RPO is two to three orders of magnitude tighter than the 25.3 launch post's "RPO … measured in a few seconds" ceiling; the measured case is closer to the stretch-cluster RPO=0 profile than to MM2's seconds-to-minutes profile.

The lag is reported as message count rather than wall-clock time because the broker's native unit is offsets + messages; see concepts/replication-lag-message-count for the RPO-as-derived-quantity framing. (At 2.5 M msg/s, every 10,000-message reduction in lag shrinks RPO by ~4 ms.)
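
The lag-to-RPO conversion is simple arithmetic, sketched below with the figures from the scale test. The function name is hypothetical, and it assumes a steady message rate.

```python
def effective_rpo_seconds(lag_messages: float, messages_per_second: float) -> float:
    """Derive wall-clock RPO from message-count lag at a steady produce rate."""
    if messages_per_second <= 0:
        raise ValueError("messages_per_second must be positive")
    return lag_messages / messages_per_second


# Figures from the scale test: 10,000 messages of lag at 2.5 M msg/s.
rpo_ms = effective_rpo_seconds(10_000, 2_500_000) * 1000
```

With those inputs the result is about 4 ms, matching the disclosed average.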

The implementation is C++, not Java: "the Shadowing components are written in high-performance C++, which means that not only do you get great replication performance, but there's also no Kafka Connect and no JVM tuning in sight." Shadow Linking inherits the rest of Redpanda's no-GC-pause throughput profile.

Five replication axes

The 2026-04-21 deep-dive enumerates the five things Shadow Linking replicates per topic — named in the 25.3 post at list altitude, walked here at one-sentence-mechanism altitude:

  1. Topic data: "All records are replicated byte-for-byte, preserving offsets, timestamps, headers, compression, and batching." Canonical offset-preserving, byte-for-byte replication.
  2. Topic configurations: "This includes the partition count and topic properties such as retention, compression, and cleanup policy. Not all properties are replicated." Config replication is partial; the excluded properties are deferred to the docs.
  3. Consumer group data: "Committed offsets and group membership, enabling failover of consumers."
  4. ACLs / security policies: "Access control lists are replicated to ensure consistent authorization across clusters."
  5. Schema registry data: "The _schemas topic can be replicated when the feature is enabled, allowing schemas (and schema settings, such as compatibility) to be replicated." Redpanda's schema registry is stored as a Kafka topic named _schemas, so replicating it is just another topic replication. It is off by default, which is a DR-critical footgun if the operator doesn't know to enable it.

The 2026-04-21 deep-dive introduces a first-class primitive absent from the 25.3 launch post: failover can be invoked per-topic or per-link:

"When you failover a link, either by topic or entirely, the replication flows stop and the linked topics will become writable to regular producers."

"Keep in mind that if you have an app-level outage, you don't need to failover the whole link — just failover individual topics as needed."

Canonical wiki concept: concepts/per-topic-granularity-failover. Canonical pattern: patterns/topic-level-granular-dr-failover.

Two tools matched to two outage shapes:

  • App-level outage (one service's topic family broken) → failover(topic, link). Other topics stay on the source cluster, unaffected.
  • Region-level outage (source cluster unreachable) → failover(link). All topics promote to writable on the shadow.

Per-topic granularity composes with always-be-failing-over drill discipline — small, per-topic-family DR drills are operationally feasible at a cadence much higher than whole-link drills.

The 2026-04-21 deep-dive discloses a link-deletion guardrail absent from the 25.3 post:

"You can only delete a shadow link once all of the flows are failed over and there are no active replication flows. This is A Good Thing™."

Canonical safety invariant: a shadow link with active replication flows cannot be deleted. The operator must first either fail over every flow (making them writable on the shadow) or drain them to inactive state. This prevents the operator-error shape where someone deletes a link thinking it's unused, leaving consumers suddenly pointed at a cluster that's no longer being fed data.
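
The failover and deletion rules described above can be captured in a small in-memory model. This is a toy sketch under stated assumptions: the class and method names are hypothetical, it models only the failover path, and it is not Redpanda's rpk, REST, or broker API.

```python
class ShadowLinkModel:
    """Toy model of a shadow link's per-topic replication flows."""

    def __init__(self, topics):
        # Every linked topic starts with an active replication flow.
        self.active = set(topics)
        self.failed_over = set()

    def is_writable(self, topic):
        # Shadow topics stay read-only to regular producers until failed over.
        return topic in self.failed_over

    def failover_topic(self, topic):
        # App-level outage: stop one flow and make that topic writable.
        if topic not in self.active:
            raise KeyError(f"no active flow for topic {topic!r}")
        self.active.remove(topic)
        self.failed_over.add(topic)

    def failover_link(self):
        # Region-level outage: fail over every remaining flow.
        for topic in list(self.active):
            self.failover_topic(topic)

    def delete(self):
        # Guardrail: a link with active flows cannot be deleted.
        if self.active:
            raise RuntimeError("link still has active replication flows")
        return True
```

Failing over one topic leaves the other flows replicating, and delete() only succeeds once every flow has been failed over.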

The 2026-04-21 deep-dive introduces a two-cluster deployment shape where both clusters run shadow links to each other:

"This kind of reciprocal active-passive architecture, in which both clusters are active and usable, can still be achieved with parallel shadow links."

"Running a reciprocal active-passive cluster pair is as simple as configuring two shadow links — one on each cluster."

Each cluster is simultaneously a source (for its own data) and a shadow (for the other's data). The load-bearing discipline is topic-name prefix convention: the a_ / b_ (or region-code / DC-code) prefix that encodes origin cluster into the name itself:

"This design benefits from using a consistent prefix to name topics and consumer groups, identifying their source site."

One asymmetry in the reciprocal topology: schema registry must have a single primary site because both clusters would otherwise write to the same _schemas topic. "A primary site for schema registry would need to be chosen (since both sites will use _schemas)." Topic layer is symmetric; schema-registry layer is not.

Not active-active: each topic still has exactly one writer (its owning cluster). No write-conflict problem, no conflict-resolution machinery. The reciprocal shape gets the bidirectionality at the aggregate workload level, not at the per-topic level.
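
A minimal sketch of how the prefix convention decides each topic's role on each site. The prefix map, site names, and function names are hypothetical examples, not anything Redpanda enforces.

```python
# Hypothetical prefix-to-site mapping for a two-cluster reciprocal pair.
SITE_PREFIXES = {"a_": "cluster-a", "b_": "cluster-b"}


def owning_site(topic: str) -> str:
    """Return the site that owns (writes to) a topic, based on its prefix."""
    for prefix, site in SITE_PREFIXES.items():
        if topic.startswith(prefix):
            return site
    raise ValueError(f"topic {topic!r} has no recognised site prefix")


def role_on(topic: str, site: str) -> str:
    """A topic is writable on its owning site and a read-only shadow elsewhere."""
    return "writable source" if owning_site(topic) == site else "read-only shadow"
```

Because every topic name maps to exactly one owner, each topic has a single writer and the two links never replicate the same topic in both directions.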

Hardware cost vs MirrorMaker2

The 2026-04-21 post makes explicit the hardware-cost contrast the 25.3 post only implied:

"Consider replicating a stream of messages at 1GiB/s using an external tool such as MirrorMaker: In addition to the source and sink clusters, you would need another cluster to host the replication workload. In contrast, when using shadowing, no additional hardware is needed."

  • MM2: 3 clusters (source + sink + Connect cluster).
  • Shadow Linking: 2 clusters (source + shadow).

Counted in clusters, the connector-based shape needs roughly 50% more infrastructure at this 1 GiB/s workload (three clusters instead of two), on top of the per-message fidelity gap (MM2 can produce duplicates on the destination; Shadow Linking doesn't).

Observability surface

Three surfaces per the 2026-04-21 deep-dive:

  • Prometheus-compatible metrics: "Prometheus-compatible metrics to see the link status, including replication lag, are published by the broker, so your existing monitoring will automatically pick them up."
  • Redpanda Console — interactive GUI for link state and replication flows.
  • rpk + REST — scripting / automation.

Replication lag is natively reported in message count (see concepts/replication-lag-message-count); wall-clock RPO is derived from lag ÷ throughput.

Failover runbook

A full operational runbook lives at docs.redpanda.com/current/manage/disaster-recovery/shadowing/failover-runbook/ — named in the 2026-04-21 post as "Definitely one to keep bookmarked!". The blog post does not walk the runbook's contents.

Status

  • 2025-11-06 preview — Shadowing introduced in 25.3 launch post.
  • 2026-04-21 mechanism deep-dive — scale-tested at 2.5 GiB/s / 2.5 M msg/s / 4 ms RPO average; reciprocal active-passive, per-topic failover, link-deletion safety canonicalised. Implied GA status (the post describes the feature in present tense throughout) but explicit beta/GA classification not stated.
