
Offset-preserving replication

Definition

Offset-preserving replication is cross-cluster replication where the destination cluster holds the same per-partition offsets as the source cluster. A record written at offset N on the source is reachable at offset N on the destination. Consumers that fail over from source to destination resume at the same offsets they held on the source, without an offset-translation map or re-snapshot.

Offset preservation is a structural property that decides how expensive consumer failover is during disaster recovery. Without it, each consumer group needs an external map that records "offset X on source corresponds to offset Y on destination" — the map has to be kept in sync with the replication stream and consulted at failover time, adding a subsystem to the DR critical path.
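The bookkeeping difference can be made concrete with a small sketch. This is purely illustrative (the topic name, map shape, and function names are made up): without offset preservation, failover consults an external per-partition translation map; with it, the lookup collapses to the identity.

```python
# Hypothetical external translation map a non-offset-preserving setup must
# maintain: per (topic, partition), source offset -> destination offset.
translation_map = {("orders", 0): {100: 87, 101: 88, 102: 89}}

def failover_offset(topic, partition, committed_src_offset):
    """Resolve where to resume on the destination via the external map."""
    return translation_map[(topic, partition)][committed_src_offset]

def failover_offset_preserving(topic, partition, committed_src_offset):
    """With offset preservation there is nothing to translate."""
    return committed_src_offset

# A group committed at source offset 101 resumes at:
assert failover_offset("orders", 0, 101) == 88            # map lookup needed
assert failover_offset_preserving("orders", 0, 101) == 101  # same offset, full stop
```

Note that the map itself is state that must be replicated and kept fresh, which is exactly the subsystem the structural property removes.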

Canonical wiki source

Introduced in the Redpanda 25.3 launch post as the load-bearing property of Redpanda's new Shadowing feature:

"Shadowing combines an asynchronous replication mechanism with offset preservation, allowing for multi-region disaster recovery with simpler client failover procedures."

"Shadowing creates a fully functional, hot-standby clone of your entire Redpanda cluster — topics, configs, consumer group offsets, ACLs, schemas — the works!"

The shadow cluster is "byte-for-byte, offset-preserving" — the full disclosure is in the sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more|25.3 post.

Contrast with MirrorMaker2

MirrorMaker2 (MM2) does not offset-preserve. MM2 runs as a set of Kafka Connect connectors that consume the source cluster's topics and produce the records to the destination cluster — the destination cluster assigns its own offsets on ingest, which are independent of the source's.

To make MM2-replicated data usable for failover, MM2 maintains a per-consumer-group offset translation map in a separate Kafka topic:

  • The source topic's consumer-group commits are mirrored to the destination.
  • MM2 writes (source-offset, destination-offset) translations to a __consumer_offsets-equivalent on the destination.
  • At failover, the consumer reads its last-committed source offset, looks up the translated destination offset, and resumes there.
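The divergence that makes the checkpoint topic necessary can be simulated in a few lines. This is a toy model, not the real Connect API; `DEST_BASE` stands in for whatever offsets the destination partition already holds when mirroring begins.

```python
# Toy simulation of MM2-style replication: the destination assigns its own
# offsets on ingest, so a (source-offset -> destination-offset) checkpoint
# is required per consumer group.
source_log = ["r0", "r1", "r2", "r3"]  # occupies offsets 0..3 on the source

destination_log = []   # destination assigns fresh offsets on produce
checkpoints = {}       # source offset -> destination offset

# Assume the destination partition already held 10 records before mirroring
# started, so mirrored records land at offsets 10, 11, 12, ...
DEST_BASE = 10
for src_offset, record in enumerate(source_log):
    dst_offset = DEST_BASE + len(destination_log)
    destination_log.append(record)
    checkpoints[src_offset] = dst_offset

# At failover, a consumer group committed at source offset 2 must resume at:
resume_at = checkpoints[2]
assert resume_at == 12  # not 2 -- the numbering diverged on ingest
```

If the last checkpoint written lags the replication stream (cost 1 below), the lookup returns a stale destination offset and the consumer replays or skips the gap.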

This works but adds three kinds of operational cost:

  1. Translation map lag — the map may be stale at the exact moment of failover, forcing the consumer to replay or skip.
  2. Client-side translation awareness — consumers need MM2-compatible offset-reset logic; stock Kafka consumers don't know about the map.
  3. Offset-numbering divergence — after a failover the destination becomes the source for the return leg, and offset numbering has drifted.

Offset preservation removes all three costs. A consumer that knows its last source offset resumes at the same offset on the destination, full stop. This is Shadowing's canonical client-side simplification over MM2.

When it's feasible

Offset-preserving replication requires the destination cluster to accept records with externally-determined offsets rather than assign its own. This is a broker-internal capability — the standard Kafka producer API assigns offsets on produce, so a broker that imports records into the same offset slot the source used must do so at a layer below the producer API.

This is why Shadowing is a broker-internal mechanism, not a Kafka Connect connector — a connector that produces records via the public API cannot preserve source offsets, no matter how elaborate its bookkeeping. The feature has to live inside the broker's log layer.
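The log-layer contract this implies can be sketched as two write paths on a partition log: the public produce path where the broker assigns the next offset, and a replication path where the importer names the offset slot. This is a hypothetical API for illustration, not Redpanda's actual internals.

```python
# Sketch of the two write paths offset preservation requires (hypothetical).
class PartitionLog:
    def __init__(self):
        self.records = {}      # offset -> record
        self.next_offset = 0

    def produce(self, record):
        """Public produce path: the broker assigns the next offset."""
        offset = self.next_offset
        self.records[offset] = record
        self.next_offset += 1
        return offset

    def import_at(self, offset, record):
        """Replication path: write into the slot the source used."""
        if offset in self.records:
            raise ValueError(f"offset {offset} already occupied")
        self.records[offset] = record
        self.next_offset = max(self.next_offset, offset + 1)

shadow = PartitionLog()
shadow.import_at(100, "payment-event")       # source offset carried over verbatim
assert shadow.records[100] == "payment-event"
assert shadow.produce("local-write") == 101  # normal produce continues after it
```

The `import_at` path is exactly what the producer API does not expose, which is why no connector built on top of it can provide the property.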

Why offset preservation matters for streaming DR

Seconds-scale DR recovery times hinge on how fast consumers can resume:

  • With translation: restart consumers → block on map lookup → resolve potentially-stale translation → resume at approximate offset → potentially re-process or skip a window of records.
  • With offset preservation: restart consumers → point them at the shadow cluster → resume at the exact committed offset.

The difference is seconds of recovery with no data-processing ambiguity versus a longer recovery with a non-zero window of replay or skip. For latency-sensitive consumers (real-time pipelines, reactive agents, dashboards), this is the load-bearing DR-readiness property.
