# Redpanda — High availability deployment: Multi-region stretch clusters
## Summary
Part four of Redpanda's HA/DR series frames
the multi-region stretch cluster — a single
Redpanda cluster whose brokers are distributed
across two or more cloud regions (or data-centre regions), with
per-partition Raft groups replicating
synchronously across all regions. The shape's canonical property is
RPO = 0 against a region-level outage: any committed write has
already been persisted in the surviving regions at commit time, and
Raft elects a new partition leader from an in-sync follower in a
surviving region automatically. This is the availability-maximising
alternative to running two independent clusters with
MirrorMaker2 async
replication (which carries a non-zero RPO plus the operational
cost of two clusters). The post is a pros-and-cons tour plus
four operator knobs that exist specifically because cross-region
synchronous replication is expensive:
leader pinning (pin partition leaders
to client-proximal regions), producer acks=1 (leader-only ack,
trading durability for latency), follower
fetching (consumer reads from closest replica rather than leader),
and remote read replica
topics (read-only topic backed by object storage in any region).
Closes with an Ansible hosts.ini pattern encoding region-as-rack
(rack=us-west-2, rack=us-east-2, rack=eu-west-2 on three
brokers) and a simulation-technique disclosure:
OpenMessaging Benchmark with
tc injecting inter-broker network latency to stand in for a real
multi-region deployment during performance testing. Tier-3
substrate-qualifying: distributed-systems internals (cross-region
Raft quorum), scaling trade-offs (latency / bandwidth cost / strong-
consistency-vs-availability), infrastructure architecture (stretch
topology + Ansible deployment + OMB simulation), production-relevant
guidance on RPO/RTO.
## Key takeaways
- Multi-region stretch cluster = single Raft cluster across regions with RPO=0. "A multi-region Redpanda cluster is a deployment topology that allows customers to run a single Redpanda cluster across multiple data centers or multiple cloud regions. It's often referred to as a stretch cluster, where a single cluster stretches across multiple geographic regions with data distributed across all deployment regions. Data is replicated synchronously via raft protocol between brokers distributed across multiple regions." The replication-factor dial sizes region-failure tolerance: "The replication factor on the cluster or topics tells you how many region failures can be tolerated for the cluster to continue to serve the application layer."
- Stretch cluster vs MirrorMaker2 async mirror is the canonical high-availability-vs-performance axis. "Unlike in asynchronous replication, where you have two separate clusters with MirrorMaker2 replication between them and a non-zero RPO, multi-region clusters have RPO=0 and very low RTO when there is a region-level outage. This is because new leaders are automatically elected — as part of the Raft protocol — in the surviving regions when any region goes down." The stretch cluster pays the cross-region RTT on every write; MirrorMaker2 async adds no cross-region write latency but concedes a non-zero RPO.
- Five enumerated performance hazards of multi-region: (1) Network latency — "In a multi-region setup, producers and consumers may be located in different regions, which can increase the time it takes for data to be written or read." (2) Replication overhead — "Replicating data across regions requires additional network bandwidth and computational resources." (3) Cross-region bandwidth costs — "Transferring data between regions can incur significant bandwidth costs, especially in cloud environments where cross-region data transfer is billed separately." (4) Client-side routing and proximity — "Producers and consumers need to connect to the nearest region to minimize latency. However, improper routing can lead to sub-optimal performance." (5) Consistency vs. availability — "Raft ensures strong consistency in achieving quorum during writes in a multi-region setup. This ensures the maximum replicas across all regions have the same data simultaneously, which can increase latency."
- Leader pinning is the first-line latency mitigation. "Leader pinning is a feature in Redpanda that lets you specify preferred locations for topic partition leaders and pin to specific regions in a multi-region cluster. Leader pinning ensures a topic's partition leaders are geographically closer to clients. This helps decrease networking costs and guarantees lower latency by routing produce/consume requests to brokers located in specific regions." Enterprise-licensed feature on Self-Managed + Redpanda Cloud. Canonicalises the client-proximal-leader optimization that eliminates cross-region hops from the write side when clients can be assigned to a specific region's leadership partitions.
- Producer `acks=1` relaxes cross-region durability explicitly. "For scenarios where produce latency exceeds requirements, you can configure producers to use `acks=1` instead of `acks=all`. This reduces latency by only waiting for the leader to acknowledge rather than the replication factor and quorum of brokers." Caveat verbatim: "However, this comes at the cost of potentially decreased message durability." Composes directly with acks-producer-durability — on a stretch cluster, `acks=all` is the per-write cross-region Raft quorum wait; `acks=1`'s leader-only ack trades durability against region outage for sub-RTT produce latency.
- Follower fetching = closest-replica consume. "Follower fetching is a feature in Redpanda that allows consumers to fetch records from the closest replica of a topic partition, regardless of whether it's a leader or a follower... Follower fetching helps reduce latency and potential costs associated with multi-region deployments by allowing consumers to read from geographically closer followers." Consumer-side analogue of leader pinning — symmetric treatment: pin write-path work to client-proximal leaders, redirect read-path work to client-proximal followers. This is the Kafka-API KIP-392 rack-aware consumer behaviour on a stretch cluster: the leader-only-serves-reads assumption is retired.
- Remote read replica topics = object-storage-backed read-only mirror decoupling read load from origin. "A Remote Read Replica topic is a read-only topic that mirrors a topic on a different cluster. It works with both Tiered Storage and archival storage. Remote Read Replicas allow you to create a separate remote cluster for consumers of a specific topic, populating its topics from remote storage. This can serve consumers without increasing the load on the origin cluster. These read-only topics access data directly from object storage instead of the topics' origin cluster, which means there's no impact on the performance of the original cluster." Architecture substrate is tiered storage — the origin cluster already offloads segments to S3/GCS, and the remote read replica pulls segments directly from object storage rather than from the origin brokers. Scales read fan-out without scaling the origin cluster's broker fleet.
- Rack awareness = region identifier at deployment. "You can manually deploy such a cluster or create a cluster using our Ansible collection. Once you create the VMs in the regions of your choice, you can create a `hosts.ini` file with the region information specified as rack information per broker." Worked three-broker example: `rack=us-west-2`, `rack=us-east-2`, `rack=eu-west-2` on three cross-region brokers; after provisioning, `rpk cluster config get enable_rack_awareness` returns `true` and `rpk cluster status | grep RACK -A3` shows the three brokers on distinct racks. Region-as-rack is load-bearing because Redpanda's rack-aware replica-placement logic uses the rack dimension to spread Raft group members across regions — same mechanism as multi-AZ rack awareness from part 3 of the series, different cardinality (regions not AZs).
- Simulation technique: OMB + `tc` to inject inter-broker latency. "To simulate a multi-region Redpanda cluster, we set up a 3-node Redpanda cluster with i3en.xlarge VMs. These VMs have four cores per node with 32 GB of memory each, and simulate a Tier-2 Redpanda Cloud cluster. We used `tc` to only add network latency between Redpanda brokers. No network latency was added between the OMB worker nodes and Redpanda broker nodes to simulate leader pinning." Published OMB driver + workload YAML: `replicationFactor: 3`, `acks=all`, `linger.ms=1`, `batch.size=131072`, 144 partitions on 1 topic, 50 MB/s rate, 4 producers + 4 consumers, 5-min run + 5-min warm-up. The technique's general relevance is the ability to isolate the cross-broker latency dimension from the client-to-broker latency dimension — you inject only the former and study its effect on produce/consume tails without paying cross-region cloud bandwidth during testing.
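The selective-latency simulation from the last takeaway can be sketched with a standard `tc` prio + netem recipe. The interface name, peer subnets, and the 40 ms figure are illustrative assumptions, not values from the post (the post does not disclose its injected latencies):

```shell
# Illustrative only: on each broker, delay traffic toward the other brokers'
# subnets while leaving OMB-worker -> broker traffic untouched (this untouched
# client path is what stands in for leader pinning).
# eth0, 10.1.0.0/16, 10.2.0.0/16, and 40ms are assumed values.
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 40ms
# Steer only peer-broker traffic into the delayed band:
tc filter add dev eth0 parent 1: protocol ip prio 3 u32 \
  match ip dst 10.1.0.0/16 flowid 1:3
tc filter add dev eth0 parent 1: protocol ip prio 3 u32 \
  match ip dst 10.2.0.0/16 flowid 1:3
```

Run on every broker with the other two regions' subnets matched; tear down with `tc qdisc del dev eth0 root`.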
## Architecture primitives canonicalised
### Multi-region stretch cluster — the topology
"A single Redpanda cluster across multiple data centers or multiple cloud regions ... data is distributed across all deployment regions. Data is replicated synchronously via raft protocol between brokers distributed across multiple regions and also accessible from various points globally."
Canonicalised as concepts/multi-region-stretch-cluster. The concept subsumes (1) a single control plane / admin identity; (2) cross-region Raft quorum on every partition; (3) per-region bandwidth costs aggregated at the broker level; (4) region as rack in the rack-aware replica-placement machinery.
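The "how many region failures can be tolerated" dial from the takeaways reduces to Raft majority arithmetic. A minimal sketch, assuming exactly one replica per region:

```python
def regions_tolerated(replication_factor: int) -> int:
    # A partition stays writable while a strict majority of its
    # replicas survive; with one replica per region, each region
    # loss removes exactly one replica.
    majority = replication_factor // 2 + 1
    return replication_factor - majority

assert regions_tolerated(3) == 1  # 3 regions: lose any 1
assert regions_tolerated(5) == 2  # 5 regions: lose any 2
```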
### Leader pinning — write-side client-proximal
"Leader pinning is a feature in Redpanda that lets you specify preferred locations for topic partition leaders and pin to specific regions in a multi-region cluster. Leader pinning ensures a topic's partition leaders are geographically closer to clients. This helps decrease networking costs and guarantees lower latency by routing produce/consume requests to brokers located in specific regions."
Canonicalised as concepts/leader-pinning +
patterns/client-proximal-leader-pinning. The pattern is the
dual of `acks=1` — both reduce cross-region cost on the write
path; leader pinning preserves durability while biasing topology;
`acks=1` degrades durability while leaving topology alone.
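Operationally the pattern is a config sketch along these lines. The property names follow Redpanda's leader-pinning documentation (`default_leaders_preference` cluster-wide, `redpanda.leaders.preference` per topic) and should be verified against your Redpanda version; the topic name and regions are illustrative:

```shell
# Cluster-wide default: prefer leaders in us-west-2 (region encoded as rack).
rpk cluster config set default_leaders_preference "racks:us-west-2"

# Per-topic override for a topic whose clients sit in us-east-2.
rpk topic alter-config orders --set redpanda.leaders.preference="racks:us-east-2"
```

Config fragment only; requires an Enterprise licence per the post's caveat.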
### Follower fetching — read-side client-proximal
"Follower fetching is a feature in Redpanda that allows consumers to fetch records from the closest replica of a topic partition, regardless of whether it's a leader or a follower."
Canonicalised as concepts/follower-fetching + patterns/closest-replica-consume. The Kafka-API substrate is KIP-392 (rack-aware consumer). On a stretch cluster, follower fetching composes with leader pinning: leader pinning optimises the write side, follower fetching optimises the read side; neither requires changing the partition's leader assignment to serve in-region reads.
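In practice this is the KIP-392 client knob plus Redpanda's rack metadata; a hedged config sketch, with `us-west-2` as the assumed consumer region:

```shell
# Broker side: rack (here: region) labels must be set per broker and
# rack awareness enabled cluster-wide.
rpk cluster config set enable_rack_awareness true

# Consumer side (standard Kafka client property, set in the consumer's
# configuration): declare the client's location so fetches are routed
# to the closest replica rather than always to the leader.
#   client.rack=us-west-2
```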
### Remote read replica topic — object-storage-backed read fan-out
"Remote Read Replicas allow you to create a separate remote cluster for consumers of a specific topic, populating its topics from remote storage. This can serve consumers without increasing the load on the origin cluster."
Canonicalised as concepts/remote-read-replica-topic. The primitive is architecturally distinct from follower fetching: follower fetching reads from a replica in the same cluster (the origin's own follower broker); remote read replica reads from a separate cluster backed by the origin's tiered-storage bucket. The former scales read latency; the latter scales read fan-out without scaling origin brokers.
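Creation reduces to a topic property on the remote cluster pointing at the origin's tiered-storage bucket. A sketch per Redpanda's Remote Read Replica docs — the topic and bucket names are illustrative, and the property name should be checked against your version:

```shell
# On the remote (read-only) cluster, which is configured against the
# origin cluster's object-storage bucket:
rpk topic create orders -c redpanda.remote.readreplica=origin-tiered-storage-bucket
```

The origin topic must already have tiered storage (remote write) enabled, since the replica reads segments from the bucket, not from origin brokers.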
### Consistency vs. availability — the canonical trade-off exposition
The post's most crisp architectural statement on the consistency-availability axis:
"Raft ensures strong consistency in achieving quorum during writes in a multi-region setup. This ensures the maximum replicas across all regions have the same data simultaneously, which can increase latency. If strong consistency is not an absolute requirement but availability is, at the expense of slightly older data, multiple independent Redpanda clusters across different regions with MM2 replication can be set up. This prioritizes cluster availability allowing regions to operate independently but can lead to slightly older data that is a factor of how quickly replication can occur."
Two deployment shapes map to two points on the consistency-availability axis:
| Shape | Consistency | Availability on region partition | RPO |
|---|---|---|---|
| Multi-region stretch (this post) | Strong (Raft quorum) | Write-unavailable on the minority-region side; read still works on any surviving region for the log prefix | 0 |
| MirrorMaker2 async (two clusters + replication) | Eventual | Each cluster stays writable in its own region; reads can be stale | Non-zero (= replication lag at outage time) |
The MM2 alternative shape corresponds to async-replication-for-cross-region's pattern — the key difference being that MM2 runs between two independent Kafka/Redpanda clusters rather than between replicas within a single cluster.
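The latency cost of the strong-consistency row admits a first-order model: with `acks=all`, the leader acknowledges once enough followers respond to complete a Raft majority, so produce latency is bounded below by the RTT to the nearest quorum-completing region. A sketch (RTT figures are illustrative; disk fsync and queueing are ignored):

```python
def quorum_commit_rtt(follower_rtts_ms: list[float], rf: int) -> float:
    # The leader counts itself toward the majority, so it must wait
    # only for the fastest (majority - 1) followers to acknowledge.
    followers_needed = (rf // 2 + 1) - 1
    return sorted(follower_rtts_ms)[followers_needed - 1]

# RF=3 leader in us-west-2 with followers in us-east-2 (~65 ms) and
# eu-west-2 (~140 ms): the commit waits only for the closer follower.
assert quorum_commit_rtt([65.0, 140.0], rf=3) == 65.0
```

This is also why the MM2 row pays none of this: async replication takes the cross-region RTT off the produce path entirely and moves it into the RPO.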
## Operational numbers
- Ansible 3-region deployment: three brokers, one broker per region, `rack=us-west-2` / `us-east-2` / `eu-west-2`. Minimal example but load-bearing as a pattern template.
- OMB simulation substrate: 3× `i3en.xlarge` (4 vCPU, 32 GB memory per node), simulates a Tier-2 Redpanda Cloud cluster.
- Workload: 50 MB/s, 1 topic × 144 partitions, 4 producers + 4 consumers, 1024-byte messages with 50% random bytes, 5-min warm-up + 5-min test.
- Producer config: `acks=all`, `linger.ms=1`, `batch.size=131072` (128 KB), `request.timeout.ms=300000`.
- Consumer config: `auto.offset.reset=earliest`, `enable.auto.commit=false`.
- Latency measurement technique: `tc` inter-broker only — no `tc` between OMB workers and brokers, "to simulate leader pinning."
- Benchmark outcome numbers: the post refers to a "Publish latency chart for different lower stretch configurations" and says "all runs were on a Tier-2 cluster and achieved a throughput of 50 MBps", but the specific per-percentile numbers are delivered only via an inline image — not extractable from the text.
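The disclosed settings map onto an OpenMessaging Benchmark workload file along these lines. This is a reconstruction from the numbers above, not the post's verbatim YAML; field names follow OMB's workload schema and should be checked against the OMB version used:

```yaml
# Reconstructed OMB workload sketch for the multi-region simulation.
name: multi-region-stretch-sim     # assumed name
topics: 1
partitionsPerTopic: 144
messageSize: 1024
useRandomizedPayloads: true
randomBytesRatio: 0.5              # "50% random bytes"
randomizedPayloadPoolSize: 1000    # assumed; not disclosed in the post
producersPerTopic: 4
subscriptionsPerTopic: 1
consumerPerSubscription: 4
producerRate: 51200                # 50 MB/s at 1 KiB messages
warmupDurationMinutes: 5
testDurationMinutes: 5
```

`replicationFactor: 3` and the producer properties (`acks=all`, `linger.ms=1`, `batch.size=131072`) live in the OMB driver file, not the workload file.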
## Cross-source continuity
- Four-part series: this is part 4. Parts 1–3 cover single-AZ HA, partition leadership + rack awareness, and multi-AZ deployment respectively. Part 4 extends the rack dimension from AZ to region.
- Batch-tuning sibling: James Kinley's sources/2024-11-19-redpanda-batch-tuning-in-redpanda-for-optimized-performance-part-1|2024-11-19 part 1 and sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2|2024-11-26 part 2 canonicalise producer-side batching substrate; this post names `linger.ms=1`, `batch.size=131072`, `acks=all` in the OMB driver config as the exact producer knobs tuned. The stretch-cluster `acks=1` exception walked here pairs with the 2024-11-19 part 1's "acks durability scope deliberately excluded" caveat — this post is where acks-on-stretch-cluster gets canonicalised.
- Kafka-API equivalence: of the four operator knobs (leader pinning, `acks=1`, follower fetching, remote read replica), `acks=1` maps to Kafka's durability dial and follower fetching to KIP-392 (rack-aware consumer); remote read replica has no direct equivalent in upstream Kafka — Redpanda's object-storage-backed read-only mirror is substrate-specific.
## Caveats & reproducibility gaps
- Pedagogy / marketing voice: no customer case study; no retrospective incident narrative; no post-mortem from a real region-outage event.
- Benchmark image-only: the publish-latency chart is delivered as an embedded image; specific p50 / p99 / p99.9 numbers at different stretch configurations and `acks` values are not in the body text — reproducibility requires reading the image.
- `tc` latency values not disclosed: the exact inter-broker latency injected during simulation (e.g., 30 ms for cross-AZ, 60-80 ms for cross-region, 150+ ms for transoceanic) is not published. The OMB workload YAML gives deterministic reproducibility on the workload side; the latency dimension is effectively a free parameter.
- Region failure rate unstated: "regional outages can still happen" + LA-fires gesture, but no quantitative baseline (e.g., historical AWS regional outage rate, Redpanda-observed multi-region cluster failure-recovery time distribution).
- Write availability under partition not analysed: the post frames availability in terms of RPO/RTO for region-level outages but does not distinguish a clean-region-death from a network-partition case, where the minority-region side becomes write-unavailable but still reachable. The Raft quorum semantics handle this correctly but are not walked.
- `acks=1` + follower fetching composition not discussed: the two mitigations interact — `acks=1` means a write is durable on the leader only, and follower fetching means consumers may read from followers that have not yet replicated the leader's write. The read-your-writes window on `acks=1` + `follower_fetch=true` is bounded by replica-fetch latency; not discussed.
- No discussion of control-plane partitioning: if the admin control plane (rpk, Redpanda Console) lives in one region and that region is lost, is the cluster still reconfigurable from a surviving region? Implicit but not walked.
- No cost framing: cross-region bandwidth cost is flagged as a hazard (item 3 of 5) but not quantified in $/TB or $/million-writes. Reader is referred to a separate calculate cloud data transfer costs post for the actual numbers.
- K8s-deployment gap: "Self-Managed on K8s currently supports only multi-AZ deployments in all the cloud providers." — multi-region stretch is only available on VM / bare-metal / cloud-compute instance deployments or on Redpanda Cloud Dedicated + BYOC. Kubernetes operator gap is a current-state limitation.
- Leader-pinning is enterprise-licensed: a soft-paywall caveat — the first-line latency mitigation is locked behind a commercial licence on Self-Managed.
- Replication factor upper bound on real multi-region deployments: RF=5 or RF=7 on a 3-region stretch means multiple replicas per region, which works but ties the region-failure-tolerance claim ("how many region failures can be tolerated") to replica-distribution details the post does not walk.
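That replica-distribution dependence can be made concrete: with multiple replicas per region, the tolerated-region count depends on how the rack-aware placer spreads the Raft group, not on RF alone. A sketch, taking the adversarial worst case where the densest regions fail first:

```python
def worst_case_regions_tolerated(replicas_per_region: list[int]) -> int:
    # A partition stays writable while a strict majority of its replicas
    # survive; assume the regions holding the most replicas fail first.
    rf = sum(replicas_per_region)
    majority = rf // 2 + 1
    lost, tolerated = 0, 0
    for count in sorted(replicas_per_region, reverse=True):
        lost += count
        if rf - lost >= majority:
            tolerated += 1
        else:
            break
    return tolerated

assert worst_case_regions_tolerated([1, 1, 1]) == 1        # RF=3, 3 regions
assert worst_case_regions_tolerated([2, 2, 1]) == 1        # RF=5, 3 regions
assert worst_case_regions_tolerated([3, 2, 2]) == 1        # RF=7, 3 regions
assert worst_case_regions_tolerated([1, 1, 1, 1, 1]) == 2  # RF=5, 5 regions
```

So on 3 regions, raising RF from 3 to 5 or 7 buys broker-failure tolerance but never a second region of tolerance — consistent with the caveat.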
## Source
- Original: https://www.redpanda.com/blog/multi-region-stretch-clusters
- Raw markdown: raw/redpanda/2025-02-11-high-availability-deployment-multi-region-stretch-clusters-i-acca7d9e.md
## Related
- systems/redpanda
- systems/kafka — Kafka-API equivalence: `acks` and KIP-392 (rack-aware consumer) are upstream Kafka primitives.
- systems/openmessaging-benchmark — simulation substrate.
- concepts/multi-region-stretch-cluster
- concepts/leader-pinning
- concepts/follower-fetching
- concepts/remote-read-replica-topic
- concepts/mirrormaker2-async-replication
- concepts/rpo-rto — RPO=0 is the canonical stretch-cluster property.
- concepts/strong-consistency
- concepts/acks-producer-durability — `acks=1` on stretch cluster as durability-relaxation knob.
- concepts/in-sync-replica-set
- concepts/leader-follower-replication
- concepts/cross-region-bandwidth-cost
- patterns/multi-region-raft-quorum — the pattern canonicalised by this post.
- patterns/client-proximal-leader-pinning
- patterns/closest-replica-consume
- patterns/async-replication-for-cross-region — the alternative shape (MM2).
- patterns/tc-latency-injection-for-geo-simulation
- companies/redpanda