Multi-region Raft quorum¶
Pattern¶
Replicate every partition (or shard) across N replicas distributed across R geographic regions under a per-partition Raft consensus group, so that writes are only acknowledged after a cross-region quorum (majority of the N replicas) has persisted the record. Leader election on region loss is handled automatically by Raft — the replicas in surviving regions elect a new leader from the in-sync subset, and the cluster continues serving the application layer with the same partition identity.
The pattern's load-bearing property is RPO = 0 against a region-level outage: no acknowledged write is lost, because the write was acknowledged only after a majority persisted it — and that majority spans regions.
The canonical realisation is a multi-region stretch cluster; the canonical contrast is async MirrorMaker2 replication between two independent clusters (non-zero RPO).
Canonical framing — Redpanda¶
"Data is replicated synchronously via raft protocol between brokers distributed across multiple regions … multi-region clusters have RPO=0 and very low RTO when there is a region-level outage. This is because new leaders are automatically elected — as part of the Raft protocol — in the surviving regions when any region goes down. The replication factor on the cluster or topics tells you how many region failures can be tolerated for the cluster to continue to serve the application layer."
(Source: sources/2025-02-11-redpanda-high-availability-deployment-multi-region-stretch-clusters)
Mechanics¶
Replica placement — region as rack¶
Replicas are placed across regions using the broker's rack-awareness machinery with rack = region identifier, as in Redpanda's three-broker Ansible template. With enable_rack_awareness=true, the partition-replica-placement scheduler spreads each partition's replicas across distinct racks (= regions).
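A minimal sketch of the region-as-rack node configuration, assuming Redpanda's `rack` node property and `enable_rack_awareness` cluster property (region names are illustrative):

```yaml
# redpanda.yaml on a broker running in us-east-1 (illustrative region name).
# Advertising the region as the broker's "rack" makes the placement
# scheduler treat regions as the failure domains to spread across.
redpanda:
  rack: "us-east-1"
```

Rack-aware placement is then enabled cluster-wide, e.g. via `rpk cluster config set enable_rack_awareness true`.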
Commit rule — majority quorum¶
The Raft commit rule is: a log entry is committed when a majority of the replica set has persisted it. On RF=3 split across 3 regions, commit requires at least 2 of 3 regions to acknowledge — i.e., the leader's own region + at least one other.
This is what turns a multi-region topology into an RPO=0 shape: no committed write exists that has not survived at least a majority of regions.
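The commit rule can be sketched as a few lines of Python: with the leader's view of how far each replica (itself included) has persisted the log, the highest committed index is the one persisted by a majority. (Real Raft additionally requires the entry to belong to the current term; that detail is omitted here.)

```python
def commit_index(match_index: dict[str, int]) -> int:
    """Raft commit-rule sketch: an entry is committed once a majority
    of the replica set (leader included) has persisted it. The highest
    such index is the majority-th largest per-replica persisted index."""
    persisted = sorted(match_index.values(), reverse=True)
    majority = len(persisted) // 2 + 1       # e.g. 2 of 3 replicas
    return persisted[majority - 1]           # persisted by >= majority

# RF=3 split across three regions: the leader (us-east) and eu-west
# have entry 10 durable; ap-south lags at 7. Entry 10 is committed
# because 2 of 3 regions -- a cross-region majority -- persisted it.
acked = {"us-east": 10, "eu-west": 10, "ap-south": 7}
print(commit_index(acked))  # → 10
```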
Leader election — automatic failover¶
On region loss, the Raft re-election protocol runs within the surviving regions:
- Surviving replicas detect leader heartbeat loss.
- Replicas initiate leader election, exchanging term numbers and log-match votes.
- A replica with the most up-to-date log receives a majority of votes and becomes the new leader.
- The application layer reconnects to the new leader (or, via the cluster's routing metadata, is redirected automatically).
RTO = election timeout + client reconnect. On Redpanda / Raft, this is typically seconds.
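The "most up-to-date log" vote in step 3 follows Raft's log-comparison rule, sketched here in Python: a replica grants its vote only if the candidate's last log term is higher, or terms are equal and the candidate's log is at least as long.

```python
def grant_vote(voter_last_term: int, voter_last_index: int,
               cand_last_term: int, cand_last_index: int) -> bool:
    """Raft log-match voting rule sketch: vote for the candidate only
    if its log is at least as up to date as the voter's own -- higher
    last term wins; on equal terms, the longer log wins."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index

# After a region outage, only a surviving replica whose log reaches at
# least as far as each voter's can gather a majority and lead.
print(grant_vote(3, 12, 3, 12))  # True: logs are equally up to date
print(grant_vote(3, 12, 3, 11))  # False: candidate's log is behind
```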
Trade-offs to manage¶
The canonical five hazards the Redpanda post enumerates (all paid per write once acks=all is the producer mode):
- Cross-region RTT per commit — each acks=all write waits for the quorum round-trip. Regional stretch (60-80 ms) is already significant; transoceanic stretch (150+ ms) is usually impractical for OLTP.
- Replication-bandwidth cost — every committed byte is replicated RF−1 times, mostly cross-region at cross-region bandwidth rates. See concepts/cross-region-bandwidth-cost.
- Compute overhead — replication is CPU work on every replica.
- Client routing sensitivity — mis-routed clients pay cross-region client-to-leader hops; see patterns/client-proximal-leader-pinning.
- Consistency vs availability — minority-region writes fail during a partition (Raft fails closed on minority).
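The RTT hazard above can be made concrete with a small Python sketch: with acks=all, the leader counts itself toward the majority and then waits for the fastest RF//2 follower acknowledgements, so per-write quorum latency is roughly the k-th smallest follower RTT (disk and processing time ignored; RTT figures illustrative).

```python
def quorum_wait_ms(follower_rtts_ms: list[float]) -> float:
    """Per-write quorum latency sketch for acks=all: the leader needs
    k = RF // 2 follower acks (plus itself) for a majority, so it
    waits for the k-th fastest follower round-trip."""
    rf = len(follower_rtts_ms) + 1   # followers + the leader itself
    k = rf // 2                      # follower acks needed for majority
    return sorted(follower_rtts_ms)[k - 1]

# RF=3 stretched over us-east (leader), us-west (65 ms), eu-west (80 ms):
# every commit pays one ~65 ms cross-region round-trip.
print(quorum_wait_ms([65.0, 80.0]))  # → 65.0
```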
Composing with proximity optimisations¶
The pattern composes with four operator knobs that reduce the client-facing cost without weakening the quorum-side property (except acks=1, which explicitly weakens durability):
- Leader pinning: bias leadership to client-proximal region. Preserves quorum; eliminates client-to-leader cross-region hop.
- Follower fetching: consumers read from closest replica. Preserves quorum (writes unaffected); introduces bounded read staleness.
- Remote read replica topic: read-only object-storage-backed mirror cluster. Preserves quorum on the origin; scales read fan-out across clusters.
- acks=1: relaxes quorum wait to leader-only. Trades durability against region loss for latency. Orthogonal to leader pinning (the two can be composed).
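The acks knob, sketched as Kafka-compatible producer properties (values illustrative):

```properties
# acks=all: acknowledged only after the quorum has persisted the write --
# pays the cross-region round-trip per commit, keeps RPO=0.
acks=all

# acks=1 (the relaxed alternative): leader-only acknowledgement --
# lower latency, but non-zero RPO if the leader's region is lost
# before followers replicate.
# acks=1
```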
When to use this pattern¶
Pick multi-region Raft quorum when:
- RPO = 0 is a hard requirement on the workload — financial ledger, transactional event log, mission-critical durable queue.
- Regions are close enough (RTT under write-SLA budget) that cross-region quorum is tolerable.
- Client traffic can be regionally localised (leader pinning + follower fetching) so only the replication-side hops are cross-region.
- Single-control-plane operational model is preferable to running two independent clusters.
Pick async-replication-for-cross-region (MM2) instead when:
- Write SLA cannot tolerate cross-region quorum.
- Per-cluster availability matters more than cross-cluster consistency.
- Cross-region bandwidth must be held to one replication stream per topic (MM2 sends each record across the WAN once; stretch-cluster sends each record RF−1 times across the WAN).
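The bandwidth contrast in the last point is simple arithmetic, sketched in Python under the assumption of one replica per region (so every replica copy crosses the WAN):

```python
def wan_bytes(payload_bytes: int, rf: int, mode: str) -> int:
    """Cross-region bandwidth sketch: a stretch cluster ships each
    committed record to its RF-1 remote replicas (assuming one replica
    per region); MM2 ships each record across the WAN exactly once."""
    if mode == "stretch":
        return payload_bytes * (rf - 1)
    if mode == "mm2":
        return payload_bytes
    raise ValueError(f"unknown mode: {mode}")

# 1 GiB of committed records at RF=3, one replica per region:
print(wan_bytes(2**30, 3, "stretch"))  # 2 GiB across the WAN
print(wan_bytes(2**30, 3, "mm2"))      # 1 GiB across the WAN
```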
Seen in¶
- sources/2025-02-11-redpanda-high-availability-deployment-multi-region-stretch-clusters — canonical wiki statement of multi-region Raft quorum as the RPO=0 shape; automatic Raft re-election on region loss; replication-factor-sets-tolerance calibration; region-as-rack deployment template.
Related¶
- systems/redpanda, systems/kafka
- concepts/multi-region-stretch-cluster — the shape this pattern produces.
- concepts/rpo-rto — RPO=0 property.
- concepts/strong-consistency
- concepts/acks-producer-durability
- concepts/in-sync-replica-set
- concepts/leader-follower-replication
- concepts/cross-region-bandwidth-cost
- concepts/mirrormaker2-async-replication
- patterns/leader-based-partition-replication — the intra-cluster pattern this extends to regional span.
- patterns/client-proximal-leader-pinning
- patterns/closest-replica-consume
- patterns/async-replication-for-cross-region — the alternative pattern.