
Multi-region Raft quorum

Pattern

Replicate every partition (or shard) across N replicas distributed across R geographic regions under a per-partition Raft consensus group, so that writes are only acknowledged after a cross-region quorum (majority of the N replicas) has persisted the record. Leader election on region loss is handled automatically by Raft — the replicas in surviving regions elect a new leader from the in-sync subset, and the cluster continues serving the application layer with the same partition identity.

The pattern's load-bearing property is RPO = 0 against a region-level outage: no acknowledged write is lost, because the write was acknowledged only after a majority persisted it — and that majority spans regions.

The canonical realisation is a multi-region stretch cluster; the canonical contrast is async MirrorMaker2 replication between two independent clusters (non-zero RPO).

Canonical framing — Redpanda

"Data is replicated synchronously via raft protocol between brokers distributed across multiple regions … multi-region clusters have RPO=0 and very low RTO when there is a region-level outage. This is because new leaders are automatically elected — as part of the Raft protocol — in the surviving regions when any region goes down. The replication factor on the cluster or topics tells you how many region failures can be tolerated for the cluster to continue to serve the application layer."

(Source: sources/2025-02-11-redpanda-high-availability-deployment-multi-region-stretch-clusters)

Mechanics

Replica placement — region as rack

Replicas are placed across regions using the broker's rack-awareness machinery with rack = region identifier. Redpanda's three-broker Ansible template:

34.110.102.41  rack=us-west-2
35.236.35.49   rack=us-east-2
35.114.39.36   rack=eu-west-2

With enable_rack_awareness=true, the partition-replica-placement scheduler spreads each partition's replicas across distinct racks (= regions).
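A toy sketch of what rack-aware spreading does (this is an illustration of the placement invariant, not Redpanda's actual scheduler; `place_replicas` and its round-robin strategy are hypothetical):

```python
from itertools import cycle

def place_replicas(partitions, brokers_by_rack, rf=3):
    """Toy rack-aware placement: assign each partition RF replicas on
    brokers in RF distinct racks (here, rack = region), round-robin.
    Illustrative only -- not the real partition-replica scheduler."""
    racks = sorted(brokers_by_rack)
    assert rf <= len(racks), "full spread needs at least RF distinct racks"
    rack_cycle = cycle(racks)
    assignment = {}
    for p in partitions:
        chosen = [next(rack_cycle) for _ in range(rf)]  # RF consecutive racks
        # pick one broker from each chosen rack
        assignment[p] = [brokers_by_rack[r][p % len(brokers_by_rack[r])]
                         for r in chosen]
    return assignment

# The three-broker template above, one broker per region:
brokers = {"us-west-2": ["34.110.102.41"],
           "us-east-2": ["35.236.35.49"],
           "eu-west-2": ["35.114.39.36"]}
place_replicas(range(2), brokers, rf=3)  # every partition spans all 3 regions
```

The invariant that matters downstream is that each partition's replica set crosses regions, which is what makes the majority-commit rule region-spanning.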

Commit rule — majority quorum

The Raft commit rule is: a log entry is committed when a majority of the replica set has persisted it. On RF=3 split across 3 regions, commit requires at least 2 of 3 regions to acknowledge — i.e., the leader's own region + at least one other.

This is what turns a multi-region topology into an RPO=0 shape: no committed write exists that has not survived at least a majority of regions.
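The commit rule itself is a one-line predicate; a minimal sketch (the `is_committed` helper is hypothetical, with regions standing in for broker IDs):

```python
def is_committed(acks: set, replicas: list) -> bool:
    """Raft commit rule: an entry is committed once a strict majority of
    the replica set has persisted it. The leader's own persist counts."""
    return len(acks & set(replicas)) >= len(replicas) // 2 + 1

replicas = ["us-west-2", "us-east-2", "eu-west-2"]   # RF=3, one per region

is_committed({"us-west-2"}, replicas)                # False: leader alone
is_committed({"us-west-2", "us-east-2"}, replicas)   # True: 2 of 3 regions
```

With one replica per region, "majority of replicas" and "majority of regions" coincide, which is exactly the RPO=0 argument.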

Leader election — automatic failover

On region loss, the Raft re-election protocol runs within the surviving regions:

  • Surviving replicas detect leader heartbeat loss.
  • Replicas initiate leader election, exchanging term numbers and log-match votes.
  • A replica whose log is at least as up-to-date as each voter's receives a majority of votes and becomes the new leader (Raft's vote rule guarantees the winner holds every committed entry).
  • The application layer reconnects to the new leader (or, via the cluster's routing metadata, is redirected automatically).

RTO = election timeout + client reconnect. On Redpanda / Raft, this is typically seconds.
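The log-match vote exchange in the steps above can be sketched as follows (a simplified model of Raft's up-to-date check; `grants_vote` and `wins_election` are hypothetical names, and logs are lists of `(term, entry)` pairs):

```python
def grants_vote(voter_log, candidate_log):
    """Raft vote rule: grant only if the candidate's log is at least as
    up-to-date -- higher last term, or same last term and >= length."""
    v_term = voter_log[-1][0] if voter_log else 0
    c_term = candidate_log[-1][0] if candidate_log else 0
    if c_term != v_term:
        return c_term > v_term
    return len(candidate_log) >= len(voter_log)

def wins_election(candidate_log, voter_logs, cluster_size):
    """Candidate votes for itself; wins on a majority of cluster_size.
    cluster_size counts ALL replicas, including the unreachable region."""
    votes = 1 + sum(grants_vote(v, candidate_log) for v in voter_logs)
    return votes >= cluster_size // 2 + 1

# RF=3, one region lost: the up-to-date survivor still reaches 2/3 votes.
up_to_date = [(1, "a"), (1, "b")]
stale      = [(1, "a")]
wins_election(up_to_date, [stale], cluster_size=3)   # True
wins_election(stale, [up_to_date], cluster_size=3)   # False: vote refused
```

Note the quorum denominator stays at the full cluster size even after a region vanishes, which is why RF=3 across three regions tolerates exactly one region failure.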

Trade-offs to manage

The canonical five hazards the Redpanda post enumerates (all paid on every write once the producer uses acks=all):

  1. Cross-region RTT per commit — each acks=all write waits for the quorum round-trip. Regional stretch (60-80 ms) is already significant; transoceanic stretch (150+ ms) is usually impractical for OLTP.
  2. Replication-bandwidth cost — every committed byte is replicated RF−1 times, mostly cross-region at cross-region bandwidth rates. See concepts/cross-region-bandwidth-cost.
  3. Compute overhead — replication is CPU work on every replica.
  4. Client routing sensitivity — mis-routed clients pay cross-region client-to-leader hops; see patterns/client-proximal-leader-pinning.
  5. Consistency vs availability — minority-region writes fail during a partition (Raft fails closed on minority).
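Hazard 1 can be made concrete with a small latency model: an acks=all commit waits for the *fastest majority*, i.e. the leader plus just enough followers to reach quorum. A sketch under simplifying assumptions (persist time and client hop ignored; `quorum_commit_latency` and the RTT figures are hypothetical):

```python
def quorum_commit_latency(follower_rtts):
    """Lower bound on acks=all commit latency: the leader's persist is
    free (0 ms here); commit completes when the k-th fastest follower
    ack arrives, where leader + k followers form a majority."""
    n = len(follower_rtts) + 1          # replica set = followers + leader
    k = (n // 2 + 1) - 1                # follower acks needed beyond leader
    if k == 0:
        return 0.0
    return sorted(follower_rtts.values())[k - 1]

# RF=3 across three regions, leader in us-west-2 (illustrative RTTs):
quorum_commit_latency({"us-east-2": 60.0, "eu-west-2": 140.0})  # 60.0 ms
```

This shows why regional stretch is viable and transoceanic stretch usually is not: with a nearby second region, the far region's 140+ ms RTT never sits on the commit path, but every write still pays at least one cross-region round-trip.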

Composing with proximity optimisations

The pattern composes with four operator knobs that reduce the client-facing cost without weakening the quorum-side property (except acks=1, which explicitly weakens durability):

  • Leader pinning: bias leadership to client-proximal region. Preserves quorum; eliminates client-to-leader cross-region hop.
  • Follower fetching: consumers read from closest replica. Preserves quorum (writes unaffected); introduces bounded read staleness.
  • Remote read replica topic: read-only object-storage-backed mirror cluster. Preserves quorum on the origin; scales read fan-out across clusters.
  • acks=1: relaxes quorum wait to leader-only. Trades durability against region loss for latency. Orthogonal to leader pinning (the two can be composed).

When to use this pattern

Pick multi-region Raft quorum when:

  • RPO = 0 is a hard requirement on the workload — financial ledger, transactional event log, mission-critical durable queue.
  • Regions are close enough (RTT under write-SLA budget) that cross-region quorum is tolerable.
  • Client traffic can be regionally localised (leader pinning + follower fetching) so only the replication-side hops are cross-region.
  • Single-control-plane operational model is preferable to running two independent clusters.

Pick async-replication-for-cross-region (MM2) instead when:

  • Write SLA cannot tolerate cross-region quorum.
  • Per-cluster availability matters more than cross-cluster consistency.
  • Cross-region bandwidth must be held to one replication stream per topic (MM2 sends each record across the WAN once; stretch-cluster sends each record RF−1 times across the WAN).
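The bandwidth contrast in the last bullet reduces to simple arithmetic; a sketch (the `wan_bytes_per_record` helper is hypothetical, and it assumes the stretch cluster places all replicas in distinct regions so every follower fan-out crosses the WAN):

```python
def wan_bytes_per_record(record_bytes, rf=3, mode="stretch"):
    """Cross-region bytes moved per produced record.
    stretch: the leader fans out to RF-1 followers, each in another region.
    mm2:     one async replication stream carries the record once."""
    if mode == "stretch":
        return record_bytes * (rf - 1)
    return record_bytes

wan_bytes_per_record(1024, rf=3, mode="stretch")  # 2048 -- twice over the WAN
wan_bytes_per_record(1024, mode="mm2")            # 1024 -- once over the WAN
```

At RF=3 the stretch cluster pays 2x MM2's WAN traffic per record; the premium buys the RPO=0 property that MM2's asynchronous stream cannot offer.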
