
PLANETSCALE 2020-10-02


PlanetScale — MySQL semi-sync replication: durability, consistency, and split brains

Summary

Shlomi Noach walks through the semantics of MySQL semi-synchronous replication: how it works, what it actually guarantees (durability, not consistency), and the specific topologies under which it still admits split-brain on datacenter isolation. The post uses a canonical "1-n" setup — one primary + n semi-sync replicas with rpl_semi_sync_master_wait_for_slave_count=1 — to enumerate failure modes, and concludes that even with a majority of sites up, a minority quorum of "primary + any one replica" can produce split-brain because semi-sync was designed around per-replica acks, not consensus quorums. The piece ends by pointing toward Sugu Sougoumarane's Consensus Algorithms at Scale series for the "reliable minority consensus" alternative.

Key takeaways

  1. Semi-sync is an asynchronous-replication upgrade, not a consensus protocol. A commit on the primary blocks until rpl_semi_sync_master_wait_for_slave_count semi-sync replicas have persisted the changelog to their relay logs — "received and written", not "applied". Replicas apply later, asynchronously. The changelog sits in the binary log on the primary and the relay log on each replica. "At least that number of replicas have written the changelog onto their relay logs where the change is now persisted. The replicas will apply the change later on, normally as soon as they can." (Source: sources/2026-04-21-planetscale-mysql-semi-sync-replication-durability-consistency-and-split-brains)
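The mechanism Noach describes maps onto a handful of MySQL 5.7-era settings. A minimal sketch of enabling it, assuming the stock 5.7 semisync plugins (plugin names and file extensions vary by platform and version; verify against your build):

```sql
-- On the primary (MySQL 5.7-style plugin; names differ in later versions):
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
-- Commit blocks until this many semi-sync replicas have WRITTEN the event
-- to their relay logs ("received and written", not "applied"):
SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 1;

-- On each semi-sync replica:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;  -- re-register as semi-sync
```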

  2. The infinite-timeout configuration is the durability-first posture. rpl_semi_sync_master_timeout controls how long the primary waits before falling back to asynchronous replication. Noach's discussion deliberately assumes "an 'infinite' timeout, which we will accept to be a very large number" — the default behaviour of "timeout and degrade to async" silently breaks the durability guarantee, which is why operators who care about semi-sync durability tune the timeout effectively to infinity. Canonicalised as patterns/infinite-semi-sync-timeout.
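In MySQL terms, the knob is rpl_semi_sync_master_timeout, in milliseconds, defaulting to 10000. The "effectively infinite" posture is simply a very large value; a sketch (the exact ceiling chosen here is arbitrary):

```sql
-- Default is 10000 ms: after 10s without an ack, the primary silently
-- falls back to async replication and the durability guarantee evaporates.
-- Durability-first posture: make the wait effectively infinite.
SET GLOBAL rpl_semi_sync_master_timeout = 100000000000;  -- ~3 years, in ms

-- Monitor whether fallback has occurred anyway:
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status';  -- ON = semi-sync active
```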

  3. Non-semi-sync replicas can be more up-to-date than semi-sync ones. Semi-sync replicas acknowledge receipt; other replicas pull the binlog as well, at whatever rate they can. "It's possible that a non-semi-sync replica has pulled more changelog data than some or all semi-sync replicas." On failover, operators can either use the more-current non-semi-sync replica as the seed or discard it and use the most-current semi-sync replica — a pure operational trade-off between capacity and recovery time. (Source: sources/2026-04-21-planetscale-mysql-semi-sync-replication-durability-consistency-and-split-brains)
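With classic file+position replication (the MySQL ≤ 5.7 scope of the post), comparing replicas' received coordinates is how an operator would find the most current seed. A hedged sketch:

```sql
-- Run on each surviving replica and compare the coordinates:
SHOW SLAVE STATUS\G
-- Master_Log_File / Read_Master_Log_Pos : how much changelog was RECEIVED
--   into the relay log (the event semi-sync durability counts).
-- Relay_Master_Log_File / Exec_Master_Log_Pos : how much was APPLIED.
-- The replica (semi-sync or not) with the highest received coordinates
-- is the most up-to-date candidate seed for failover.
```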

  4. "If the primary tells you a commit is successful, then the data is durable elsewhere" — that's the only promise. Pre-ack state is explicitly undefined: if an UPDATE is on the primary but no one else yet, and the primary crashes before ack, the user cannot distinguish between "commit succeeded but ack lost in transit" and "commit never happened". This is a feature of the contract, not a defect — consistent with how async replication frames the pre-ack window. See concepts/durability-vs-consistency-guarantee for the canonicalised shape.

  5. Semi-sync guarantees durability but NOT consistency after DC isolation. Noach's canonical counterexample: primary + R1 are both in DC-A; R2/R3/R4 are in DC-B. An UPDATE arrives, R1 acks quickly (same-DC latency), primary commits and tells user "OK". Then DC-A network-isolates. "The loss of network cut short the delivery of the change log to R2, R3, R4. They never got it. Should we promote either R2, R3, or R4, we lose consistency. The app expects the result of that UPDATE to be there, but neither of these servers have the data." The data is durable (on primary + R1), but inaccessible — and a predefined failover plan may accept that data loss to cut outage time. (Source: sources/2026-04-21-planetscale-mysql-semi-sync-replication-durability-consistency-and-split-brains)

  6. Losing ≥ rpl_semi_sync_master_wait_for_slave_count semi-sync replicas destroys the consistency argument. If the primary plus the lone ack'ing replica go down together, you cannot tell whether the remaining semi-sync replicas are current. "If the number of lost semi-sync replicas is equal to, or greater than rpl_semi_sync_master_wait_for_slave_count, then we do not know whether the remaining replicas are consistent." The only way to know is to reach the primary's binlog; if you can't, you're guessing. (Source: sources/2026-04-21-planetscale-mysql-semi-sync-replication-durability-consistency-and-split-brains)

  7. Split-brain is possible even when the majority of sites are up. The core finding: in the 1-n setup with 5+ sites, losing 2 sites can still produce split-brain. "In our '1-n' scenario, we have a quorum of two servers out of five or more. The primary, with a single additional replica, are able to form a quorum and to accept writes. That's how we got to have a split brain. While R2, R3, R4 form a majority of the servers, writes took place without their agreement." Canonicalised as concepts/minority-quorum-writeability.
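The structural condition behind this is plain arithmetic: two write-quorums can be disjoint whenever twice the quorum size fits within the server count. A toy check, using the post's numbers (this is an illustration, not any MySQL feature):

```sql
-- quorum = primary + rpl_semi_sync_master_wait_for_slave_count acks = 2
-- n = 5 servers. Disjoint quorums exist iff 2 * quorum <= n,
-- i.e. split-brain is structurally possible:
SELECT (2 * 2 <= 5) AS split_brain_possible;  -- 1 (true)
-- A consensus-style majority quorum would need quorum >= 3 here:
SELECT (2 * 3 <= 5) AS split_brain_possible;  -- 0 (false)
```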

  8. Paxos/Raft readers will find this surprising — but it's correct. Noach closes by noting that the setup baffles consensus-familiar engineers: "People familiar with Paxos and Raft consensus protocols may find this baffling. However, reliable minority consensus is achievable, and Sugu Sougoumarane's Consensus Algorithms series of posts continues to describe this." The series is the sysdesign-wiki's authoritative treatment of minority-safe consensus on top of MySQL primitives.

  9. DC placement is the actual failure-tolerance knob. Noach walks through several deployment shapes — semi-sync replicas in the primary's DC (fast but DC-isolation-hostile), semi-sync replicas only in remote DCs (slower writes but durable through a DC outage), and mixed placements — each trading write latency against post-failure options. The abstract: "The geo-distribution of our servers plays a key part in how tolerant our system is for failure and what outcomes we can expect." Canonicalised as patterns/cross-dc-semi-sync-for-durability.

  10. Recovery from DC isolation is a choice between downtime, data loss, and split-brain. Three options are laid out: wait out the outage (downtime), promote a remote replica (risk inconsistency + split-brain if old DC comes back), or do controlled reparenting with fenced old primary. Each has its own probability-times-cost arithmetic that operators should pre-compute.
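The "fenced old primary" in option three has a concrete MySQL shape. A minimal sketch using real 5.7.8+ variables (production tooling such as Vitess or orchestrator additionally kills connections and repoints discovery):

```sql
-- On the old primary, before (or as soon as it is reachable after)
-- promoting a replica elsewhere — refuse all writes, even SUPER ones:
SET GLOBAL super_read_only = 1;  -- MySQL 5.7.8+; implies read_only
SET GLOBAL read_only = 1;
-- Without fencing, clients that still reach the old primary keep writing
-- to it, which is exactly the split-brain hazard the post describes.
```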

Systems extracted

  • MySQL — the host system; the semi-sync plugin and its rpl_semi_sync_master_* variables are MySQL-specific.
  • Vitess — the production substrate where PlanetScale actually manages the hazards discussed (via PRS/ERS reparenting, vtgate query buffering, anti-flapping).
  • PlanetScale — the author's employer; post is tagged as a PlanetScale blog article.

Concepts extracted

  • MySQL semi-sync replication (NEW) — the canonical concept page for the mechanism itself.
  • Semi-sync timeout fallback (NEW) — the silent-degradation-to-async behaviour controlled by rpl_semi_sync_master_timeout.
  • Durability vs. consistency guarantee (NEW) — the orthogonal-axes distinction Noach relies on throughout: semi-sync buys one, not the other.
  • Minority-quorum writeability (NEW) — the structural reason semi-sync admits split-brain despite most sites being up: 2 of 5 is enough to commit, but not enough to preclude a second leader.
  • Split-brain (EXISTING, updated) — cross-linked; semi-sync is a specific manifestation substrate.
  • MySQL semi-sync split-brain (EXISTING, updated) — Sugu's crash-restart flavour; this post is the canonical framing of the topology-induced flavour.
  • Asynchronous replication (EXISTING, updated) — the substrate semi-sync upgrades; the pre-ack-lost window is the same.
  • Binlog replication (EXISTING, updated) — the log semi-sync synchronises on.

Patterns extracted

  • Cross-DC semi-sync for durability (NEW) — the deployment shape where semi-sync replicas live outside the primary's DC so commits are durable to outside-DC storage before acking.
  • Infinite semi-sync timeout (NEW) — the durability-first configuration that disables silent fallback to async.
  • Pluggable durability rules (EXISTING, cross-linked) — Sugu's framing of the general architectural response to exactly the inflexibility Noach describes.

Operational numbers / configuration facts

| Setting / log | Role |
| --- | --- |
| rpl_semi_sync_master_wait_for_slave_count | Number of replicas that must ack before commit returns. Must be ≥ 1 to use semi-sync. |
| rpl_semi_sync_master_timeout | How long the primary waits for acks before falling back to async. Durability-critical deployments set this effectively to infinity. |
| Binary log | Sequential, per-transaction-ordered log on the primary. An ack of any event implies acks of all prior events. |
| Relay log | Per-replica log where received (but not yet applied) events live. Durability is satisfied once an event hits the relay log; apply is separate. |

The key numerical fact Noach lands on: in a 5-site "1-n" deployment, 2-site outages can induce split-brain despite 3 sites being up. No specific probability is quoted; Noach instead asks readers to do the vendor-specific arithmetic.

Caveats

  • Scoped to MySQL ≤ 5.7 or equivalent. Noach explicitly narrows the discussion: "We limit our discussion to MySQL 5.7 or equivalent." Later MySQL versions (group replication, InnoDB cluster) shift the picture somewhat.
  • Assumes replicas are not deliberately stopped. Delayed-replica and stopped-replica scenarios are out of scope.
  • No numerical availability model. Noach presents the shape of the trade-offs but leaves the probability math (two-site-outage rate × outage-duration × failover-cost) to readers. This is a feature of the post — the arithmetic is vendor-specific.
  • Split-brain-recovery cost is gestured at, not quantified. Fixing divergent replication trees is described as "difficult, and normally we revert the changes on one to make it look like the other"; the post doesn't dig into gh-mysql-rewind-style operational tooling in depth.
  • The "reliable minority consensus" pointer is a cliffhanger. Noach directs readers to Sugu Sougoumarane's series for the solution shape; the pointers are correct — the sysdesign-wiki has the full Consensus Algorithms at Scale series ingested, starting with Part 1 on lock-based-over-lock-free trade-offs.
