PlanetScale — Consensus algorithms at scale: Part 3 — Use cases¶
Summary¶
Third instalment of Sugu Sougoumarane's Consensus algorithms at scale series on the PlanetScale blog (originally 2020-09-26; re-fetched via RSS 2026-04-21; Sugu is Vitess co-creator, ex-YouTube, PlanetScale CTO at the time). Part 3 is the durability-as-use-case-dependent instalment — it takes the framework established in Parts 1-2 (practical-scenario-ready consensus with single-leader narrowing and durability-agnostic rules) and exercises it on four concrete production deployment shapes that are all "uncomfortable for a majority-based consensus system." The essay's load-bearing move: decouple the durability requirement from the election algorithm via a single predicate over node sets, so that the election algorithm's reachability requirement derives mechanically from that predicate through intersecting-quorum arithmetic.
The four use-case configurations (framed verbatim as "loosely derived from real production workloads"):
- Many-replica, fifteen-primary, three-DC — "We have a large number of replicas spread over many data centers. Of these, we have fifteen leader capable nodes spread over three data centers. We don't expect two nodes to go down at the same time. Network partitions can happen, but only between two data centers; a data center will never be totally isolated. A data center can be taken down for planned maintenance."
- Four zones, one node each — "We have four zones with one node in each zone. Any node can fail. A zone can go down without notice. A partition can happen between any two zones."
- Six-node, three-zone — "We have six nodes spread over three zones. Any node can fail. A zone can go down without notice. A partition can happen between any two zones."
- Two-region, two-zones-per-region — "We have two regions, each region has two zones. We don't expect more than one zone to go down. A region can be taken down for maintenance, in which case we want to proactively transfer writes to the other region."
Sugu's observation: "These configurations are all uncomfortable for a majority based consensus system." Traditional Paxos/Raft hard-codes a majority-quorum rule that doesn't map naturally onto any of these shapes. More importantly — users want to experiment with even more creative combinations, which the fixed-majority assumption structurally prevents: "these flexibilities will encourage users to experiment with even more creative combinations and allow them to achieve better trade-offs."
The mechanism that unifies all four: the Request algorithm and the Election algorithm share a common view of the durability requirements but can otherwise operate independently. Specifically:
- If durability is specified as k nodes, the Request path acks the client after k acks.
- The Election path, to guarantee safety, must reach (N − k + 1) nodes to intersect with any possible durability-satisfying write set.
- For majority quorum (k = ⌊N/2⌋ + 1), both paths reach ⌊N/2⌋ + 1 nodes — a symmetric special case.
- For k = 2 on N = 5: request waits for 2 acks, election reaches 4 nodes — asymmetric but correct.
The essay shows the worked 2-of-5 case in detail: "if durability is achieved with 2/5 nodes, then the election algorithm needs to reach 4/5 nodes to intersect with the durability criteria. In the case of a majority quorum, both of these are 3/5. But our generalization will work for any arbitrary property." This is the FlexPaxos intersecting-quorum generalisation stated in plain architectural English, two years before Part 8 would cite Howard/Malkhi/Spiegelman's 2017 paper by name. (Part 3 is the first time the generalisation appears in the series.)
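The intersecting-quorum arithmetic above can be checked mechanically. A minimal sketch (not from the essay; function names are illustrative) that computes the election reach N − k + 1 and brute-forces the intersection property over all possible write sets and election sets:

```python
from itertools import combinations

def election_quorum(n: int, k: int) -> int:
    """Minimum nodes an election must reach so that every election
    quorum intersects every possible k-node durable write set."""
    return n - k + 1

def quorums_always_intersect(n: int, k: int) -> bool:
    """Brute-force verification: each k-subset (a candidate durable
    write set) shares at least one node with each (n - k + 1)-subset
    (a candidate election reach set)."""
    nodes = range(n)
    reach = election_quorum(n, k)
    return all(
        set(write_set) & set(election_set)
        for write_set in combinations(nodes, k)
        for election_set in combinations(nodes, reach)
    )

# The essay's worked example: N = 5, k = 2 -> election must reach 4.
assert election_quorum(5, 2) == 4
assert quorums_always_intersect(5, 2)
# Majority special case: k = 3 on N = 5 -> both quorums are 3.
assert election_quorum(5, 3) == 3
assert quorums_always_intersect(5, 3)
```

The symmetric 3/5 case falls out as a special point on the same line: lowering k speeds the request path but widens what the election must reach, and correctness only requires the two sets to overlap on at least one node.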
The failure-tolerance envelope: once you define the envelope ("we don't expect two nodes to fail at the same time"), everything else derives. On 5 nodes, two-node failure exceeds the envelope → only three nodes reachable → election has to assume the worst case that the two unreachable nodes hold a durable-but-not-yet-propagated write → election stalls → operators must manually compromise: "abandon the two nodes and move forward. Otherwise, the loss of availability may become more expensive than the potential loss of that data." The availability-vs-data-loss trade-off is an operational decision that the protocol should expose rather than pre-decide.
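The stall condition derives from the same arithmetic. A sketch (illustrative, not the essay's code) of the decision the election faces when the failure envelope is exceeded:

```python
def election_can_proceed(n: int, k: int, reachable: int) -> bool:
    """Safe to elect only if the election can rule out a durable
    write hiding entirely on the unreachable nodes, i.e. it can
    reach at least n - k + 1 nodes."""
    return reachable >= n - k + 1

# Within the envelope: one of five nodes down, 4 reachable -> proceed.
assert election_can_proceed(5, 2, 4)
# Envelope exceeded: two nodes down, only 3 reachable. The two
# unreachable nodes could jointly hold a durable 2-ack write, so
# the election stalls; proceeding anyway is the operator's manual
# availability-vs-data-loss compromise, not the protocol's call.
assert not election_can_proceed(5, 2, 3)
```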
Cross-boundary durability as a mitigation: a two-node durability on a single-DC topology is structurally vulnerable to the rare but real "leader + one acker both crash simultaneously after the ack but before propagating to the rest" failure mode. Mitigation: "require the ackers to live across network boundaries." The underlying node-count is unchanged (still k = 2) but the predicate is tightened from "any 2 nodes" to "2 nodes across network boundaries" — and the rarity of cross-boundary correlated failure makes this acceptable in practice: "This failure mode is rare enough that many users treat this level of risk as acceptable."
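Predicate tightening can be made concrete with a small sketch. The `Node`/`cell` shapes below are hypothetical illustrations of the idea (not a Vitess API): both predicates keep k = 2, but the second also requires the ackers to span network boundaries:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    cell: str  # hypothetical: the network boundary the node lives in

def any_two(ackers: set[Node]) -> bool:
    """Loose predicate: durability is any 2 acks, wherever they live."""
    return len(ackers) >= 2

def two_across_boundaries(ackers: set[Node]) -> bool:
    """Tightened predicate: still k = 2, but the ackers must span at
    least two cells, ruling out the single-cell correlated crash of
    leader plus sole acker."""
    return len(ackers) >= 2 and len({n.cell for n in ackers}) >= 2

a, b = Node("a", "cell-1"), Node("b", "cell-1")
c = Node("c", "cell-2")
assert any_two({a, b}) and any_two({a, c})
assert not two_across_boundaries({a, b})  # same cell: rejected
assert two_across_boundaries({a, c})      # cross-boundary: accepted
```

The request path still waits for only two acks, so the performance profile is unchanged; only the shape of an acceptable ack set tightens.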
The request-vs-election frequency asymmetry — the load-bearing design principle: "The most common operation performed by a consensus system is the completion of requests. In contrast, a leader election generally happens in two cases: taking nodes down for maintenance, or upon failure. Even in a dynamic cloud environment like Kubernetes, it would be surprising to see more than one election per day for a cluster, whereas such a system could be serving hundreds of requests per second. That amounts to many orders of magnitude in difference between a request being fulfilled and a leader election." Canonical conclusion: "This means that we must do whatever it takes to fine tune the part that executes requests, whereas leader elections can be more elaborate and slower. This is the reason why we have a bias towards reducing the durability settings to the bare minimum."
YouTube production datum canonicalised: "At YouTube, although the quorum size was big, a single ack from a replica was sufficient for a request to be deemed completed. On the other hand, the leader election process had to chase down all possible nodes that could have acknowledged the last transaction. We did consciously trade off on the number of ackers to avoid going on a total wild goose chase." Four decisions bundled into one disclosure: (a) single-ack completion was the operational default on the request path despite a larger quorum membership; (b) election explicitly ran wider — "chase down all possible nodes that could have acknowledged"; (c) the election-width knob was consciously bounded so election wouldn't become a "total wild goose chase"; (d) the trade-off axis is explicit — request throughput + tail latency vs election-time completeness.
Architectural contribution to the wiki: Part 3 provides the foundational argument that the two architectural recommendations of Part 8 — pluggable durability and the yes-consensus-with-separation posture — rest on. This post is where "durability is use-case dependent, we made it an abstract requirement requiring the consensus algorithms to assume nothing about the durability requirements" first becomes load-bearing. All subsequent instalments of the series (Parts 4-8) compose atop this foundation. Sixth canonical Sugu-series disclosure on the wiki (after Parts 4, 5, 6, 7, 8), filling the gap between the series' opening (Parts 1-2, not yet ingested) and the revocation/establishment material (Part 4).
Key takeaways¶
- Durability is use-case dependent, not protocol-fixed. "Durability is use-case dependent, we made it an abstract requirement requiring the consensus algorithms to assume nothing about the durability requirements." (Source: sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-3-use-cases). Canonical wiki framing of the durability-as-use-case-dependent concept. The consensus protocol exposes a predicate over node-set shape; the deployment provides the specific predicate (node count, zone count, region count, cross-boundary constraint) per workload. The protocol does not decide durability for the operator.
- Four production shapes that majority-quorum cannot express cleanly. The fifteen-primary-three-DC, four-zone-single-node, six-node-three-zone, and two-region-two-zones-per-region shapes are each "uncomfortable for a majority based consensus system" — their operational reality (e.g., "we don't expect two nodes to fail simultaneously" as the failure envelope; DC-maintenance as a planned event; region-maintenance with proactive write transfer) is not expressible in the N-of-M arithmetic Paxos/Raft is built around. Canonical wiki datum that the fixed-majority assumption is a modelling mismatch, not an engineering trade-off.
- Common view of durability, independent operation of Request and Election. "There is a way to reason about why this flexibility is possible. This is because the two cooperating algorithms (Request and Election) share a common view of the durability requirements, but can otherwise operate independently." Canonical wiki decomposition: durability predicate P is the single source of truth shared between request (satisfy P to ack) and election (reach enough nodes to ensure intersection with any set that satisfies P). Everything else about the two algorithms is orthogonal.
- Intersecting quorums in plain English: 2-of-5 durability ↔ 4-of-5 election reach. "If durability is achieved with 2/5 nodes, then the election algorithm needs to reach 4/5 nodes to intersect with the durability criteria. In the case of a majority quorum, both of these are 3/5. But our generalization will work for any arbitrary property." Canonical wiki worked example: for N = 5 and k = 2, the election quorum (N − k + 1) = 4 is wider than majority (3) — the flexibility to lower the write quorum comes at the cost of a wider read/election reach, and the system is correct as long as the two intersect on ≥ 1 node. This is the FlexPaxos generalisation — stated here 2 years before Part 8 cites the paper by name.
- Out-of-envelope failures force an operational compromise. "In the above five node case, if two nodes fail, the failure tolerance has been exceeded. We can only reach three nodes. If we don't know about the state of the other two nodes, we will have to assume the worst case scenario that a durable request could have been accepted by the two unreachable nodes. This will cause the election process to stall. If this were to happen, the system has to allow for a compromise: abandon the two nodes and move forward. Otherwise, the loss of availability may become more expensive than the potential loss of that data." Canonical wiki datum of the availability-vs-data-loss trade-off: the protocol cannot make this decision — the operator does, and the protocol should expose the knob.
- Cross-boundary durability as a targeted mitigation without raising the node count. "This type of failure can happen if the leader and the recipient node are network partitioned from the rest of the cluster. We can mitigate this failure by requiring the ackers to live across network boundaries. The likelihood of a replica node in one cell failing after an acknowledgment, and a master node failing in the other cell after returning success, is much lower." Canonical wiki datum that the durability predicate itself is the correct place to encode cross-boundary constraints — predicate tightening beats node-count inflation. Keeps the same k = 2 performance profile on the request path.
- Request/election frequency asymmetry is orders of magnitude — this is the optimisation axis. "Even in a dynamic cloud environment like Kubernetes, it would be surprising to see more than one election per day for a cluster, whereas such a system could be serving hundreds of requests per second. That amounts to many orders of magnitude in difference between a request being fulfilled and a leader election. This means that we must do whatever it takes to fine tune the part that executes requests, whereas leader elections can be more elaborate and slower." Canonical wiki articulation of the common-case-frequency-asymmetry pattern at the consensus-protocol altitude. Request path = per-second cost = tail-latency-sensitive; election path = daily-event cost = correctness-sensitive, slower OK.
- Bias toward reducing durability to the bare minimum. "This is the reason why we have a bias towards reducing the durability settings to the bare minimum. Expanding this number can adversely affect performance, especially the tail latency." Canonical wiki datum: more ackers = worse tail latency (slowest-of-N effect). The pragmatic default is the smallest k that satisfies operational-risk tolerance. The pattern generalises beyond consensus into any multi-replica ack path where the slowest replica is on the critical path.
- "I have not seen anyone ask for a durability requirement of more than two nodes." Canonical wiki operational datum on the observed upper bound of production durability rules: "I have not seen anyone ask for a durability requirement of more than two nodes. But this may be due to difficulties dealing with corner cases that MySQL introduces due to its semi-sync behavior. On the other hand, these settings have served the users well so far. So, why become more conservative?" Sugu's aside flags two things: (a) the observed-upper-bound may be an artifact of MySQL semi-sync's corner-case-handling cost, not a fundamental engineering preference; (b) even so, k = 2 has served users well, so there is no engineering pressure to go higher.
- YouTube: single-ack completion, wide election reach, conscious bounded search. "At YouTube, although the quorum size was big, a single ack from a replica was sufficient for a request to be deemed completed. On the other hand, the leader election process had to chase down all possible nodes that could have acknowledged the last transaction. We did consciously trade off on the number of ackers to avoid going on a total wild goose chase." Canonical wiki production-instance disclosure: YouTube ran k = 1 on the request path despite a big replica set, and paid for it with a wider election-time scan. Canonical instance of the single-ack-completion-with-wider-election pattern. The "conscious trade-off on the number of ackers" clause is the rare first-person disclosure that the election-time scan width was bounded by design, not unbounded.
- The ack-recipient rare-crash failure mode is explicit. "A two-node durability does not always mean that the system will stall or lose data. A very specific sequence of failures have to happen: Leader accepts a request. Leader attempts to send the request to multiple recipients. Only one recipient receives and acknowledges the request. Leader returns a success to the client. Both the leader and that recipient crash." Canonical wiki failure-mode walkthrough: k = 2 durability is only vulnerable to this exact sequence of events, which is how Sugu frames the rarity argument. In any other failure sequence, either (a) the leader never acks the client (so no data was promised) or (b) at least one non-crashed node holds the write.
- Durability-as-plugin is the structural consequence of the framework's flexibility. "If there was no need to worry about a majority quorum, we would have the flexibility to deploy any number of nodes we require. We can designate any subset of those nodes to be eligible leaders, and we can make durability decisions without being influenced by the above two decisions. This is exactly what many users have done with Vitess." Canonical wiki anchor that "this is exactly what many users have done with Vitess" is the series' first explicit naming of Vitess as the worked production instance of the framework — foreshadowing Part 8's explicit pluggable-durability recommendation by 5 instalments.
Systems¶
- Vitess — the canonical worked production instance of durability-as-plugin. "This is exactly what many users have done with Vitess." First explicit Vitess naming in the series.
- MySQL — the replication substrate whose semi-sync behaviour creates the corner-case-handling pain that (per Sugu) may explain the observed upper bound on production durability rules.
- PlanetScale — publisher of the series; Sugu was PlanetScale CTO at time of writing.
- YouTube (referenced) — production instance of the single-ack-completion + wide-election-reach trade-off. No dedicated systems/youtube.md page; referenced informally.
Concepts¶
- Durability as use-case dependent (new) — durability is an operator-specified predicate over node sets, not a protocol-fixed constant.
- Intersecting quorums (new) — the FlexPaxos generalisation that lets write-quorum k and read/election-quorum (N − k + 1) vary independently as long as they intersect on ≥ 1 node.
- Failure-tolerance envelope (new) — the operator-specified bound on concurrent failures the system is engineered to survive; out-of-envelope failures force a manual availability-vs-data-loss compromise.
- Availability-vs-data-loss trade-off (new) — when failure tolerance is exceeded, stalling the system preserves correctness; abandoning unreachable nodes preserves availability. The protocol exposes the knob; the operator chooses.
- No distributed consensus — existing wiki page; Part 3's contribution is the yes-consensus side of the same axis, showing that consensus with a pluggable durability predicate can accommodate deployment shapes that majority-quorum cannot.
- Durable request — existing wiki page (from Part 6); Part 3 establishes that durability is satisfied when the operator-defined predicate is met, not necessarily when a majority acks.
- Minority-quorum writeability — existing wiki page (from Noach's semi-sync post); Part 3 reframes this as not a failure mode but a legitimate design point when paired with an election predicate that compensates with wider reach.
- Durability vs consistency guarantee — existing wiki page (from Noach's semi-sync post); Part 3's framework ensures that a durability-predicate + election-predicate pair satisfying the intersection property delivers both.
Patterns¶
- Pluggable durability rules — existing pattern (canonicalised from Part 8); Part 3 is the foundational argument for the pattern: the four use-case configurations are the motivating reason why hard-coded majority-quorum is the wrong shape.
- Optimize for common-case frequency asymmetry (new) — generalises the "request-path per-second vs election-path per-day, orders of magnitude apart" framing into a pattern that applies beyond consensus to any bifurcated-cost-axis architecture.
- Single-ack completion with wider election reach (new) — the canonical YouTube production instance: k = 1 on the request path paired with election-time scan of "all possible nodes that could have acknowledged the last transaction" and a conscious bound on election width to avoid unbounded search.
Operational numbers¶
- "Hundreds of requests per second" per cluster (order-of-magnitude, typical).
- "More than one election per day for a cluster would be surprising" (upper bound, even in dynamic Kubernetes deployments).
- Clock skew between election events / node failures: not disclosed here; Part 5 gives the "milliseconds of skew, many seconds of sequencing granularity" rule.
- k = 2 is the observed upper bound on production durability rules (Sugu-observed; attributed in part to MySQL semi-sync corner-case pain).
- N = 5 worked example: durability k = 2 → election reach N − k + 1 = 4; majority quorum = 3 for both.
- YouTube: single-ack completion, wide-scan election; election-width "consciously" bounded (exact bound not disclosed).
- No disclosed numbers for: actual quorum size at YouTube, actual clusters running each of the four deployment shapes, measured tail-latency deltas per durability-rule choice, election-scan width distribution.
Caveats¶
- Theoretical / conceptual post, not a production retrospective. No single worked production incident; four deployment-shape examples are "loosely derived from real production workloads" — the "loosely" disclaims exact-match accuracy.
- The YouTube single-ack-completion disclosure is the only first-person operational datum. The rest of the post operates on abstract configurations; deployment-shape specifics (which company ran the four-zone-single-node shape? the two-region-two-zone-per-region shape?) are not named.
- The FlexPaxos paper is not cited by name in Part 3. Sugu states the intersecting-quorum generalisation in plain English without the academic-citation framing. Part 8 cites Howard/Malkhi/Spiegelman 2017 explicitly.
- The post defers failure-detection, node-membership, and election-race handling to later instalments — consistent with the series' incremental-exposition style. Part 5 handles races; Part 8 explicitly scopes failure-detection and node-membership as out of scope for the consensus core.
- "Loss of availability may become more expensive than the potential loss of that data" is not a blanket recommendation to sacrifice durability. It is a statement that the operator must decide. Systems like financial ledgers never abandon unreachable nodes; systems like cache-coherence fabrics might prefer stalling briefly; the framework accommodates both.
- The post predates and does not anticipate all of Part 8's Vitess+VTOrc worked composition. Some phrasing ("this is exactly what many users have done with Vitess") is retrospectively cleaner than was possible in 2020 when VTOrc was still being built out.
Series context¶
This is Part 3 of an 8-part series by Sugu Sougoumarane. Published 2020-09-26 originally; re-fetched by the RSS poller on 2026-04-21. All remaining instalments are now on the wiki:
- Part 1 — Consensus algorithms at scale. Original consensus properties (Paxos) and why practical scenarios need modification. Not yet on the wiki.
- Part 2 — Practical scenarios. Practical-scenario-ready consensus modified to accept a series of requests; single-leader narrowing; durability-agnostic rules. Not yet on the wiki.
- Part 3 (this post) — Use cases. Four deployment shapes incompatible with majority-quorum; intersecting-quorum generalisation stated in plain English; request/election frequency asymmetry.
- Part 4 — Establishment and revocation. Revoke + establish separated as first-class concerns; Vitess PRS/ERS as worked instance.
- Part 5 — Handling races. Elector vs candidate; lock-based vs lock-free race resolution; VTOrc as the canonical elector.
- Part 6 — Completing requests. Tentative/durable/complete three-stage model; early-ack-on-durability.
- Part 7 — Propagating requests. Request propagation across leadership changes; versioning as the conflict-resolution primitive.
- Part 8 — Closing thoughts. Capstone essay; pluggable-durability and lock-based-at-scale canonicalised as the two architectural recommendations; FlexPaxos cited by name.
Part 3 is in the wiki's middle tier of the series — foundational enough that Parts 4-8 rest on its framework, but dependent on the not-yet-ingested Parts 1-2 for the "why are we doing this in the first place?" framing. The wiki has adequate context to treat Part 3 on its own terms; re-scraping would complete the series.
Source¶
- Original: https://planetscale.com/blog/consensus-algorithms-at-scale-part-3
- Raw markdown:
raw/planetscale/2026-04-21-consensus-algorithms-at-scale-part-3-use-cases-89ecb345.md
Related¶
- systems/vitess — the canonical worked production instance of durability-as-plugin.
- systems/mysql — semi-sync substrate whose corner-case behaviour is (per Sugu) the likely cause of the observed k ≤ 2 upper bound.
- systems/planetscale — publisher; Sugu was CTO.
- concepts/durability-as-use-case-dependent — new concept canonicalising the foundational claim.
- concepts/intersecting-quorums — new concept canonicalising the FlexPaxos generalisation (stated 2 years before Part 8 cites the paper by name).
- concepts/failure-tolerance-envelope — new concept canonicalising the operator-specified bound on concurrent failures.
- concepts/availability-vs-data-loss-tradeoff — new concept canonicalising the out-of-envelope compromise.
- concepts/no-distributed-consensus — structural contrast on the same design axis.
- concepts/durable-request — definition depends on the durability predicate Part 3 canonicalises.
- concepts/minority-quorum-writeability — reframed by Part 3 as a legitimate design point when paired with wider election reach.
- concepts/durability-vs-consistency-guarantee — the two axes the intersecting-quorum generalisation navigates.
- patterns/pluggable-durability-rules — Part 3 is the foundational argument; Part 8 canonicalises as a named pattern.
- patterns/optimize-for-common-case-frequency-asymmetry — new pattern generalising the request/election frequency-asymmetry framing.
- patterns/single-ack-completion-with-wider-election — new pattern canonicalising the YouTube production instance.
- companies/planetscale — publisher company page.