PLANETSCALE 2022-07-07

PlanetScale — Consensus algorithms at scale: Part 8 - Closing thoughts

Summary

Closing instalment of Sugu Sougoumarane's Consensus algorithms at scale series on the PlanetScale blog (originally 2022-07-07; Sugu is Vitess co-creator, ex-YouTube, PlanetScale CTO). Part 8 is a capstone essay — no new protocol disclosure, but it consolidates the seven-part framework into a single architectural position and names the two load-bearing design decisions Sugu recommends for large-scale production systems: pluggable durability and lock-based over lock-free leader election.

The opening move is to reject the framing that Paxos and Raft are foundational to consensus systems: "We started off this series by challenging the premise that algorithms like Paxos and Raft are foundational to consensus systems. Such a premise would imply that any other algorithm would just be a variation of the original ones. These algorithms are foundational from a historical perspective, but they are not conceptually foundational." Sugu's position is that the conceptual primitives — durability requirement, revoke/establish split, race handling, completion protocol, propagation — are foundational; Paxos and Raft are two particular compositions of those primitives, not the only valid compositions. FlexPaxos is cited as the first crack in the traditional framing: "FlexPaxos was the first advancement that highlighted that the majority quorum is just a special case of intersecting quorums. And intersecting quorums would allow you to configure systems with more flexibility." The series' project was to take that crack all the way to the conceptual bedrock.
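
The intersecting-quorums generalisation is small enough to check mechanically. A minimal sketch (illustrative code, not from the post or the FlexPaxos paper) of the one property the generalisation preserves: every durability quorum must intersect every revocation quorum, and the majority quorum is just the symmetric special case:

```python
from itertools import combinations

def majority_quorums(nodes):
    """All minimal majorities: subsets of size floor(n/2) + 1."""
    k = len(nodes) // 2 + 1
    return [set(c) for c in combinations(nodes, k)]

def quorums_intersect(write_quorums, revoke_quorums):
    """FlexPaxos-style safety check: every durability (write) quorum must
    share at least one node with every revocation (election) quorum."""
    return all(w & r for w in write_quorums for r in revoke_quorums)

nodes = ["a", "b", "c", "d", "e"]

# Classic case: majorities always intersect with themselves.
maj = majority_quorums(nodes)
assert quorums_intersect(maj, maj)

# The asymmetric case: size-2 write quorums are safe as long as
# revocation quorums have size 4, because 2 + 4 > 5.
writes = [set(c) for c in combinations(nodes, 2)]
revokes = [set(c) for c in combinations(nodes, 4)]
assert quorums_intersect(writes, revokes)

# But size-2 writes and size-3 revocations can miss each other entirely.
assert not quorums_intersect(writes, [set(c) for c in combinations(nodes, 3)])
```

The majority rule is recovered when both families are the same set of majorities; the flexibility comes from letting the two families differ.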

Two architectural recommendations are canonicalised. First, durability should be a plugin (not hard-coded into the protocol): "A consensus system can be designed in such a way that it assumes nothing about the durability rules. These can be specified with a plugin, and the system should be able to fulfill these requirements without breaking integrity. Of course, the requirements have to be reasonable." This is the practical payoff of FlexPaxos's intersecting-quorums generalisation — a cross-zone or cross-region durability rule can be expressed without touching the consensus core. Canonical new patterns/pluggable-durability-rules pattern. Structural consequence: "A system that supports pluggable durability allows you to deploy additional nodes to the system without majorly affecting its performance characteristics. For example, if you had specified the durability requirement as cross-zone, deploying additional nodes to a zone keeps the system behaving mostly the same way." Topology changes decouple from performance under a durability-plugin substrate — adding replicas in an already-covered zone is free; adding a new zone updates the plugin rather than the protocol.
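
A hedged sketch of what a durability-plugin boundary could look like (names and API here are illustrative, not the real vtorc plugin interface): the consensus core asks only whether a set of acks satisfies the rule, and the rule itself is configuration:

```python
# Illustrative sketch of a pluggable durability rule. The consensus core
# asks one question of the plugin: "is this set of acks durable?"
from typing import Callable, Dict, Set

DurabilityRule = Callable[[Set[str]], bool]  # acked node ids -> durable?

def cross_zone(topology: Dict[str, str], zones_required: int) -> DurabilityRule:
    """Durable once acks span at least `zones_required` distinct zones."""
    def rule(acked: Set[str]) -> bool:
        return len({topology[n] for n in acked}) >= zones_required
    return rule

topology = {"n1": "us-east-1a", "n2": "us-east-1a",
            "n3": "us-east-1b", "n4": "us-east-1b"}
rule = cross_zone(topology, zones_required=2)

assert not rule({"n1", "n2"})   # two acks, but same zone: not durable
assert rule({"n1", "n3"})       # two zones covered: durable

# Topology elasticity: adding n5 to an already-covered zone changes
# nothing about what counts as durable; only the zone map grows.
topology["n5"] = "us-east-1a"
assert rule({"n2", "n4"})
```

Under this shape, switching from cross-zone to cross-region durability is a change to the plugin's configuration, not to the consensus core.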

Second, lock-based approaches dominate lock-free approaches at scale for four reasons. "In general, a lock-free approach (like what Paxos uses) has elegance from the fact that it does not have a time component. However, lock-based approaches offer so many other flexibilities that they win out in real-life scenarios; With lock-based approaches, you can: (1) Perform graceful leadership changes by requesting the current leader to step down. (2) Although I didn't cover this topic, it is easier to add or remove nodes in a system. (3) You can perform consistent reads by redirecting the read to the current leader. (4) You can implement anti-flapping rules." Canonical new patterns/lock-based-over-lock-free-at-scale pattern — closes a forward-reference from Part 5 where Sugu recommended lock-based for large-scale systems without yet enumerating the four advantages. "Due to all these advantages, most large-scale systems implement a lock-based approach."
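
A minimal sketch of advantage (1), the graceful path under a lock-based scheme (illustrative, not Vitess code): the lease gives revocation a time component, and a planned change skips the wait entirely by asking the holder to release its own lease:

```python
import time

class LeaderLock:
    """Illustrative lock-based leadership: a time-bounded lease."""

    def __init__(self, lease_seconds: float):
        self.lease = lease_seconds
        self.holder = None
        self.expires = 0.0

    def acquire(self, node: str, now: float) -> bool:
        # Revocation by timeout: a new leader may establish itself only
        # if the lock is free or the previous lease has expired.
        if self.holder is None or now >= self.expires:
            self.holder, self.expires = node, now + self.lease
            return True
        return False

    def step_down(self, node: str) -> None:
        # Graceful path: the current leader releases its own lease,
        # so the successor need not wait out the timeout.
        if self.holder == node:
            self.holder, self.expires = None, 0.0

lock = LeaderLock(lease_seconds=10.0)
now = time.monotonic()
assert lock.acquire("n1", now)
assert not lock.acquire("n2", now)   # lease held: takeover refused
lock.step_down("n1")                 # planned change: no waiting
assert lock.acquire("n2", time.monotonic())
```

The crash path and the planned path converge on the same invariant (one valid lease at a time), which is why the two revocation methods interoperate.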

Vitess is framed as the canonical worked composition: Vitess + VTOrc implements (a) pluggable durability ("durability rules are a plugin for vtorc. The current plugin API is already more powerful than other existing implementations. You can specify cross-zone or cross-region durability without having to carefully balance all the nodes in the right location"), (b) graceful failover built into the Vitess Operator for software deployment, (c) consistent reads via direct-to-leader routing, (d) anti-flapping rules inherited from Orchestrator. The four lock-based advantages are all instantiated. Sugu discloses Vitess's full-auto-pilot intent: "There are still a few corner cases that may require human intervention. We intend to enhance vtorc to also remedy those situations. This will put Vitess on full auto-pilot." — a canonical roadmap disclosure on VTOrc.
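
A sketch of the anti-flapping idea in the Orchestrator/VTOrc spirit (the real implementations track much richer state; this is illustrative only): at minimum, an anti-flapping rule is a per-cluster cooldown on automated leadership changes:

```python
class AntiFlap:
    """Illustrative anti-flapping guard: refuse another automated
    failover for a cluster until a cooldown window has elapsed."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown = cooldown_seconds
        self.last_failover = {}  # cluster -> time of last automated change

    def allow_failover(self, cluster: str, now: float) -> bool:
        last = self.last_failover.get(cluster)
        if last is not None and now - last < self.cooldown:
            return False  # still in cooldown: suppress the flap
        self.last_failover[cluster] = now
        return True

guard = AntiFlap(cooldown_seconds=300.0)
assert guard.allow_failover("shard-0", now=0.0)
assert not guard.allow_failover("shard-0", now=120.0)  # too soon: suppressed
assert guard.allow_failover("shard-0", now=400.0)      # window elapsed
assert guard.allow_failover("shard-1", now=120.0)      # other clusters unaffected
```

The effect is that a node flapping between reachable and unreachable produces at most one leadership change per window instead of a cascade.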

The essay closes by scoping what was deliberately left out (failure detection, consistent reads, node membership changes — "Strictly speaking, these are outside the scope of consensus algorithms, but they need to be addressed for real-life deployments") and offering intellectual humility: "It is possible that consensus could be generalized using a different set of rules. But I personally find the approach presented in this series to be the easiest to reason about." The reframing-as-pedagogy claim closes the arc — the series' project is not to invent a new consensus algorithm but to provide a conceptual substrate that makes Paxos, Raft, Vitess+Orchestrator, and FlexPaxos all legible as instantiations of the same underlying primitives.

Key takeaways

  • Paxos and Raft are foundational historically, not conceptually. "We started off this series by challenging the premise that algorithms like Paxos and Raft are foundational to consensus systems. Such a premise would imply that any other algorithm would just be a variation of the original ones. These algorithms are foundational from a historical perspective, but they are not conceptually foundational." (Source: sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-8-closing-thoughts). Canonical wiki framing: the conceptual primitives — durability requirement, revoke/establish split, race handling, completion protocol, propagation — are foundational; Paxos and Raft are two particular compositions of those primitives, not the definitional base.

  • Paxos and Raft are too rigid to adapt to modern cloud deployments. "We also showed that these algorithms are too rigid. I feel that they would struggle to adapt to the growing complexities of cloud deployments." Canonical claim: the fixed-majority-quorum assumption doesn't map naturally onto cross-zone, cross-region, or heterogeneous-durability requirements. The framework can be bent (FlexPaxos) but the bending exposes that the fixed assumption was itself the problem.

  • FlexPaxos canonicalised as the first crack in the traditional framing. "FlexPaxos was the first advancement that highlighted that the majority quorum is just a special case of intersecting quorums. And intersecting quorums would allow you to configure systems with more flexibility." Canonical wiki datum: intersecting quorums are the generalisation; majority quorum is a particular (symmetric, simple-to-analyse) special case. The generalisation is what enables cross-zone / cross-region durability without re-deriving the protocol.

  • Architectural recommendation 1: durability as a plugin. "A consensus system can be designed in such a way that it assumes nothing about the durability rules. These can be specified with a plugin, and the system should be able to fulfill these requirements without breaking integrity. Of course, the requirements have to be reasonable. We covered some examples in part 3." Canonical new patterns/pluggable-durability-rules pattern. The protocol accepts any durability rule expressible as an intersection-preserving predicate over node sets; the specific rule (cross-zone, cross-region, N-of-M, custom) is config, not code.

  • Structural consequence of pluggable durability: topology elasticity. "A system that supports pluggable durability allows you to deploy additional nodes to the system without majorly affecting its performance characteristics. For example, if you had specified the durability requirement as cross-zone, deploying additional nodes to a zone keeps the system behaving mostly the same way." Canonical wiki datum: adding replicas in an already-covered zone is free under a zone-durability plugin — the plugin constrains which intersections are valid, not how many nodes participate. Topology changes and performance become independent axes.

  • Leadership change recast as revocation + establishment, not single-atomic-action. "We have reconceptualized a leadership change as a two-step process: revocation and establishment. Intersecting quorums are only one way to achieve this goal. We have shown situations where you could achieve a leadership change by directly asking the previous leader to step down. Following this, all we have to do is perform the necessary steps to establish the new leadership. This approach does not require knowledge of intersecting quorums." Canonical wiki retrospective on patterns/separate-revoke-from-establish: the graceful-demotion path is explicitly "does not require knowledge of intersecting quorums" — it sidesteps the protocol altogether. Quorum-based revocation (failure case) and step-down-based revocation (planned case) are interoperable on the same cluster.

  • Multiple revocation methods are interoperable. "We have also shown that multiple methods can be used to change leadership, and that such methods are interoperable. For example, you could use the direct leadership demotion for planned changes, but fall back to intersecting quorums if there are failures in the system." Canonical production shape: graceful demotion for software rollouts (daily) + quorum-revocation for crash/partition failover (rare). Both satisfy the same invariant; the choice is a failure-frequency-driven optimisation.

  • Architectural recommendation 2: lock-based dominates lock-free at scale, four-fold. "In general, a lock-free approach (like what Paxos uses) has elegance from the fact that it does not have a time component. However, lock-based approaches offer so many other flexibilities that they win out in real-life scenarios; With lock-based approaches, you can: (1) Perform graceful leadership changes by requesting the current leader to step down. (2) Although I didn't cover this topic, it is easier to add or remove nodes in a system. (3) You can perform consistent reads by redirecting the read to the current leader. (4) You can implement anti-flapping rules. Due to all these advantages, most large-scale systems implement a lock-based approach." Canonical new patterns/lock-based-over-lock-free-at-scale pattern — closes Part 5's forward-reference "a lock-based system should be preferred for large scale consensus systems" by finally enumerating why. Four advantages, each a different axis: graceful-transition UX, node-membership operational simplicity, leader-local consistent reads, stability discipline.

  • Propagation + versioning recap: the most-difficult-part made tractable. "We studied the corner cases of propagating requests, and suggested versioning of decisions as a way to avoid confusion when there are multiple partial failures. The proposal numbers in Paxos and the term numbers in Raft are just one way to version the decisions. We also showed that many of these failure modes can be completely avoided using anti-flapping rules." Canonical wiki cross-instalment payoff: Part 7's concepts/request-versioning resolution was formally required but operationally optional for systems with strong concepts/anti-flapping (Orchestrator/VTOrc track record). Proposal numbers and term numbers are worked instances of the same versioning primitive.

  • Vitess as canonical worked composition — four features map to four lock-based advantages. "In Vitess, we make full use of the above options and flexibilities. For example, durability rules are a plugin for vtorc. The current plugin API is already more powerful than other existing implementations. You can specify cross-zone or cross-region durability without having to carefully balance all the nodes in the right location. Additionally, Vitess has a graceful failover mechanism that gets used during software deployment. This automation comes built-in as part of the Vitess Operator. Vitess allows you to direct reads to the current leader for consistent reads." Canonical wiki four-way mapping: (1) VTOrc durability plugin = patterns/pluggable-durability-rules; (2) Vitess Operator graceful failover = patterns/graceful-leader-demotion; (3) direct-to-leader routing = consistent reads without quorum; (4) inherited Orchestrator anti-flapping = rate-limited leadership changes. All four lock-based advantages are instantiated in one composite system. Canonical disclosure: Vitess's durability plugin API is "more powerful than other existing implementations" — a comparative claim about the FlexPaxos-inspired-but-extended state of the art in production MySQL leadership management.

  • VTOrc full-auto-pilot roadmap disclosed. "There are still a few corner cases that may require human intervention. We intend to enhance vtorc to also remedy those situations. This will put Vitess on full auto-pilot." Canonical VTOrc roadmap on the wiki: the system is currently mostly-auto with a tail of human-intervention cases; the stated intent is to close the tail. Previously Part 7 stated "we also intend to tighten some of these corner cases to minimize the need for humans to intervene"; Part 8 escalates to the full-auto-pilot framing.

  • Deliberately out-of-scope: failure detection, consistent reads, node membership. "There are still a few topics that could be worth covering: Failure detection, Consistent reads, Adding and removing nodes. Strictly speaking, these are outside the scope of consensus algorithms, but they need to be addressed for real-life deployments. I can cover these later with some independent posts." Canonical wiki scope-boundary disclosure: consensus-proper ends at "given a stable leader and a durable history, produce a correct successor"; the production-deployment adjacencies are acknowledged as necessary but treated as separable concerns. The series does not claim to be a complete production-deployment manual.

  • Intellectual humility on the framework's uniqueness. "It is possible that consensus could be generalized using a different set of rules. But I personally find the approach presented in this series to be the easiest to reason about." Canonical wiki datum: Sugu does not claim the revoke/establish + race + completion + propagation decomposition is the only valid conceptual substrate — only that it is his preferred one for pedagogy and architectural reasoning. Alternative decompositions remain possible; this one's claim is reason-ability, not uniqueness.
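
The versioning recap above reduces to a single comparison. A minimal sketch (illustrative; Paxos proposal numbers and Raft term numbers are two concrete instances of the same primitive): every leadership establishment mints a higher version, and replicas discard any request stamped with an older one:

```python
class Replica:
    """Illustrative decision versioning: accept only requests whose
    leadership version is at least the newest this replica has seen."""

    def __init__(self):
        self.current_version = 0
        self.log = []

    def accept(self, version: int, request: str) -> bool:
        if version < self.current_version:
            return False  # stale leader: a newer establishment won
        self.current_version = version
        self.log.append((version, request))
        return True

r = Replica()
assert r.accept(1, "write-a")      # from the leader of version 1
assert r.accept(2, "write-b")      # a new leader was established meanwhile
assert not r.accept(1, "write-c")  # the old leader's straggler is rejected
assert r.log == [(1, "write-a"), (2, "write-b")]
```

Anti-flapping reduces how often the stale-leader branch is exercised; versioning is what makes it safe when it is.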

Systems

  • Vitess — canonical worked composition of all four lock-based advantages; durability plugin via VTOrc; graceful failover via Vitess Operator; direct-to-leader consistent reads; anti-flapping inherited from Orchestrator. Ninth canonical Vitess-internals disclosure on the wiki (after evalengine, data motion, query routing, throttler, fork management, backup / observability, authz, online-DDL + schema revert, and Part 7's propagation).
  • VTOrc — durability-plugin carrier, failover orchestrator, full-auto-pilot target; VTOrc plugin API is "more powerful than other existing implementations" (comparative claim).
  • Orchestrator — anti-flapping-lineage ancestor; Vitess inherits its safety properties via the VTOrc fork.
  • MySQL — the replication substrate underlying the composition; GTID + timestamp carry the version-metadata the propagation layer requires.

Concepts

  • Request propagation — recap context; propagation's hardest-failure-modes resolved by versioning + anti-flapping.
  • Revoke and establish split — the capstone re-emphasises the split as the foundational generalisation over Paxos/Raft's conflated single-atomic-action.
  • Anti-flapping — named as one of the four lock-based advantages; Orchestrator/VTOrc as canonical production instances.
  • Leader revocation + Leader establishment — the two separated steps, each with its own independently-swappable mechanism set.
  • Request versioning — Part 7 resolution recap; proposal numbers and term numbers as canonical instances.
  • Elector — the agent running the leadership-change protocol; separate from the candidate; VTOrc as canonical production instance.
  • No distributed consensus — structural contrast: Sugu's capstone sits at the yes-consensus end of the axis; Fly.io's Corrosion is the canonical no-consensus end.
  • Consistent read — named as one of the four lock-based advantages (direct-to-leader read under a stable leader).
  • Proposal number — Paxos's particular versioning mechanism, generalised to request-versioning at large.

Patterns

  • patterns/pluggable-durability-rules — new pattern canonicalised by this part: durability requirements expressed as a plugin over the consensus core, the practical payoff of intersecting quorums.
  • patterns/lock-based-over-lock-free-at-scale — new pattern canonicalised by this part: closes Part 5's forward-reference by enumerating the four advantages of lock-based leadership.
  • patterns/separate-revoke-from-establish — retrospective re-emphasis: the capstone names the revocation + establishment split as the foundational generalisation over Paxos/Raft's conflated single action.
  • patterns/graceful-leader-demotion — instantiated by the Vitess Operator's graceful failover during software deployment.

Operational numbers

  • No disclosed numbers. Capstone essay; Sugu defers quantitative disclosures to the preceding seven parts and to the referenced production-deployment adjacencies (failure detection, consistent reads, node membership) that he names as out-of-scope.
  • Comparative claim (unquantified): VTOrc's durability plugin API is "more powerful than other existing implementations" — relative to Orchestrator upstream and other MySQL failover tools. No specific feature matrix or benchmark disclosed.

Caveats

  • Capstone essay, not primary-protocol disclosure. Part 8 introduces no new protocol mechanism; it consolidates Parts 1–7 into a retrospective-plus-recommendation. All quoted disclosures were already present in prior parts except the four-advantage enumeration (which was a forward-reference from Part 5) and the VTOrc full-auto-pilot intent (which Part 7 mentioned only partially).
  • Framework-uniqueness is not claimed. Sugu explicitly acknowledges "It is possible that consensus could be generalized using a different set of rules" — the series' project is pedagogical clarity, not theoretical novelty. Alternative decompositions (e.g. virtual synchrony, viewstamped replication, atomic broadcast) are not compared against.
  • Vitess's durability-plugin-API-is-more-powerful-than-other-existing-implementations claim is unsubstantiated in this post. No feature matrix, no comparison to upstream Orchestrator's durability handling, no comparison to CockroachDB / TiDB / YugabyteDB / Spanner durability primitives. The claim is asserted; readers would need to compare implementations independently.
  • Out-of-scope topics are named but not covered. Failure detection, consistent reads, and node-membership changes are acknowledged as production-necessary but deferred to "independent posts" that may or may not materialise. This series does not close those loops.
  • "Rigid" as applied to Paxos/Raft is a qualitative judgement. Sugu's claim that these algorithms "would struggle to adapt to the growing complexities of cloud deployments" is an architectural opinion — counter-examples exist (Google Spanner, etcd, CockroachDB all use Raft or Multi-Paxos variants and scale to cloud workloads). The claim is that Paxos/Raft force a specific set of trade-offs that may not match every deployment's needs, not that they cannot scale at all.
  • FlexPaxos framing is Sugu's selective reading. The FlexPaxos paper (Howard/Malkhi/Spiegelman 2016) generalises majority quorums to intersecting quorums on the read and write sides independently; Sugu cites this as "the first advancement", but a fuller survey of consensus generalisations (e.g. Fast Paxos, Generalized Paxos, EPaxos) would show that FlexPaxos is one of several.
  • Anti-flapping-as-propagation-race-mitigation inherits Part 7's empirical-not-formal status. Sugu's claim that anti-flapping makes versioning "operationally optional" rests on the Orchestrator production track record, not a proof. Part 8 does not add a proof.

Series context

This is Part 8 (closing thoughts) of Sugu Sougoumarane's 8-part Consensus Algorithms at Scale series. The complete series on the wiki:

  • Part 1 — Original consensus properties (Paxos) and why practical scenarios need modification. Not yet on the wiki.
  • Part 2 — Practical-scenario-ready consensus modified to accept a series of requests; single-leader narrowing. Not yet on the wiki.
  • Part 3 — Durability-agnostic rules; rejection of majority-quorum as a core building block. Not yet on the wiki.
  • Part 4 — Establishment and revocation — revocation + establishment separation; Vitess PRS/ERS as worked instance.
  • Part 5 — Handling races — elector/candidate split; lock-based vs lock-free; forward-reference to Part 8's lock-based recommendation.
  • Part 6 — Completing requests — two-phase protocol (tentative → durable → complete); MySQL semi-sync split-brain as non-example.
  • Part 7 — Propagating requests — seven failure modes; request-versioning resolution; anti-flapping production shortcut; VTOrc as Orchestrator fork.
  • Part 8 — Closing thoughts (this post) — capstone; pluggable durability + lock-based-at-scale as the two architectural recommendations; Vitess as canonical worked composition; VTOrc full-auto-pilot intent.

Parts 1–3 could be ingested in future if re-fetched; they carry the foundational framing (Paxos properties, practical modifications, durability-agnostic rules) that the later parts build on. Parts 4–8 are on the wiki as canonical primary sources.
