Skip to content

PLANETSCALE 2022-04-28

Read original ↗

PlanetScale — Consensus algorithms at scale: Part 5 — Handling races

Summary

Sugu Sougoumarane (Vitess co-creator, PlanetScale, originally 2022-04-28, re-fetched via RSS 2026-04-21) publishes Part 5 of the Consensus algorithms at scale series. Where Part 4 factored leadership change into revoke + establish, Part 5 adds the third orthogonal concern: race handling between multiple electors — and introduces the elector role as a first-class concept distinct from the candidate role.

The post's load-bearing taxonomic move: there are exactly two families of race-resolution strategy for leader election, and they correspond to "first elector wins" (lock-based) and "last elector wins, where 'last' means newest by proposal number" (lock-free). Paxos and Raft are both lock-based under the hood — Raft's term-number push to a majority is a distributed-lock acquisition, shadowed by the revoke/establish steps bundled into the same round-trip. Making the lock explicit and separate (as Vitess does by using etcd as the leadership lock) is the unlock that lets you tune the lock mechanism independently of the revoke/establish mechanisms.

The elector/candidate split canonicalised. "It is not necessary for a candidate to elect itself as leader. A separate agent can perform all the necessary steps to promote a candidate as leader. We will call this the elector." YouTube's operational shape — 15 eligible primaries per shard, one elector per region — is the motivating example. Vitess's VTOrc is named as Vitess's elector component; this is the canonical wiki introduction of the elector vocabulary.

The lock-based vs lock-free trade-off resolved at scale. Sougoumarane's verdict, stated in Part 5 short-form (Part 8 restates with full four-advantage enumeration): "The elegance of a lock-free approach may seem tempting, but the lack of a stable leader complicates everything else. Having to reach a quorum for consistent reads is a major drawback for scaling systems. Weighing these options, a lock-based system should be preferred for large scale consensus systems." The load-bearing cost driver is consistent reads at scale — lock-based leases let reads go direct-to-leader; lock-free forces quorum reads. At thousands of reads per second per cluster, the round-trip difference dominates.

The forward-progress axis. Lock-free systems get forward progress for free (a crashed elector never blocks a later one — newer-proposal-number just supersedes). Lock-based systems must manufacture forward progress via a time component: "any elector that obtains a lock must complete its task within a certain period of time, after which the lock is automatically released." The time component serves dual duty as the lock auto-release and as the leader lease — the same primitive enabling consistent reads without quorum.

Architectural contribution to the wiki: canonicalises the lock-based vs lock-free taxonomy, the elector/candidate split, the proposal-number/term-number unification, the forward-progress vs time-component trade-off, the leader-lease primitive enabling consistent reads, and Vitess's etcd-backed external-coordinator shape as the canonical production instance. This is the canonical-source post for elector, proposal number, forward progress, leader lease, and revoke-and-establish split — five already-named concepts on the wiki that were forward-referenced from Parts 6–8 but introduced here.

Key takeaways

  • Races between electors are inevitable once there is more than one elector. "If we had only one agent that performed leadership changes, there would be no reason to worry about races. But this is not practical because a network partition could isolate that agent and prevent it from performing the necessary actions. Introducing more than one agent to perform leadership changes requires us to handle race conditions." (Source: sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-5-handling-races). The single-elector design is unavailable for reliability reasons, so race handling is a structural concern, not a corner case.

  • There are exactly two race-resolution families: lock-based ("first elector wins") and lock-free ("last/newest elector wins"). Sougoumarane's taxonomic claim: "There are two approaches to resolving races: either the first agent wins, or the last one wins. The determination of who is first or last can vary depending on the approach used… An approach that makes the first agent win essentially prevents later agents from succeeding. This is equivalent to obtaining a lock. We will therefore call this approach lock-based. The other one will be called the lock-free approach." The two families are fundamentally different ("this difference is quite fundamental, and it is surprising that it has not been called out explicitly for consensus algorithms"), and choosing between them is a load-bearing design decision.

  • Electors are conceptually distinct from candidates. "It is not necessary for a candidate to elect itself as leader. A separate agent can perform all the necessary steps to promote a candidate as leader. We will call this the elector… In many situations, like in the case of YouTube, this separation may be necessary. Unsurprisingly, Vitess has also inherited this separation: VTorc is a Vitess component that acts as the elector." The elector/candidate split is the taxonomic enabler for everything that follows: races are between electors, not between candidates; the lock (if any) is held by an elector; proposal numbers (if any) are assigned by electors.

  • YouTube operational shape: 15 eligible primaries per shard, one elector per region. "At YouTube, each shard had fifteen eligible primaries. Having all of them scan for failures and perform active failovers was not practical. Instead, we had one agent in each region that scanned for failures and also performed leadership elections." This is the canonical production instance of the elector/candidate separation — 15 × N shards of MySQL primaries would otherwise each be doing per-second failure-scanning work; one elector per region does it for all of them.

  • Raft's "lock acquisition" is implicit, not absent. "One may argue that a system like Raft does not obtain a lock even though it makes the first elector win. This is because the act of obtaining a lock is shadowed by other actions it takes. If you subtract out the other actions (revoke, establish, and propagate) in the code that performs an election, it will be evident that what is left is the act of obtaining a distributed lock." Raft's term-number push to a majority IS the lock — it's fused with the revoke and establish steps into a single majority-quorum operation. The fusion is why Raft is perceived as not using a lock, but making the lock explicit is what lets you tune it independently.

  • Majority-quorum is one way to achieve lock acquisition; it is not the only way. "The primary requirement of using the existing nodes to obtain a lock is that the set of nodes each elector reaches has to intersect with those of the others. A majority-based system automatically ensures this intersection." Intersection is the invariant; majority is a convenient way to guarantee it but not the only one. Alternatives named: use a simple majority for lock acquisition while using a more sophisticated approach for revocation and establishment (FlexPaxos-style separation); external coordinator ("In Vitess, we use an external system like etcd to obtain such a lock"); human-authorized elector ("Humans could decide to manually authorize an elector to perform a leadership change, essentially giving it a 'lock'").

  • Vitess as canonical external-coordinator instance: etcd holds the leadership lock, Vitess handles the database-level revoke/establish. "In Vitess, we use an external system like etcd to obtain such a lock. The decision to rely on another consensus system to implement our own may seem odd. But the difference is that Vitess itself can complete a massive number of requests per second. Its usage of etcd is only for changing leadership, which is in the order of once per day or week." The cost-asymmetry argument: thousands of requests/sec vs. once/day leadership changes → the external coordinator's cost is rounding error on the per-second amortised budget, and the architectural clarity of separating the lock from the data-plane is worth the integration. Canonical instance of patterns/external-coordinator-for-leadership-lock.

  • Under lock-based election, proposal numbers are not needed for electing a leader. "If you are using a lock-based approach, there is no need to use proposal numbers or leadership terms for the purpose of electing a leader. But we may still need to assign such a number to facilitate the propagation of requests. We will discuss this in a subsequent blog." The forward-reference closed in Part 7: proposal numbers reappear at the request-propagation layer as request versioning even in lock-based systems. Same mechanism, two different layers — the revoke-and-establish split is what lets you factor the usage.

  • Lock-based systems get a stable leader → direct-to-leader consistent reads → massive cost win at scale. "Once a lock is obtained, the elector is guaranteed that no one else will be changing the system while they hold the lock. A big advantage of this situation is that the current leader can be discovered, which allows us to use the graceful method of changing leadership described in the previous blog. In lock-based systems, we have to rely on accurate clocks. The system also has to make sure that sufficient tolerances are built into timeouts to account for normal clock skews, which are typically in the milliseconds. In general, it is advisable to use 'many seconds' of granularity to sequence events. Relying on locks lets us exploit the time component to give leaders a lease, thereby guaranteeing no leader change during the lease period. This lets the users perform efficient, consistent reads by accessing the leader without the need for a quorum read. The existing leader could continue to renew the lease as it completes more requests, which leads to prolonged stability." Canonical passage for concepts/leader-lease + concepts/consistent-read coupling + the millisecond-skew / many-seconds-granularity safety-margin recommendation.

  • Lock-free election converges via "newest elector wins" on proposal numbers. "In the case of a lock-free approach, the newest elector must win over an older one. This requires the algorithm to assign a time-based order to the electors that are racing with each other. In Paxos, these are referred to as proposal numbers. In Raft, these are known as term numbers. To facilitate reasoning, we can view these numbers as timestamps." Unifies Paxos proposal numbers and Raft term numbers under a single concept. The core protocol mechanism: "The core of the lock-free approach is that the followers that accept an establishment or revocation request must remember the timestamp of the elector that issued the request and should reject requests with older timestamps."

  • Lock-free safety rule: revoke against all potential candidates, not just the known leader. "To cover the above scenarios, every elector must assume that there may be another elector with an older timestamp attempting a leadership change. It must therefore attempt to revoke leadership from all potential candidates, not just the current known leader. The completion of this process ensures that all possible leaderships (present and future) with an older timestamp are invalidated. This addresses the case where an old elector is slow at performing its actions. This also adds safety against clock skews: a new leader with an incorrect older timestamp will just fail at completing a leadership change, as if it was an older leader." The revoke-all-potential-candidates rule absorbs both the slow-predecessor case and the clock-skew case under a single invariant.

  • Forward progress is free under lock-free, manufactured under lock-based. "The main advantage of a lock-free approach is that it naturally supports forward progress. If an existing elector fails, a different elector can initiate a new round without knowledge of the state or age of an older elector. For this reason, there is no need to depend on timeouts." Contrast: "However, the lock-based approach introduces a problem with forward progress. For example, an elector may successfully obtain a lock and then crash or become partitioned out of the rest of the system. This will prevent all other electors from ever being able to repair this situation. To resolve this, lock-based systems must introduce a time component: any elector that obtains a lock must complete its task within a certain period of time, after which the lock is automatically released." The time-component is the structural cost of lock-based design — and the same time-component that enables the leader lease.

  • Lock-based converges faster in the common case. "A lock-based approach generally converges faster than approaches that are lock-free. This is because the first node that attempts a leadership change is likely to have made the most progress towards completing the task. Under most circumstances, giving the first elector the chance to succeed will complete the leadership change with the least disruption." First-attempter-usually-wins is the common-case optimisation that lock-based inherits structurally — the elector that detects the failure first is also the one best positioned to complete the repair.

  • Lock-free's central drawback: no stable leader → quorum reads. "The disadvantage of a lock-free approach is that there is no certainty of a stable leader. This is because it is possible for a leadership to end between the time you discover it and give it a request. The absence of a stable leader makes consistent reads complicated. We essentially have to resort to quorum reads." This is the load-bearing cost factor that tips the verdict toward lock-based at scale.

  • Canonical verdict: prefer lock-based at scale. "The elegance of a lock-free approach may seem tempting, but the lack of a stable leader complicates everything else. Having to reach a quorum for consistent reads is a major drawback for scaling systems. Weighing these options, a lock-based system should be preferred for large scale consensus systems. Having a stable leader simplifies many other operational parts of the system. In Vitess, the current leader for each cluster is published through its topology, and a large number of workflows rely on this information to perform various tasks. Any operation that does not want the leader to change just has to obtain a lock before doing its work." Canonical production proof: every Vitess operation that needs a stable leader (backups, schema changes, reparents) just grabs the etcd lock. The patterns/lock-based-over-lock-free-at-scale pattern rests on this architectural posture.

Systems

  • Vitess — canonical production instance of lock-based leader election via external coordinator; topology publishes current leader; lock-grab is the universal primitive for "any operation that does not want the leader to change."
  • VTOrc — Vitess's elector component. Scans for MySQL primary failures and initiates active failovers by acquiring the Vitess-topology lock (via etcd) and running the revoke/establish protocol.
  • etcd — Vitess's external leadership-lock substrate. Used only for leadership change (once per day or week per cluster); the per-request data-plane bypasses etcd entirely. Canonical instance of "the decision to rely on another consensus system to implement our own".
  • MySQL — the data-plane substrate underneath Vitess; MySQL semi-sync replication + GTID are the primitives Vitess uses for establishment and propagation.
  • PlanetScale — publisher; Sougoumarane was PlanetScale co-founder at time of writing.

Concepts

  • Elector — the agent that runs the leadership-change protocol; distinct from the candidate (the node eligible to become leader). Canonical-source page introduces here.
  • Proposal number — unified Paxos proposal-number / Raft term-number concept. The lock-free family's race-resolution mechanism (newest wins). Canonical-source page introduces here.
  • Forward progress — the liveness invariant that the system keeps moving despite elector failures. Free under lock-free, manufactured via time-component under lock-based. Canonical-source page introduces here.
  • Leader lease — time-bounded grant that no leadership change will occur during its validity; enables direct-to-leader consistent reads. Canonical-source page introduces here.
  • Revoke-and-establish split — the pedagogical decomposition from Part 4; Part 5 layers race handling on top as an orthogonal concern. Canonical-source page extends here.
  • Consistent read — existing wiki page; Part 5 canonicalises the lock-based → lease → direct-to-leader path as the cost-dominant way to achieve consistent reads at scale.
  • Anti-flapping — existing wiki page (canonical introduction in Part 7); Part 5 provides the structural foundation (stable leader + lock) that makes anti-flapping operationally straightforward to enforce.

Patterns

  • Lock-based leader election (new)"first elector wins" race-resolution family via exclusive lock acquisition; requires time-component for forward progress; enables leader lease + consistent reads.
  • Lock-free leader election (new)"newest elector wins" race-resolution family via monotonic proposal numbers; natively supports forward progress; no stable leader → quorum reads required.
  • Prefer lock-based over lock-free at scale — existing page (canonical source is here + Part 8). Part 5 is the original source; Part 8 is the capstone restatement with four-advantage enumeration.
  • External coordinator for leadership lock (new) — Vitess+etcd as the canonical shape: decouple the per-second data-plane consensus substrate from the once-per-day leadership-lock substrate; accept the cost of running two consensus systems because the cost is asymmetric.
  • Separate revoke from establish — existing pattern (canonical source Part 4); Part 5 layers race-resolution choice on top as the third orthogonal concern.
  • Pluggable durability rules — existing pattern; Part 5 references the composability: "Use a simple majority for the purpose of obtaining a lock, while using a more sophisticated approach for revocation and establishment."

Operational numbers

  • YouTube: 15 eligible primaries per shard, one elector per region. The operational shape that motivates the elector/candidate split.
  • Cost asymmetry: thousands of requests/sec vs once/day or once/week leadership changes — the load-bearing argument for paying the cost of an external coordinator like etcd.
  • Clock-skew tolerance: "typically in the milliseconds"; recommended sequencing granularity: "many seconds" — safety margin ~10³ for lock-based time-component sequencing.
  • No disclosed numbers for: etcd lock-acquisition latency, VTOrc failover completion time, lock auto-release timeout, lease renewal cadence, consistent-read RTT savings vs quorum read.

Caveats

  • Theoretical / taxonomic post, not a production retrospective. No production incidents narrated; no VTOrc failover telemetry; no etcd lock contention statistics; no leader-lease tuning data.
  • The verdict "prefer lock-based at scale" is stated in Part 5 short-form without full justification; the four-advantage enumeration (graceful demotion, node membership, consistent reads, anti-flapping) appears in Part 8. Reading Part 5 in isolation gives a partial argument — the capstone is required for the full case.
  • Lock-free counter-arguments are acknowledged but not deeply engaged. The "elegance from the fact that it does not have a time component" framing is honest but brief; readers seeking a fairer lock-free case (e.g. Multi-Paxos, or Chubby's long-lease approach over Paxos) should look elsewhere.
  • "Every elector must assume there may be another elector with an older timestamp attempting a leadership change" — the revoke-all-potential-candidates safety rule is stated as a requirement but the practical question of how to enumerate "all potential candidates" in a long-lived cluster with node churn is not addressed. Part 7 returns to this when discussing request propagation; Part 5 leaves it implicit.
  • Vitess's "use etcd for the lock" design is named but not architecturally walked-through — no RPC surface, no lock-lease-renewal protocol detail, no VTOrc-to-etcd protocol description. For Vitess operational depth see the Vitess docs.

Series context

The series arc by part:

  • Part 1 — introduction; not yet on the wiki.
  • Part 2 — single-leader narrowing; not yet on the wiki.
  • Part 3 — durability-agnostic rules; four production deployment shapes; FlexPaxos in plain English.
  • Part 4 — revoke + establish separation; Vitess PRS/ERS as worked instance.
  • Part 5 (this post) — race handling; lock-based vs lock-free taxonomy; elector/candidate split; forward progress vs time-component; canonical verdict "prefer lock-based at scale".
  • Part 6 — commit-path protocol; durable stage vs completed stage; retry-not-cancel rule.
  • Part 7 — request propagation; seven failure modes; per-request versioning; anti-flapping as mitigation.
  • Part 8 — capstone; pluggable durability + lock-based-at-scale as the two architectural recommendations; FlexPaxos cited by name.

Part 5 is the taxonomic hinge of the series — everything after it reads through the lock-based vs lock-free lens introduced here.

Source

Last updated · 542 distilled / 1,571 read