PlanetScale — Consensus algorithms at scale: Part 5 - Handling races¶
Summary¶
Fifth instalment of Sugu Sougoumarane's Consensus algorithms at scale series on the PlanetScale blog (originally 2022-04-28; Sugu is the co-creator of Vitess, ex-YouTube, PlanetScale CTO). Part 5 takes the earlier framework — leadership change decomposed into Revoke and Establish (concepts/revoke-and-establish-split) — and analyses how the two concerns compose under races between multiple agents. The series' central pedagogical move is to separate the role of the elector from that of the candidate: an elector is any agent permitted to perform leadership changes; a candidate is a node eligible to become leader. Most traditional consensus algorithms collapse these two roles, but real large-scale systems (YouTube, Vitess) factor them apart because the elector pool and candidate pool differ in number, failure modes, and deployment locus.
The post classifies race-resolution strategies into two fundamental shapes. Lock-based strategies ensure "first elector wins" by obtaining a mutually exclusive lock before making changes; Raft is the canonical in-line example ("Raft obtains a lock, but it is implicit — it is shadowed by revocation, establishment, and propagation"). Lock-free strategies ensure "newest elector wins" by attaching a time-ordered number (Paxos proposal numbers, Raft term numbers) to every leadership attempt and requiring followers to reject requests with older numbers. The two differ along a fundamental design axis Sugu surfaces explicitly: whether there is a stable leader. Lock-based systems have one, and can therefore offer leader leases for efficient consistent reads; lock-free systems do not, and therefore require quorum reads for consistency. The essay closes with an explicit recommendation: "a lock-based system should be preferred for large scale consensus systems," precisely because lease-backed consistent reads dominate the cost structure at scale, and because a stable published-leader is the primitive many other operational workflows (backups, schema changes, routing) are written against.
Key takeaways¶
-
The elector is a distinct role from the candidate. A candidate is a node eligible to become leader; an elector is an agent permitted to run the leadership-change protocol. They are often the same (self-election in classical Raft) but need not be: "At YouTube, each shard had fifteen eligible primaries. Having all of them scan for failures and perform active failovers was not practical. Instead, we had one agent in each region that scanned for failures and also performed leadership elections." Vitess inherited this separation as VTOrc: "Vitess has also inherited this separation: VTorc is a Vitess component that acts as the elector."
-
The "one elector" model fails under network partition. If exactly one agent performs leadership changes, "a network partition could isolate that agent and prevent it from performing the necessary actions." Multiple electors are therefore mandatory, which mandates race resolution.
-
Race resolution has exactly two shapes: lock-based and lock-free. "There are two approaches to resolving races: either the first agent wins, or the last one wins. … An approach that makes the first agent win essentially prevents later agents from succeeding. This is equivalent to obtaining a lock. We will therefore call this approach lock-based. The other one will be called the lock-free approach." The taxonomy is fundamental and — Sugu explicitly flags — "surprising that it has not been called out explicitly for consensus algorithms."
-
Raft is implicitly lock-based. "One may argue that a system like Raft does not obtain a lock even though it makes the first elector win. This is because the act of obtaining a lock is shadowed by other actions it takes. If you subtract out the other actions (revoke, establish, and propagate) in the code that performs an election, it will be evident that what is left is the act of obtaining a distributed lock." The lock emerges as a byproduct of majority-quorum overlap: any two electors' reached-node-sets must intersect, giving at most one a quorum ack.
-
Lock-based systems need a time component for forward progress. A lock-holder that crashes or partitions would otherwise freeze the system indefinitely. "To resolve this, lock-based systems must introduce a time component: any elector that obtains a lock must complete its task within a certain period of time, after which the lock is automatically released." This reuse-as-timeout composes with leader leases.
-
Locks need not come from quorum. The classical construction borrows the lock from majority-quorum overlap, but other mechanisms suffice: (a) "Use a simple majority for the purpose of obtaining a lock, while using a more sophisticated approach for revocation and establishment"; (b) "In Vitess, we use an external system like etcd to obtain such a lock" — the leadership-change rate (once per day or week) is small enough that piggybacking on etcd's consensus costs nothing; (c) "Humans could decide to manually authorize an elector to perform a leadership change" — humans are a lock substrate too. The new pattern patterns/external-coordinator-for-leadership-lock canonicalises the Vitess/etcd instance.
-
Lock-free approaches require timestamp-ordered proposal numbers. "The core of the lock-free approach is that the followers that accept an establishment or revocation request must remember the timestamp of the elector that issued the request and should reject requests with older timestamps." Paxos proposal numbers and Raft term numbers are the canonical instances (concepts/proposal-number). "To facilitate reasoning, we can view these numbers as timestamps."
-
Lock-free convergence paths are two-branch. "(1) The elector with the older timestamp completes its election before the one with the newer timestamp. Following this, the one with the newer timestamp will end up revoking that leadership and establishing its own. (2) The elector with the newer timestamps completes first. Then the one with the older timestamp will fail at its attempt, and the leadership established by the newer timestamp will prevail." Both branches converge on the newer-timestamp leader.
-
Every lock-free elector must revoke leadership from all potential candidates, not just the current known leader. This is the safety requirement that covers the case of a slow older elector still in-flight and the case of a new leader carrying a stale/skewed timestamp: "It must therefore attempt to revoke leadership from all potential candidates, not just the current known leader. The completion of this process ensures that all possible leaderships (present and future) with an older timestamp are invalidated. This addresses the case where an old elector is slow at performing its actions. This also adds safety against clock skews."
-
The lock-based vs lock-free choice determines whether you have a stable leader — and therefore whether consistent reads can avoid a quorum round-trip. "The main advantage of a lock-free approach is that it naturally supports forward progress. If an existing elector fails, a different elector can initiate a new round without knowledge of the state or age of an older elector. For this reason, there is no need to depend on timeouts. The disadvantage of a lock-free approach is that there is no certainty of a stable leader. … The absence of a stable leader makes consistent reads complicated. We essentially have to resort to quorum reads." Conversely, lock-based systems can offer a leader lease (concepts/leader-lease) during which no leader change will occur, and readers can hit the leader directly — the new patterns/lease-based-consistent-read pattern.
-
Leases are renewable to give the steady-state leader arbitrarily long stability. "The existing leader could continue to renew the lease as it completes more requests, which leads to prolonged stability." Renewal-on-activity converges to a stable leader under normal operation, which is the condition most of the system is actually optimised for.
-
Clock-skew tolerance is a real concern for lock-based systems. "In lock-based systems, we have to rely on accurate clocks. The system also has to make sure that sufficient tolerances are built into timeouts to account for normal clock skews, which are typically in the milliseconds. In general, it is advisable to use 'many seconds' of granularity to sequence events." Milliseconds of skew plus seconds of sequencing granularity is the concrete safety margin Sugu recommends.
-
Lock-based converges faster on the common path. "A lock-based approach generally converges faster than approaches that are lock-free. This is because the first node that attempts a leadership change is likely to have made the most progress towards completing the task." First-in is correlated with most-informed because the agent that detected the failure first was physically or topologically closest to the failing node.
-
Closing recommendation: "Weighing these options, a lock-based system should be preferred for large scale consensus systems. Having a stable leader simplifies many other operational parts of the system. In Vitess, the current leader for each cluster is published through its topology, and a large number of workflows rely on this information to perform various tasks. Any operation that does not want the leader to change just has to obtain a lock before doing its work." The published-stable-leader is the primitive external workflows (backups, schema changes, query routing, operational scripts) lock against.
Systems extracted¶
- Vitess — the sharded-MySQL substrate behind PlanetScale. Series written from the perspective of Vitess's operational experience at YouTube-scale. Uses external etcd as the leadership-lock substrate and VTOrc as the elector. The series canonicalises the elector/candidate split and the lock-based-with-lease choice for Vitess.
- VTOrc — new canonical wiki system page. Vitess's failure-scanning + leader-election elector. One VTOrc per region; VTOrcs scan for failures and initiate active failovers; multiple VTOrcs per cluster participate in leader-change races, resolved through an etcd-backed lock.
- etcd — extended on this wiki. New Seen-in disclosure that Vitess uses etcd explicitly as the leadership-change lock substrate: "In Vitess, we use an external system like etcd to obtain such a lock. The decision to rely on another consensus system to implement our own may seem odd. But the difference is that Vitess itself can complete a massive number of requests per second. Its usage of etcd is only for changing leadership, which is in the order of once per day or week." Canonical wiki disclosure that high-QPS consensus systems can delegate their own coordination to a separate, lower-QPS consensus system — the paradox-that-isn't of using-consensus-to-implement-consensus.
- Raft — treated throughout as the canonical implicitly lock-based consensus algorithm; the essay argues Raft's quorum-acquire step is a distributed lock, just one shadowed by revocation + establishment + propagation.
- Paxos (no dedicated wiki page at this ingest) — treated as the canonical lock-free reference point via its proposal numbers, contrasted with Raft's implicit locking.
- YouTube — the production context Sugu is drawing from. 15 eligible primaries per shard; one elector (proto-VTOrc) per region. Concrete example of why elector and candidate roles are separated in practice.
Concepts canonicalised¶
- concepts/revoke-and-establish-split — existing wiki hook (series-wide): leadership change factors into Revoke (invalidate prior leader) + Establish (commit new leader), as two separately-designable concerns. Part 5 extends this with the lock-based vs lock-free decomposition of the race-resolution layer that sits on top of Revoke/Establish.
- concepts/elector — new canonical page. An agent permitted to perform a leadership-change protocol; distinct from (though can coincide with) a candidate. Named to enable taxonomic reasoning about lock-based vs lock-free races. Vitess/VTOrc and YouTube/per-region-elector are the canonical instances.
- concepts/forward-progress — new canonical page. The system property that election attempts must eventually succeed even in the presence of elector failures/partitions. Lock-based systems need explicit timeouts to preserve it; lock-free systems get it naturally from the newer-wins rule.
- concepts/leader-lease — new canonical page. A time-bounded grant to a leader during which no leader change will occur; enables local reads without a quorum round-trip. Renewable on activity. Only coherent under a lock-based election protocol (a lock-free system has no stable leader to lease).
- concepts/proposal-number — new canonical page. The timestamp-style totally-ordered identifier attached to every leadership attempt in lock-free consensus. Paxos proposal numbers, Raft term numbers, and generic lock-free designs all instantiate it. Followers remember the latest seen and reject stale.
- concepts/consistent-read — new canonical page. A read that observes all writes committed before it. In a lock-based system with a valid leader lease, a read from the leader is trivially consistent. In a lock-free system with no stable leader, consistent reads must go through a quorum round-trip.
- concepts/no-distributed-consensus — existing page, extended. Fly.io framing already canonicalised the avoid-consensus-at-WAN case; Part 5 gives the inverse framing: even when you are running a consensus system, you can delegate its own coordination (leader-change locks) to a separate consensus system at coarser granularity. The two framings together sharpen the wiki's treatment of when consensus belongs where.
- concepts/split-brain — existing page, extended. Part 5 names race-between-electors as a precursor shape of split-brain and the lock-based/lock-free split as the two tool-families for preventing it.
- concepts/soft-leader-election — existing page, extended. Sugu's "hard" (lock-based / lock-free) framing sits orthogonally to the Databricks-Dicer "soft" framing; cross-reference bidirectionally.
Patterns canonicalised¶
- patterns/lock-based-leader-election — new canonical page. First-elector-wins race resolution via mutually exclusive lock before performing Revoke + Establish. Forward progress requires time-bounded lock with auto-release. Raft is the implicit instance; Vitess/etcd is the explicit-external-lock instance.
- patterns/lock-free-leader-election — new canonical page. Newest-elector-wins race resolution via totally-ordered proposal numbers. Forward progress is free. Every elector must revoke all potential candidates, not just the known leader. Paxos and Raft (via term numbers) are the canonical instances.
- patterns/lease-based-consistent-read — new canonical page. Under a lock-based election with leader lease, serve reads directly from the leader without a quorum round-trip; the lease guarantees no concurrent leader change. Only applicable when the election shape is lock-based.
- patterns/external-coordinator-for-leadership-lock — new canonical page. Use a separate consensus system (etcd, ZooKeeper, Consul, or a human) as the lock substrate for leadership changes, decoupled from the main data-plane consensus. Justified when leadership changes are rare (Vitess: "once per day or week") and the data-plane consensus is high-QPS. Canonical instance: Vitess → etcd.
Operational numbers¶
- YouTube shard primary pool: "fifteen eligible primaries" per shard — the concrete scale motivating the elector/candidate split.
- Elector deployment locus at YouTube: "one agent in each region" — the per-region failure-scanner pattern.
- Vitess leadership-change rate: "in the order of once per day or week" — the duty cycle that justifies delegating leadership locks to etcd rather than the high-QPS Vitess data plane.
- Clock-skew tolerance: "typically in the milliseconds" — target skew level for inter-node clocks.
- Recommended sequencing granularity: "'many seconds' of granularity to sequence events" — safety margin for the skew.
- No other production numbers disclosed: no latency bounds, no lease-duration values, no elector-election-time measurements. The post is taxonomic rather than quantitative.
Caveats¶
- Taxonomic essay, not mechanism spec. Sugu names and contrasts the two approaches but does not specify a complete algorithm for either; the Vitess-specific implementation details (VTOrc's actual election state machine, the etcd lock-holding protocol, the lease-renewal cadence, the keyspace-scope of locks) are not disclosed.
- Published leadership, not read consistency, is the framing. The essay asserts that lock-free systems "have to resort to quorum reads" without analysing alternative mechanisms (leader-with-lease-under-lock-free, read-leases, linearizable-read-protocols like TrueTime-style leases). Treat the quorum-read conclusion as a Sugu-framing choice scoped to the simplest forms of each protocol family.
- Part 5 of 8 in the series; Parts 1-4 (foundations, use cases, revoke/establish) and Parts 6-8 (completing requests, propagating requests, closing thoughts) are not yet on the wiki — the broader series context will rebalance some of this post's framings when later parts canonicalise the propagation and completion layers.
- No explicit fencing-token discussion. The related but distinct fencing token concept (Kleppmann; Designing Data-Intensive Applications) — which solves "what if the deposed leader doesn't know it's deposed and still tries to write?" — is implicit in Sugu's "revoke all potential candidates, not just the known leader" rule but not named. A future ingest of Parts 6-7 (propagating requests) is the likely place this gets canonicalised on the wiki.
- Byline date ambiguity. The raw file's
published:field is 2026-04-21 (feed re-fetch date); the inline byline reads April 28, 2022. Architectural content is vintage 2022 Vitess but still current — the elector/candidate split and the etcd-for-leadership-lock design are unchanged in Vitess as of 2026. - "Lock-based converges faster" claim is assertion, not measurement. The argument ("first elector has most progress") is intuitive but not backed by benchmarks in the essay.
Source¶
- Original: https://planetscale.com/blog/consensus-algorithms-at-scale-part-5
- Raw markdown:
raw/planetscale/2026-04-21-consensus-algorithms-at-scale-part-5-handling-races-c81495d7.md
Related¶
- systems/vitess
- systems/vtorc
- systems/etcd
- systems/raft
- concepts/revoke-and-establish-split
- concepts/elector
- concepts/leader-lease
- concepts/proposal-number
- concepts/forward-progress
- concepts/consistent-read
- concepts/no-distributed-consensus
- concepts/split-brain
- concepts/soft-leader-election
- patterns/lock-based-leader-election
- patterns/lock-free-leader-election
- patterns/lease-based-consistent-read
- patterns/external-coordinator-for-leadership-lock
- patterns/leader-based-partition-replication
- companies/planetscale