---
title: "PlanetScale — Consensus algorithms at scale: Part 4 — Establishment and revocation"
type: source
created: 2026-04-23
updated: 2026-04-23
company: planetscale
author: Sugu Sougoumarane
published: 2022-04-06
fetched: 2026-04-21
url: https://planetscale.com/blog/consensus-algorithms-at-scale-part-4
tags: [planetscale, vitess, consensus, leader-election, revoke, establish, paxos, raft, reparenting, prs, ers, lameduck, query-buffering, vtgate, vttablet, sugu-sougoumarane, mysql, semi-sync-replication, software-rollout, failover, tier-3]
systems: [vitess, mysql, planetscale]
concepts: [leader-election, leader-revocation, leader-establishment, lameduck-mode, query-buffering-cutover, graceful-degradation]
patterns: [separate-revoke-from-establish, graceful-leader-demotion, zero-downtime-reparent-on-degradation]
sources: []
related: [systems/vitess, systems/mysql, systems/planetscale, concepts/leader-revocation, concepts/leader-establishment, concepts/lameduck-mode, concepts/query-buffering-cutover, concepts/no-distributed-consensus, patterns/separate-revoke-from-establish, patterns/graceful-leader-demotion, patterns/zero-downtime-reparent-on-degradation, companies/planetscale]
---
# PlanetScale — Consensus algorithms at scale: Part 4 — Establishment and revocation

## Summary

Sugu Sougoumarane (Vitess co-creator, PlanetScale, originally 2022-04-06, re-fetched via RSS 2026-04-21) publishes the fourth instalment of his consensus-algorithms-at-scale series. **The post's load-bearing claim**: every leader-based consensus algorithm performs two distinct actions when electing a new leader — **revoke** the previous leadership and **establish** the new one. Traditional majority-quorum algorithms (Paxos, Raft) conflate these into a single atomic action (a successful proposal-number push to a majority simultaneously revokes the old leader and establishes the new one), and this conflation hides the fact that the two concerns *could be separated*. Once you separate them, you can use different mechanisms for each, optimise each independently, and accommodate practical scenarios where majority-quorum-as-single-action doesn't fit.

**Worked production instance**: [[systems/vitess|Vitess]]'s two reparent operations — [`PlannedReparentShard`](https://vitess.io/docs/user-guides/configuration-advanced/reparenting/#plannedreparentshard-planned-reparenting) (**PRS**) for software-rollout-class planned changes, and [`EmergencyReparentShard`](https://vitess.io/docs/user-guides/configuration-advanced/reparenting/#emergencyreparentshard-emergency-reparenting) (**ERS**) for crash / network-partition-class unplanned failover — expose the separation at the operational-primitive level. PRS uses the **graceful-demotion path**: ask the current leader to step down → in-flight transactions complete under a `vttablet`-level [[concepts/lameduck-mode|lameduck]] → new transactions buffered at the `vtgate` proxy tier → once PRS completes, the buffered transactions flush to the new primary, and **the application sees no errors**. ERS uses the **fence-the-followers path**: the old leader is unreachable, so revocation is achieved by telling the followers to stop accepting its writes, then establishing a new leader.

**Design principle canonicalised**: *"It is important that we optimize for the common case."* Software rollouts are daily; crashes happen monthly or less. Making graceful PRS the common path is structurally correct because the failure-mode frequencies make the optimisation axis obvious. The dramatic "cut the network cable" and "dispatch a human to physically shut down a machine where a leader had gone rogue" anecdotes (Google) illustrate that **any** mechanism satisfying the revocation invariant is a valid revocation — separation of concerns across revocation methods is the unlock.

**Architectural contribution to the wiki**: canonicalises revoke and establish as separate concerns (not a single atomic action), the graceful-demotion cutover path, the lameduck primitive that makes it work, and the Vitess PRS/ERS operational shape. This is the **fifth canonical Vitess-internals disclosure on the wiki** after [[sources/2025-04-05-planetscale-faster-interpreters-in-go-catching-up-with-cpp|evalengine]] (expression evaluation), [[sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale|VReplication / VDiff / MoveTables]] (data motion), [[sources/2026-04-21-planetscale-achieving-data-consistency-with-the-consistent-lookup-vindex|Consistent Lookup Vindex]] (cross-shard writes), and [[sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-1|Throttler trilogy]] (load admission) — this post fills the **leader election / reparenting** axis of Vitess's control plane.

## Key takeaways

- **Every leader-based consensus algorithm performs two distinct actions when electing a new leader**: revoke the previous leadership, and establish the new one. *"An additional constraint is that a revoke must precede the establishment step. Otherwise, we will end up with more than one leader."* (Source: [[sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-4-establishment-and-revocation]]). The ordering is load-bearing — it is what keeps the at-most-one-leader invariant across transitions.

- **Majority-quorum algorithms conflate revoke + establish into a single atomic action, which hid the fact that they can be separated.** *"Because the revoke was implicitly achieved, it was never called out as a separate concern. More importantly, it was never called out as a concern that could be separated."* Sougoumarane's load-bearing framing: the conflation is an implementation accident of majority-quorum protocols, not a structural necessity of consensus. Separation lets you **use different mechanisms for each step** and optimise them independently.

- **Leadership as invariant: *"Leadership is established when all the parameters are in place for a leader to successfully complete requests. Any change that invalidates this condition is a revocation."*** Canonical wiki definition. Leadership is not a specific protocol action — it is **a set of conditions that must be true** for a node to act as leader. Revocation is any change that breaks those conditions. Establishment is any set of changes that makes them true again. This recasting is what unlocks swapping mechanisms per step.
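  The leadership-as-invariant recasting can be sketched in a few lines of Go. This is a conceptual illustration only — the condition names are hypothetical, not Vitess internals — showing that any change falsifying one condition is a revocation, and establishment is whatever restores them all:

  ```go
  package main

  import "fmt"

  // Leadership modeled as a set of named conditions that must all hold
  // for a node to successfully complete requests. The condition names
  // here (quorumAck, replicasPointAtMe) are illustrative assumptions.
  type Leadership struct {
  	conditions map[string]bool
  }

  // Established reports whether every condition currently holds.
  func (l *Leadership) Established() bool {
  	for _, ok := range l.conditions {
  		if !ok {
  			return false
  		}
  	}
  	return true
  }

  // Revoke invalidates any single condition; per the post, *any* change
  // that breaks the invariant counts as a valid revocation.
  func (l *Leadership) Revoke(condition string) {
  	l.conditions[condition] = false
  }

  // Establish makes all conditions true again for a new leader.
  func (l *Leadership) Establish() {
  	for c := range l.conditions {
  		l.conditions[c] = true
  	}
  }

  func main() {
  	old := &Leadership{conditions: map[string]bool{
  		"quorumAck":         true,
  		"replicasPointAtMe": true,
  	}}
  	old.Revoke("replicasPointAtMe") // e.g. repoint a semi-sync replica
  	fmt.Println("old leader established:", old.Established())
  }
  ```

  The point of the recasting is that `Revoke` and `Establish` are independent operations over the same invariant, which is what permits swapping mechanisms per step.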

- **Revocation methods are plural and interchangeable** as long as the invariant is satisfied. Traditional: push a new proposal number to a majority. MySQL-style: point a semi-sync replica at a different primary, or tell it to stop replicating. **Graceful**: ask the current leader to step down (only works if the leader is reachable — planned changes only). **Extreme**: cut the network cable that connects a leader to its followers (*"I know of one incident at Google where we had to dispatch a human to physically shut down a machine where a leader had gone rogue."*). Any mechanism that **makes the establishment condition false for the old leader** is a valid revocation.

- **If the current leader is known, ask it to step down — this is the graceful path.** *"If the current leader is known, requesting that leader to step down also results in a valid revocation. This method is generally more graceful because the leader has the opportunity to complete in-flight requests and also inform clients of an imminent change in leadership."* Graceful demotion is meaningful **only for planned changes** — if the leader is unreachable (crash, partition), fall back to fencing the followers. This is the core design choice that separates PRS (graceful) from ERS (fencing).

- **Vitess's PRS / ERS operational split canonicalised.** [[systems/vitess|Vitess]] exposes two commands at the shard level: [`PlannedReparentShard`](https://vitess.io/docs/user-guides/configuration-advanced/reparenting/#plannedreparentshard-planned-reparenting) for software rollouts (demote current primary gracefully before the update) and [`EmergencyReparentShard`](https://vitess.io/docs/user-guides/configuration-advanced/reparenting/#emergencyreparentshard-emergency-reparenting) for detected primary down/unreachable. The operational split is **the two revocation paths made into two user-facing commands**. Same underlying invariant; different mechanism per failure-mode class.

- **The PRS graceful cutover is a two-tier mechanism: lameduck at vttablet + buffering at vtgate.** *"If a PRS is issued, the low level vttablet component of vitess goes into a lameduck mode where it allows in-flight transactions to complete, but rejects any new ones. At the same time, the front-end proxies (vtgate) begin to buffer such new transactions. Once PRS completes, all buffered transactions are sent to the new primary, and the system resumes without serving any errors to the application."* Two distinct layers playing complementary roles: [[concepts/lameduck-mode|lameduck]] drains the sick node safely; [[concepts/query-buffering-cutover|query buffering]] absorbs the traffic gap. The composition is what makes *"no errors to the application"* possible.
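  The two-tier composition can be sketched as follows. This is a toy model under stated assumptions — the types and method names are invented for illustration and do not reflect the real vttablet/vtgate interfaces — but it captures why the composition yields zero application-visible errors:

  ```go
  package main

  import (
  	"errors"
  	"fmt"
  )

  // tabletState models the vttablet side: lameduck lets in-flight
  // transactions complete but rejects new ones.
  type tabletState struct {
  	lameduck bool
  	inFlight int
  }

  var errLameduck = errors.New("tablet in lameduck: rejecting new transactions")

  func (t *tabletState) begin() error {
  	if t.lameduck {
  		return errLameduck
  	}
  	t.inFlight++
  	return nil
  }

  func (t *tabletState) commit() { t.inFlight-- }

  // gateway models the vtgate side: while a reparent is under way,
  // new transactions are buffered instead of failing.
  type gateway struct {
  	reparenting bool
  	buffer      []string
  }

  // route either forwards a new transaction to the tablet or buffers it.
  func (g *gateway) route(t *tabletState, tx string) error {
  	if g.reparenting {
  		g.buffer = append(g.buffer, tx)
  		return nil // the application never sees the rejection
  	}
  	return t.begin()
  }

  // flush drains buffered transactions to the new primary once the
  // reparent completes, and returns how many were replayed.
  func (g *gateway) flush(newPrimary *tabletState) int {
  	n := 0
  	for range g.buffer {
  		if newPrimary.begin() == nil {
  			newPrimary.commit()
  			n++
  		}
  	}
  	g.buffer = nil
  	g.reparenting = false
  	return n
  }

  func main() {
  	oldPrimary := &tabletState{inFlight: 2} // two transactions mid-flight
  	g := &gateway{}

  	// PRS begins: tablet goes lameduck, gateway starts buffering.
  	oldPrimary.lameduck = true
  	g.reparenting = true

  	_ = g.route(oldPrimary, "tx-3") // buffered, not errored
  	oldPrimary.commit()             // in-flight work drains safely
  	oldPrimary.commit()

  	newPrimary := &tabletState{}
  	fmt.Println("flushed", g.flush(newPrimary), "buffered transaction(s)")
  }
  ```

  Note the division of labour: the tablet alone could only reject, and the gateway alone could not safely drain the old primary; the "no errors to the application" property emerges only from composing both tiers.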

- **"Optimise for the common case" — the load-bearing design principle.** *"A typical cluster could be completing thousands of requests per second. In contrast, a software rollout is likely a daily event. In further contrast, a node failure may happen once a month or even less frequently. It is important that we optimize for the common case. This means that we want leadership changes to be graceful during software rollout. Ideally, the application should see no errors during this time. The approach of demoting the current leader gives us this opportunity."* Canonical wiki articulation of the **failure-frequency-drives-optimisation-axis** heuristic at the distributed-systems altitude. PRS = daily event = invest in zero-error UX. ERS = monthly event = correctness over elegance.

- **Two revocation algorithms are interchangeable across rounds.** *"Let us assume that a leadership is established by satisfying conditions A and B. One algorithm achieves revocation by making condition A false, and the other by making condition B false. In both cases, it is a successful revocation. Once revocation is complete, both algorithms have to make conditions A and B true for the new leader, which will allow for subsequent rounds to use any method of revocation."* Canonical wiki datum: **revocation methods don't need to be uniform across rounds** — the invariant is what needs to hold, not the specific path taken. The composite system can graceful-demote on Monday and fence-followers on Tuesday; both are valid if they satisfy the invariant.

- **Race handling is a separate concern, deferred to the next blog post.** Sougoumarane explicitly scopes this instalment to establishment + revocation, deferring forward-progress, race-handling, and request-propagation to part 5. *"If so, the revocation must be performed against all potential leaders. In other words, the election process must reach enough nodes to be sure that no existing leader can complete their requests. This will become more clear in the next blog where we will cover race conditions."* The post canonicalises that **when the current leader is not known, revocation must fence enough of the follower population to guarantee the old leader cannot complete writes** — the lower bound on who-must-be-reached is determined by the establishment invariant, not by majority-quorum arithmetic.
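The fence-enough-followers lower bound can be illustrated with a semi-sync-style rule. This sketch assumes (as the post's MySQL example suggests, though the exact rule is mine) that a primary can only complete a write once at least one follower acknowledges it; fencing every potentially-acking follower then falsifies the establishment condition for an unreachable old leader:

```go
package main

import "fmt"

// follower records which primary this follower will acknowledge
// writes from (a stand-in for a semi-sync replication source pointer).
type follower struct {
	acksFrom string
}

// canComplete reports whether primary can complete a write under the
// assumed rule: at least one follower must still acknowledge it.
func canComplete(primary string, followers []*follower) bool {
	for _, f := range followers {
		if f.acksFrom == primary {
			return true
		}
	}
	return false
}

func main() {
	followers := []*follower{{acksFrom: "old"}, {acksFrom: "old"}}

	// ERS cannot reach the old leader, so it revokes by repointing
	// every follower that might still acknowledge it. Reaching fewer
	// than all potential ackers would leave the old leader live.
	for _, f := range followers {
		f.acksFrom = "new"
	}
	fmt.Println("old leader can complete writes:", canComplete("old", followers))
	fmt.Println("new leader can complete writes:", canComplete("new", followers))
}
```

This makes the post's point tangible: the set of nodes the election must reach is whatever suffices to falsify the old leader's establishment condition, which need not coincide with a majority.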

## Systems

- **[[systems/vitess|Vitess]]** — MySQL sharding substrate originally built at YouTube; canonical wiki instance of separate-revoke-from-establish at the operational-command altitude via PRS + ERS.
- **[[systems/mysql|MySQL]]** — the underlying storage engine whose **semi-synchronous replication** + replication-source pointer provide the primitives Vitess's reparenting mechanism composes.
- **[[systems/planetscale|PlanetScale]]** — publisher of the consensus series (Sougoumarane was PlanetScale co-founder at time of writing, also Vitess co-creator).

## Concepts

- **[[concepts/leader-revocation|Leader revocation]]** *(new)* — the act of invalidating conditions that allow a leader to complete requests; first-class concern separable from establishment.
- **[[concepts/leader-establishment|Leader establishment]]** *(new)* — the act of satisfying the conditions required for a leader to complete requests.
- **[[concepts/lameduck-mode|Lameduck mode]]** *(new)* — the vttablet state in which in-flight transactions are allowed to complete but new transactions are rejected; the drain primitive under graceful demotion.
- **[[concepts/query-buffering-cutover|Query buffering at cutover]]** — existing wiki page; extended with the Vitess PRS reparenting instance alongside the existing MoveTables cutover framing.
- **[[concepts/no-distributed-consensus|No distributed consensus]]** — existing wiki page; this post canonicalises the *other* end of the spectrum (**yes-consensus systems with explicit revoke/establish separation**) as structural contrast.
- **[[concepts/graceful-degradation|Graceful degradation]]** — existing wiki page; reparenting as a graceful-degradation mechanism at the cluster-topology layer.

## Patterns

- **[[patterns/separate-revoke-from-establish|Separate revoke from establish in leader election]]** *(new)* — treat revocation and establishment as independent steps with independently-swappable mechanisms; generalises beyond Vitess.
- **[[patterns/graceful-leader-demotion|Graceful leader demotion for planned transitions]]** *(new)* — ask the current leader to step down + drain in-flight via lameduck + buffer new traffic at proxy tier + flush to new leader; application sees no errors.
- **[[patterns/zero-downtime-reparent-on-degradation|Zero-downtime reparent on degradation]]** — existing pattern (PlanetScale EBS-failure-rate post); this source provides the **algorithmic foundation** for why the pattern works (separate-revoke-from-establish + graceful-leader-demotion composed).

## Operational numbers

- **Software rollouts: daily**. Node failures: monthly or less. (Both framed verbatim as orders-of-magnitude relative to request rate.)
- **Request rate: "thousands of requests per second" per typical cluster** — the implicit denominator for why *"no application-visible errors"* is worth engineering for.
- No disclosed numbers for: PRS latency, ERS latency, lameduck drain time, vtgate buffer depth, buffered-transaction cap, reparent success rate.

## Caveats

- **Theoretical / conceptual post**, not a production retrospective. No production incidents narrated; no PRS/ERS success-rate telemetry; no lameduck-duration distributions; no vtgate buffer-depth statistics.
- **Post defers race-handling to Part 5** (which would become the next instalment in the series). This post's quorum-alternative argument is sketched but not formalised: *"the election process must reach enough nodes to be sure that no existing leader can complete their requests"* — *"enough"* is protocol-dependent, not specified here.
- **Vitess-specific mechanism disclosure is shallow** — PRS + ERS are named and their behavioural contracts summarised, but internal mechanisms (how vtgate knows when PRS completes, how lameduck state is announced, how the new primary is selected, how errant GTIDs affect candidate selection) are elided. For Vitess operational depth see Vitess docs; the [[sources/2026-04-21-planetscale-announcing-vitess-21|Vitess 21 release notes]] canonicalise VTOrc's errant-GTID tracking as a precursor to reparent decisions.
- **Semi-sync MySQL replication framing is illustrative, not prescriptive.** The post uses it as one example of a non-proposal-number revocation mechanism; it does not claim Vitess exclusively uses semi-sync replication for reparenting.
- **"Cutting the network cable" framing is rhetorical.** The Google anecdote illustrates that any invariant-preserving revocation is valid; it is not a recommendation to cut cables in practice.
- **The *"two algorithms are interchangeable across rounds"* claim is conceptual, not production-validated.** Most production systems pick one graceful and one emergency path and don't hot-swap revocation algorithms per round — the interchangeability is an architectural freedom, not a common deployment shape.

## Series context

This is Part 4 of Sougoumarane's 5+ part Consensus Algorithms at Scale series. Referenced predecessors:
- [Part 1](https://planetscale.com/blog/consensus-algorithms-at-scale-part-1) — original consensus properties (Paxos) and why practical scenarios need modification.
- [Part 2](https://planetscale.com/blog/consensus-algorithms-at-scale-part-2) — practical-scenario-ready consensus modified to accept a series of requests; single-leader narrowing.
- [Part 3](https://planetscale.com/blog/consensus-algorithms-at-scale-part-3) — durability-agnostic rules; rejection of majority-quorum as a core building block.
- Part 4 (this post) — revocation + establishment separation; Vitess PRS/ERS as worked instance.
- Part 5 (forthcoming at publication time) — race handling + forward progress.

Part 5 and any subsequent parts are not currently on the wiki; re-scraping PlanetScale's blog would surface them if published.

## Source

- Original: [https://planetscale.com/blog/consensus-algorithms-at-scale-part-4](https://planetscale.com/blog/consensus-algorithms-at-scale-part-4)
- Raw markdown: [`raw/planetscale/2026-04-21-consensus-algorithms-at-scale-part-4-establishment-and-revoc-55cef09a.md`](../../raw/planetscale/2026-04-21-consensus-algorithms-at-scale-part-4-establishment-and-revoc-55cef09a.md)

## Related

- [[systems/vitess]] — the canonical production instance of PRS + ERS as operational commands.
- [[systems/mysql]] — semi-sync replication as a revocation-mechanism primitive.
- [[systems/planetscale]] — publisher; author was co-founder.
- [[concepts/leader-revocation]] — new concept canonicalising the revocation step as a first-class concern.
- [[concepts/leader-establishment]] — new concept canonicalising the establishment step as a first-class concern.
- [[concepts/lameduck-mode]] — new concept canonicalising the drain primitive.
- [[concepts/query-buffering-cutover]] — canonical wiki primitive; extended with PRS instance.
- [[concepts/no-distributed-consensus]] — the structural contrast on the consensus-adoption axis.
- [[patterns/separate-revoke-from-establish]] — new pattern generalising the post's load-bearing insight.
- [[patterns/graceful-leader-demotion]] — new pattern for the planned-transition path.
- [[patterns/zero-downtime-reparent-on-degradation]] — existing pattern (PlanetScale EBS post); this source provides the algorithmic foundation.
- [[companies/planetscale]] — publisher company page.
