PATTERN Cited by 1 source
Graceful leader demotion for planned transitions¶
Problem¶
When you need to change leaders for planned reasons — software rollout on the current primary, scheduled node maintenance, capacity rebalance — you have structural options most emergency-failover paths don't:
- The current leader is reachable and cooperative — you can ask it to do things.
- You can prepare a successor (candidate replica caught up, promoted to writable) before the cutover.
- You control the timing — the cutover can wait until both sides are ready.
If you reuse the emergency-failover mechanism built for an unplanned crash (fence the followers, promote a replica, hope the GTIDs line up), you surface application-visible errors during the cutover window and exploit none of the structural advantages of the planned case. Yet planned transitions happen daily (the software-rollout cadence), while crashes happen monthly or less. Hence Sougoumarane's load-bearing framing: "It is important that we optimize for the common case."
Solution¶
A three-layer composition (step-down, lameduck drain, proxy buffering) that exploits every advantage of the planned case:
1. Ask the current leader to step down (the graceful revocation mechanism; requires the leader to be reachable and cooperative).
2. Drain in-flight work via lameduck mode on the storage node: allow open transactions to complete, reject new ones. This bounds the residual work to what was already admitted and gives the leader time to flush cleanly.
3. Buffer new traffic at the proxy tier via query buffering: the application-facing proxy holds new transactions in memory while the old leader drains and the new leader is established. No errors surface to the application.
4. Establish the new leader (promote the prepared replica; update topology; redirect replicas).
5. Flush the proxy buffer to the new leader. The buffered transactions execute on the new primary; the application never saw the gap.
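The five steps above can be sketched end to end. This is a toy model with invented `node` and `proxy` types, not the Vitess API; timing, replication, and error handling are elided.

```go
package main

import "fmt"

// node models a storage node (the vttablet role in this sketch).
type node struct {
	name     string
	writable bool // accepts new transactions
	inFlight int  // transactions admitted but not yet complete
}

// proxy models the application-facing tier (the vtgate role).
type proxy struct {
	buffering bool
	buffer    []string
	primary   *node
}

// write routes a transaction: buffer it during the cutover window,
// otherwise execute it on the current primary.
func (p *proxy) write(tx string) {
	if p.buffering {
		p.buffer = append(p.buffer, tx) // step 3: hold new traffic in memory
		return
	}
	p.primary.inFlight++
}

// flush replays the buffered transactions on the new primary (step 5).
func (p *proxy) flush(newPrimary *node) int {
	p.primary = newPrimary
	n := len(p.buffer)
	newPrimary.inFlight += n
	p.buffer = nil
	p.buffering = false
	return n
}

func main() {
	old := &node{name: "tablet-100", writable: true}
	cand := &node{name: "tablet-101"} // caught-up replica, prepared in advance

	p := &proxy{primary: old}
	p.write("tx1") // normal operation: executes on the old primary

	// Cutover window.
	old.writable = false // steps 1-2: step down; lameduck rejects new work
	p.buffering = true   // step 3: proxy buffers new transactions
	p.write("tx2")       // arrives mid-cutover: buffered, no error surfaced
	old.inFlight--       // step 2: admitted work drains to completion
	cand.writable = true // step 4: promote the prepared replica
	n := p.flush(cand)   // step 5: replay the buffer on the new primary

	fmt.Println(n, cand.inFlight) // 1 1
}
```

The key property the sketch demonstrates: `tx2` arrives while no writable primary exists, yet the application never sees an error, because the proxy absorbs it and replays it after establishment.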
Canonical instance — Vitess PRS¶
Vitess's PlannedReparentShard command implements this pattern verbatim.
Sougoumarane's disclosure (Source: sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-4-establishment-and-revocation):
"If a PRS is issued, the low level vttablet component of vitess goes into a lameduck mode where it allows in-flight transactions to complete, but rejects any new ones. At the same time, the front-end proxies (vtgate) begin to buffer such new transactions. Once PRS completes, all buffered transactions are sent to the new primary, and the system resumes without serving any errors to the application."
The two-tier composition is what makes "no errors to the application" achievable:
```
Application writes
           │
           ↓
┌──────────────────────┐
│ vtgate (proxy)       │  ← buffer new transactions here
│                      │    during cutover window
└──────────┬───────────┘
           │
┌──────────┴───────────┐
│ vttablet             │
│ (old primary)        │  ← lameduck: drain in-flight,
│                      │    reject new
└──────────────────────┘
           │
           │  PRS selects new primary,
           │  promotes replica, updates topology
           ↓
┌──────────────────────┐
│ vttablet             │
│ (new primary)        │
└──────────┬───────────┘
           ↑
┌──────────┴───────────┐
│ vtgate (proxy)       │  ← flush buffer to new primary
│                      │
└──────────────────────┘
```
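The storage-node half of the composition is the lameduck admission gate: transactions admitted before the cutover may finish, new ones are rejected until the drain completes. A minimal sketch with an invented `lameduck` type (the real vttablet logic is far richer):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// lameduck gates transaction admission on a storage node.
type lameduck struct {
	mu       sync.Mutex
	draining bool
	inFlight sync.WaitGroup
}

var errLameduck = errors.New("node is in lameduck mode; transaction rejected")

// Begin admits a new transaction unless the node is draining.
// The caller invokes the returned func on commit or rollback.
func (l *lameduck) Begin() (func(), error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.draining {
		return nil, errLameduck
	}
	l.inFlight.Add(1)
	return func() { l.inFlight.Done() }, nil
}

// Drain flips the node into lameduck mode and blocks until every
// previously admitted transaction has completed.
func (l *lameduck) Drain() {
	l.mu.Lock()
	l.draining = true
	l.mu.Unlock()
	l.inFlight.Wait()
}

func main() {
	var l lameduck
	done, _ := l.Begin()     // admitted before the cutover
	go func() { done() }()   // the in-flight transaction commits
	l.Drain()                // blocks until all admitted work finishes
	if _, err := l.Begin(); err != nil {
		fmt.Println("new transaction rejected during cutover")
	}
}
```

Note the ordering: `Drain` flips the flag before waiting, so no transaction can be admitted after the drain begins; this is what bounds the residual work to what was already in flight.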
Why this beats the emergency path for planned transitions¶
The emergency path (fence-the-followers, as in Vitess's EmergencyReparentShard) is correctness-first: its job is to guarantee at-most-one-leader when the old primary is unreachable. It necessarily has a window during which the application sees errors — detection latency + fence latency + promotion latency.
The graceful path is UX-first: because the old primary is reachable and cooperative, you can orchestrate a seamless handoff. The engineering cost is the two-tier composition; the payoff is zero application-visible errors on every software rollout.
Sougoumarane on the cost-benefit calculus: "A typical cluster could be completing thousands of requests per second. In contrast, a software rollout is likely a daily event. In further contrast, a node failure may happen once a month or even less frequently." Invest in the daily case.
When to use¶
- You have planned leader transitions (software rollouts, planned maintenance) as a routine operational event — daily or more often.
- The current leader is reliably reachable and cooperative before the transition — not crashed, not partitioned, not hung.
- You have a caught-up replica ready to promote — the establishment step has a prepared substrate.
- Zero application-visible errors is an explicit requirement — customer-facing OLTP, SaaS platforms, high-concurrency revenue paths.
- You can afford a proxy tier that absorbs traffic during the cutover window.
When not to use¶
- Unplanned failover — the leader is unreachable or hung. Use an emergency-fencing path instead; see patterns/separate-revoke-from-establish for the structural split.
- No proxy tier available — without a buffering layer, new traffic has nowhere to go during the cutover window; clients will see connection errors.
- Leadership transitions are too rare to justify the engineering — if you transition leaders once a year, the two-tier machinery is overbuilt.
- The leader's step-down mechanism isn't reliable — if "please step down" can be ignored or hang, you don't have the graceful substrate; degrade to emergency path.
Trade-offs¶
- Proxy buffer depth is finite. If PRS stalls, the buffer fills and clients see errors anyway. Production deployments set buffer caps + timeout + fallback.
- Lameduck drain time bounds rollout speed. Long-running transactions at cutover time delay the transition; production paths cap drain time and force-terminate stragglers.
- The step-down signal must be reliable. If the old leader doesn't honour step-down, the pattern degrades to emergency fencing — you need a timeout + escalation path.
- Composing lameduck with query buffering requires cooperation between two tiers (storage + proxy) that may be maintained by different teams. The operational contract between them is load-bearing.
- Correlated-failure caveat — if the new primary is on the same underlying substrate (same AZ, same EBS volume class, same hypervisor) as the old one, graceful handoff doesn't help against a substrate-wide event. See patterns/zero-downtime-reparent-on-degradation for the on-degradation variant + patterns/shared-nothing-storage-topology for the structural fix.
Seen in¶
- sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-4-establishment-and-revocation — canonical wiki introduction; Sougoumarane canonicalises the pattern via Vitess PRS and the lameduck + query-buffering composition.
- sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs — PlanetScale's production deployment of graceful reparent as the on-degradation mitigation; see patterns/zero-downtime-reparent-on-degradation for the degradation-triggered variant.
Related¶
- patterns/separate-revoke-from-establish — the higher-level pattern this composes on.
- patterns/zero-downtime-reparent-on-degradation — the on-degradation deployment of this mechanism; composes the same lameduck + buffering primitives with automated detection.
- concepts/leader-revocation — the revocation step; here achieved by step-down.
- concepts/leader-establishment — the establishment step; here achieved by promoting the prepared replica.
- concepts/lameduck-mode — the drain primitive on the storage node.
- concepts/query-buffering-cutover — the proxy-tier primitive that absorbs traffic during cutover.
- systems/vitess — canonical production instance via PRS + vttablet + vtgate.