PATTERN
External coordinator for leadership lock¶
Pattern¶
Instead of building the leadership-lock primitive inside your consensus protocol (fused with the revoke/establish round as in classical Raft/Paxos), delegate lock acquisition to an external consensus system — typically etcd, ZooKeeper, or Chubby — and run your own per-request data-plane substrate independently.
Why accept the oddity of running one consensus system on top of another? Because the two systems operate at wildly asymmetric cost:
- Data plane: thousands of requests per second per cluster → aggressive tuning, direct-to-leader reads, custom replication topology, etc.
- Leadership changes: once per day or week → can afford the full round-trip to the external coordinator, because the cost is amortised across hundreds of thousands of data-plane ops.
The coordinator is built for strong semantics on the slow path. The data plane is built for throughput on the fast path. The architectural pay-off is decoupling the tuning axes.
Canonical framing¶
Sugu Sougoumarane's canonical-source passage in Part 5:
"In Vitess, we use an external system like etcd to obtain such a lock. The decision to rely on another consensus system to implement our own may seem odd. But the difference is that Vitess itself can complete a massive number of requests per second. Its usage of etcd is only for changing leadership, which is in the order of once per day or week." (Source: sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-5-handling-races)
The cost-asymmetry argument is the load-bearing justification. "Thousands of requests per second" vs "once per day or week" is roughly eight to nine orders of magnitude — a few thousand requests per second is on the order of 10⁸ operations per day against a single leadership change — so any reasonable coordinator overhead is rounding error against the amortised budget.
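The arithmetic behind the asymmetry is worth making explicit. A minimal sketch, assuming an illustrative 1,000 req/s data-plane rate (not a Vitess benchmark) and the "once per day" end of the leadership-change range:

```python
import math

req_per_sec = 1_000                      # assumed data-plane rate, illustrative
data_ops_per_day = req_per_sec * 86_400  # seconds per day
leader_changes_per_day = 1               # "once per day" end of the range

ratio = data_ops_per_day / leader_changes_per_day
orders = math.log10(ratio)
print(f"ratio ~ {ratio:.1e} ({orders:.0f} orders of magnitude)")
# → ratio ~ 8.6e+07 (8 orders of magnitude)
```

Even a coordinator round-trip measured in hundreds of milliseconds disappears against that denominator.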
The four steps, decoupled¶
In a lock-based design (patterns/lock-based-leader-election), leadership change has four serial steps:
- Acquire lock
- Revoke prior leader
- Establish new leader
- Release lock
The external-coordinator pattern decouples step 1 onto a different substrate from steps 2–3. In Vitess:
- Step 1 (acquire lock) → etcd. Strong consensus, Raft-backed, external-world-semantic lock.
- Steps 2–3 (revoke / establish) → Vitess-internal operations: PROMOTE/DEMOTE MySQL primitives, VReplication catch-up, topology updates, vtgate routing-table refresh.
- Step 4 (release lock) → etcd, usually implicit via lease expiry.
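The decoupled control flow can be sketched with a toy in-process lock standing in for etcd. Everything here (`ToyCoordinator`, `change_leadership`, the `cluster` dict) is an illustrative stand-in, not a Vitess or etcd API:

```python
import threading

class ToyCoordinator:
    """Stand-in for etcd: hands out an exclusive leadership lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def acquire(self, elector):
        self._lock.acquire()          # step 1: slow path, external substrate
        self.holder = elector

    def release(self, elector):
        assert self.holder == elector
        self.holder = None
        self._lock.release()          # step 4 (etcd makes this implicit via lease expiry)

def change_leadership(coord, elector, cluster):
    coord.acquire(elector)                            # step 1 → coordinator
    try:
        cluster["old_leader_writable"] = False        # step 2: revoke (e.g. DEMOTE on MySQL)
        cluster["leader"] = cluster["candidate"]      # step 3: establish (PROMOTE + catch-up)
    finally:
        coord.release(elector)                        # step 4 → coordinator
```

Steps 2–3 never touch the coordinator; only the bracketing lock operations do, which is exactly the cost split the pattern relies on.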
Why this is architecturally clean¶
The invariant each substrate is responsible for is simpler¶
Majority-quorum consensus systems (Raft, Paxos) bundle lock + revoke + establish into a single round-trip. The invariant each message satisfies is load-bearing on all three operations, which makes the protocol subtle and makes extending any single concern (e.g. pluggable durability) risky.
Splitting lock acquisition onto a different substrate means:
- etcd's invariant: "only one client holds this key's lock at a time." Simple, well-understood.
- Vitess's invariant (on top of the lock): "the elector that holds the etcd lock can safely revoke the prior leader and establish a new one." Also simple, because the lock already guarantees race resolution.
Each substrate can be tuned independently¶
etcd's cluster topology, replication parameters, and lease-renewal cadence are tuned for "is the lock held? who holds it?" — questions asked a few times per cluster per day.
Vitess's data-plane topology, semi-sync replication count, and per-shard replica placement are tuned for "can we service thousands of requests/sec at p99 < 10 ms?" — different axes entirely.
Fusing the two substrates forces compromise on both tuning axes; separating them eliminates the tension.
Composes naturally with pluggable durability¶
The pattern composes with patterns/pluggable-durability-rules because the lock and the durability predicate are on different substrates. Vitess can specify cross-zone or cross-region durability via VTOrc's plugin API without touching etcd's topology; the durability rule is a Vitess-layer concern; the lock is an etcd-layer concern.
Related benefit: graceful-degradation-under-coordinator-outage¶
If etcd is temporarily unavailable, Vitess cannot perform leadership changes — but existing leaders keep serving data-plane traffic for the remainder of the lease granted before the outage. "Any operation that does not want the leader to change just has to obtain a lock before doing its work" — but per-request serving doesn't need a fresh lock every time. The data plane has time-bounded independence from the coordinator.
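That time-bounded independence can be modelled in a few lines. A minimal sketch with illustrative names (`Lease`, `serve_request`; not the etcd client API):

```python
class Lease:
    """Stand-in for a coordinator-granted lease: valid until expiry."""
    def __init__(self, granted_at, ttl):
        self.expires_at = granted_at + ttl

    def valid(self, now):
        return now < self.expires_at


def serve_request(lease, now):
    """Fast path: needs only the lease granted before any outage."""
    if not lease.valid(now):
        raise RuntimeError("lease expired; cannot confirm leadership")
    return "served"


def change_leadership(coordinator_up):
    """Slow path: needs the coordinator itself."""
    if not coordinator_up:
        raise RuntimeError("coordinator down; leadership frozen")
    return "changed"
```

While the coordinator is down, `serve_request` keeps succeeding and only `change_leadership` fails — until the lease runs out, at which point the data plane must stop claiming leadership too.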
This is a structural improvement over fused designs where losing coordinator quorum means losing the data plane.
Cost of the pattern¶
Running two consensus systems¶
Operational overhead: etcd cluster standup, upgrade, monitoring, tuning, capacity planning. For a single-service deployment this is significant; for a multi-service platform it amortises across all services using the same etcd cluster.
Coordinator becomes a SPOF if mis-scoped¶
If the coordinator is used for more than just leadership locks — e.g., per-request coordination, sequencing, or metadata — it becomes a hard SPOF on the fast path. The pattern specifies "only for changing leadership" precisely to avoid this: the coordinator is on the cold path and failure is tolerable for short windows.
Lease management is application-level¶
Unlike fused designs, the application must implement lock-lease renewal explicitly. The elector holds an etcd lease, renews it periodically, and must handle revocation-on-expiry cleanly. Lock-loss protocols (what happens when the elector loses the lock mid-operation?) are application-level concerns rather than embedded in the protocol.
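What "lock-loss protocols are application-level" means in practice: the elector must re-check lease validity before every side-effecting step and abort if it has expired. A toy sketch with a fake clock (names illustrative, not the etcd client API):

```python
class Lease:
    """Toy lease: valid for `ttl` seconds past the most recent renewal."""
    def __init__(self, ttl, now=0.0):
        self.ttl = ttl
        self.expires_at = now + ttl

    def renew(self, now):
        self.expires_at = now + self.ttl

    def valid(self, now):
        return now < self.expires_at


def run_guarded(lease, steps, clock):
    """Re-check the lease before every step: losing the lock mid-operation
    voids the race-resolution guarantee, so the elector must abort."""
    done = []
    for step in steps:
        if not lease.valid(clock()):
            raise RuntimeError(f"lock lost before {step!r}; abort and re-elect")
        done.append(step)
    return done
```

A real elector would renew on a background cadence (well under the TTL) and treat the abort path as "re-enter the election", not "retry the step".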
Canonical production instances¶
- Vitess + VTOrc + etcd — the pattern's canonical instance. VTOrc acquires the etcd lock, runs the MySQL revoke/establish, updates the Vitess topology, releases the lock. "In Vitess, the current leader for each cluster is published through its topology, and a large number of workflows rely on this information to perform various tasks. Any operation that does not want the leader to change just has to obtain a lock before doing its work."
- Kubernetes + etcd — control-plane components such as kube-controller-manager and kube-scheduler are the electors; their leader leases and the cluster state both live in etcd, behind the API server. Similar shape: coordinator for locking + metadata, different substrate for workloads.
- Google Chubby / ZooKeeper — original external-coordinator substrates. Chubby's original paper (Burrows, 2006) makes the same cost-asymmetry argument: coarse-grained locks, held for hours or days, rather than tight consensus on every operation.
- HashiCorp Consul / Nomad — same pattern at the service-discovery altitude.
Composition¶
- With patterns/lock-based-leader-election — the external-coordinator pattern is a specialisation; lock-based is the parent family.
- With patterns/pluggable-durability-rules — orthogonal; composable. Vitess composes both.
- With patterns/separate-revoke-from-establish — structurally enabled by the pattern; the external lock is step 1, revoke/establish are steps 2–3 on a different substrate.
When to avoid¶
- Small self-contained clusters where running two consensus systems is disproportionate overhead.
- Hard-real-time leadership change SLAs where the extra RPC to the external coordinator is unacceptable.
- Platforms without an existing coordinator — building etcd/ZooKeeper just for leadership locks is usually worse than implicit-lock Raft.
Sugu's posture: for large-scale production databases and control planes, the cost-asymmetry argument dominates, and the pattern wins. For everything else, implicit-lock Raft is usually simpler.
Seen in¶
- sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-5-handling-races — canonical-source post. Introduces the pattern with Vitess+etcd as the canonical instance; the cost-asymmetry argument is stated explicitly; the "once per day or week" vs "massive number of requests per second" framing is canonicalised.