
External coordinator for leadership lock

Pattern

Instead of building the leadership-lock primitive inside your consensus protocol (fused with the revoke/establish round as in classical Raft/Paxos), delegate lock acquisition to an external consensus system — typically etcd, ZooKeeper, or Chubby — and run your own per-request data-plane substrate independently.

Why accept the oddity of running one consensus system on top of another? Because the two systems operate at wildly asymmetric cost:

  • Data plane: thousands of requests per second per cluster → aggressive tuning, direct-to-leader reads, custom replication topology, etc.
  • Leadership changes: once per day or week → can afford the full round-trip to the external coordinator, because the cost is amortised against hundreds of thousands of data-plane ops.

The coordinator is built for strong semantics on the slow path. The data plane is built for throughput on the fast path. The architectural pay-off is decoupling the tuning axes.

Canonical framing

The canonical passage, from Part 5 of Sugu Sougoumarane's series:

"In Vitess, we use an external system like etcd to obtain such a lock. The decision to rely on another consensus system to implement our own may seem odd. But the difference is that Vitess itself can complete a massive number of requests per second. Its usage of etcd is only for changing leadership, which is in the order of once per day or week." (Source: sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-5-handling-races)

The cost-asymmetry argument is the load-bearing justification. "Thousands of requests per second" vs "once per day or week" is roughly eight orders of magnitude (closer to nine for weekly changes) — any reasonable coordinator overhead amortises to rounding error.
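
A quick sanity check on that ratio, assuming a conservative 1,000 requests per second and one leadership change per day:

$$
\frac{10^{3}\ \text{requests/s} \times 86{,}400\ \text{s/day}}{1\ \text{leadership change/day}} \approx 8.6 \times 10^{7}
$$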

The four steps, decoupled

In a lock-based design (patterns/lock-based-leader-election), leadership change has four serial steps:

  1. Acquire lock
  2. Revoke prior leader
  3. Establish new leader
  4. Release lock

The external-coordinator pattern decouples step 1 onto a different substrate from steps 2–3; a code sketch of the handoff follows the list. In Vitess:

  • Step 1 (acquire lock) → etcd. Strong consensus, Raft-backed, external-world-semantic lock.
  • Steps 2–3 (revoke / establish) → Vitess-internal operations. PROMOTE / DEMOTE MySQL primitives, VReplication catch-up, topology updates, vtgate routing-table refresh.
  • Step 4 (release lock) → etcd, usually implicit via lease expiry.
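
A minimal sketch of that split in Go, using etcd's client-side concurrency API for steps 1 and 4. The lock key, the 15-second TTL, and the revokePriorLeader / establishNewLeader functions standing in for steps 2–3 are illustrative assumptions, not Vitess's actual code:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// Hypothetical stand-ins for the Vitess-internal steps 2-3
// (DEMOTE/PROMOTE, VReplication catch-up, topology update).
func revokePriorLeader(ctx context.Context) error  { return nil }
func establishNewLeader(ctx context.Context) error { return nil }

func changeLeadership(ctx context.Context, cli *clientv3.Client, shard string) error {
	// Step 1: acquire the lock on the external coordinator. The session's
	// lease (15s here) is what makes step 4 implicit: if this process dies,
	// etcd releases the lock when the lease expires.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(15))
	if err != nil {
		return err
	}
	defer sess.Close()

	mu := concurrency.NewMutex(sess, "/locks/leadership/"+shard)
	if err := mu.Lock(ctx); err != nil { // blocks until the lock is held
		return err
	}
	defer mu.Unlock(context.Background()) // step 4, explicit release

	// Steps 2-3 run on the data-plane substrate, serialized by the lock.
	if err := revokePriorLeader(ctx); err != nil {
		return err
	}
	return establishNewLeader(ctx)
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	if err := changeLeadership(ctx, cli, "commerce/-80"); err != nil {
		log.Fatal(err)
	}
}
```

Crash safety falls out of the lease: a dead elector stops renewing, the TTL expires, and etcd releases the lock with no action from the failed process.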

Why this is architecturally clean

Each substrate's invariant is simpler

Majority-quorum consensus systems (Raft, Paxos) bundle lock + revoke + establish into a single round-trip. The invariant each message satisfies is load-bearing on all three operations, which makes the protocol subtle and makes extending any single concern (e.g. pluggable durability) risky.

Splitting lock acquisition onto a different substrate means:

  • etcd's invariant: "only one client holds this key's lock at a time." Simple, well-understood.
  • Vitess's invariant (on top of the lock): "the elector that holds the etcd lock can safely revoke the prior leader and establish a new one." Also simple, because the lock already guarantees race resolution.

Each substrate can be tuned independently

etcd's cluster topology, replication parameters, and lease-renewal cadence are tuned for "is the lock held? who holds it?" — questions asked a few times per cluster per day.

Vitess's data-plane topology, semi-sync replication count, and per-shard replica placement are tuned for "can we service thousands of requests/sec at p99 < 10 ms?" — different axes entirely.

Fusing the two substrates forces compromise on both tuning axes; separating them eliminates the tension.

Composes naturally with pluggable durability

The pattern composes with patterns/pluggable-durability-rules because the lock and the durability predicate are on different substrates. Vitess can specify cross-zone or cross-region durability via VTOrc's plugin API without touching etcd's topology; the durability rule is a Vitess-layer concern; the lock is an etcd-layer concern.
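
A hypothetical Go sketch of what "the durability rule is a Vitess-layer concern" can look like, in the spirit of VTOrc's pluggable durability policies; the names here are illustrative, not Vitess's actual API:

```go
package durability

// Tablet is a hypothetical stand-in for a Vitess tablet record; only the
// cell (zone/region) matters for this sketch.
type Tablet struct {
	Cell string
}

// Policy answers durability questions for the data plane. Nothing here
// touches the lock substrate: etcd's topology is unaffected by how strict
// the durability rule is.
type Policy interface {
	// SemiSyncAckers is how many replicas must acknowledge a transaction
	// before the primary reports it durable.
	SemiSyncAckers(primary Tablet) int
	// IsEligibleAcker reports whether a replica may count toward that quorum.
	IsEligibleAcker(primary, replica Tablet) bool
}

// CrossZone demands one acknowledgement from outside the primary's zone.
type CrossZone struct{}

func (CrossZone) SemiSyncAckers(Tablet) int { return 1 }

func (CrossZone) IsEligibleAcker(primary, replica Tablet) bool {
	return replica.Cell != primary.Cell
}
```

Swapping CrossZone for a stricter cross-region policy changes only data-plane behaviour; the etcd layer never sees the difference.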

The data plane survives coordinator outages

If etcd is temporarily unavailable, Vitess cannot perform leadership changes — but existing leaders keep serving data-plane traffic on the strength of the lease granted before the outage. "Any operation that does not want the leader to change just has to obtain a lock before doing its work" — but per-request serving doesn't need a new lock every time. The data plane has a time-bounded independence from the coordinator.

This is a structural improvement over fused designs where losing coordinator quorum means losing the data plane.

Cost of the pattern

Running two consensus systems

Operational overhead: etcd cluster standup, upgrade, monitoring, tuning, capacity planning. For a single-service deployment this is significant; for a multi-service platform it amortises across all services using the same etcd cluster.

Coordinator becomes a SPOF if mis-scoped

If the coordinator is used for more than just leadership locks — e.g., per-request coordination, sequencing, or metadata — it becomes a hard SPOF on the fast path. The pattern specifies "only for changing leadership" precisely to avoid this: the coordinator is on the cold path and failure is tolerable for short windows.
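
A hypothetical contrast of the mis-scoped and correctly scoped designs, using the etcd Go client (the key layout and names are illustrative):

```go
package routing

import (
	"context"
	"errors"
	"sync"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// Anti-pattern: the coordinator sits on the fast path. Every data-plane
// request pays an etcd round-trip, so losing etcd quorum stalls serving.
func leaderViaCoordinator(ctx context.Context, cli *clientv3.Client, shard string) (string, error) {
	resp, err := cli.Get(ctx, "/topology/leader/"+shard)
	if err != nil {
		return "", err
	}
	if len(resp.Kvs) == 0 {
		return "", errors.New("no leader recorded for shard " + shard)
	}
	return string(resp.Kvs[0].Value), nil
}

// Pattern: requests read a local cache; a background watcher refreshes it on
// the (rare) leadership change. etcd stays on the cold path.
type leaderCache struct {
	mu      sync.RWMutex
	leaders map[string]string // shard -> leader address
}

func newLeaderCache() *leaderCache {
	return &leaderCache{leaders: make(map[string]string)}
}

func (c *leaderCache) Leader(shard string) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.leaders[shard]
}

// watch applies leadership updates as they happen (deletes ignored for brevity).
func (c *leaderCache) watch(ctx context.Context, cli *clientv3.Client) {
	for resp := range cli.Watch(ctx, "/topology/leader/", clientv3.WithPrefix()) {
		c.mu.Lock()
		for _, ev := range resp.Events {
			shard := string(ev.Kv.Key[len("/topology/leader/"):])
			c.leaders[shard] = string(ev.Kv.Value)
		}
		c.mu.Unlock()
	}
}
```

In the second shape, an etcd outage freezes the routing table at its last known state — exactly the time-bounded independence described above.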

Lease management is application-level

Unlike fused designs, the application must implement lock-lease renewal explicitly. The elector holds an etcd lease, renews it periodically, and must handle revocation-on-expiry cleanly. Lock-loss protocols (what happens when the elector loses the lock mid-operation?) are application-level concerns rather than embedded in the protocol.
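
A minimal sketch of that application-level obligation, again with etcd's concurrency package: the Session renews the lease in the background, and the elector's remaining job is to notice loss and stop acting. stepDown is a hypothetical hook for abandoning in-flight revoke/establish work:

```go
package elector

import (
	"context"
	"errors"

	"go.etcd.io/etcd/client/v3/concurrency"
)

// watchLease blocks until the etcd lease is lost or ctx is cancelled.
func watchLease(ctx context.Context, sess *concurrency.Session, mu *concurrency.Mutex, stepDown func()) error {
	select {
	case <-sess.Done():
		// Lease expired or was revoked: the lock is gone and another
		// elector may already hold it. Stop acting as elector immediately.
		stepDown()
		return errors.New("etcd lease lost: lock no longer held")
	case <-ctx.Done():
		// Orderly shutdown: release the lock explicitly rather than
		// waiting out the TTL.
		_ = mu.Unlock(context.Background())
		return ctx.Err()
	}
}
```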

Canonical production instances

  • Vitess + VTOrc + etcd — the pattern's canonical instance. VTOrc acquires the etcd lock, runs the MySQL revoke/establish, updates the Vitess topology, releases the lock. "In Vitess, the current leader for each cluster is published through its topology, and a large number of workflows rely on this information to perform various tasks. Any operation that does not want the leader to change just has to obtain a lock before doing its work."
  • Kubernetes + etcd — control-plane components such as kube-controller-manager and kube-scheduler are the electors, taking leader-election leases through the API server; etcd holds both those leases and the cluster state. Similar shape: coordinator for locking + metadata, different substrate for workloads.
  • Google Chubby / ZooKeeper — original external-coordinator substrates. Chubby's original paper (Burrows, 2006) names the same cost-asymmetry argument: "loose consensus for coarse-grained locks rather than tight consensus for every operation."
  • HashiCorp Consul / Nomad — same pattern at the service-discovery altitude.

When to avoid

  • Small self-contained clusters where running two consensus systems is disproportionate overhead.
  • Hard-real-time leadership change SLAs where the extra RPC to the external coordinator is unacceptable.
  • Platforms without an existing coordinator — standing up etcd/ZooKeeper just for leadership locks is usually worse than implicit-lock Raft.

Sugu's posture: for large-scale production databases and control planes, the cost-asymmetry argument dominates, and the pattern wins. For everything else, implicit-lock Raft is usually simpler.
