PlanetScale — Consensus algorithms at scale: Part 7 - Propagating requests
Summary
Seventh instalment of Sugu Sougoumarane's Consensus algorithms at scale series on the PlanetScale blog (originally 2022-07-01; Sugu is Vitess co-creator, ex-YouTube, PlanetScale CTO). Part 7 is — by Sugu's own framing — the most difficult part of the series: "We have saved the most difficult part for last. This is where we put it all together." The task is to satisfy the series' core durability guarantee across a leadership change: propagate previously completed requests to satisfy the new leader's durability requirements. All preceding instalments (durability redefinition, the revoke/establish split, race handling, request completion) feed into this one concern. If propagation is wrong, the system can lose a completed request across a failover — the single failure mode the whole framework exists to prevent.
The post splits propagation into two regimes. The planned-change regime (lock-based systems, graceful demotion) is the easy case: the current leader is reachable, so "the current leader could ensure that its requests have reached all the necessary followers before demoting itself. Once this is done, the elector performs the leadership change and the system can resume." The leader satisfies the new durability requirement before the hand-off happens — canonical new patterns/graceful-propagation-before-demotion pattern. The failure regime is where the hard problems live: the elector indirectly revokes the previous leadership by fencing enough followers that the old leader can no longer meet its durability criteria. That fencing process simultaneously discovers previously-completed requests that must be propagated onto the new leader's follower set. The two jobs — revoke the old and discover the durable — share the same primitive (reaching a sufficient set of followers) but face seven distinct failure modes that make propagation "not as simple as it sounds."
The canonical failure modes Sugu enumerates: (a) a request may be incomplete — the elector may or may not discover it; (b) a discovered tentative request may be durable or not, and the elector cannot distinguish; (c) propagation of a request can itself fail before completion; (d) an elector that misses an incomplete request can elect a new leader that accepts a new conflicting request that also fails before completion; (e) a subsequent elector can then discover multiple incomplete conflicting requests; (f) an elector may propagate one of them as tentative and fail before marking it complete; (g) a final elector can discover both a durable propagated request and a newer incomplete conflicting one, and have insufficient information to pick correctly. The load-bearing insight: you cannot collapse all seven failure modes into a single per-request boolean "was this completed?" query — the elector's knowledge is fundamentally incomplete, and the protocol must tolerate that by encoding enough information on the wire to let later electors resolve the ambiguity.
The resolution is a single, elegant rule: every request carries a time-based version, and later versions supersede earlier ones. "It is safe to propagate the latest discovered decision. A decision to propagate a previous decision is a new decision." Canonical new concepts/request-versioning concept + patterns/version-per-request-to-resolve-conflicts pattern. The safety argument is that if an elector discovers two conflicting versions, the newer one's existence proves the older one never completed — because if the older one had completed durably, the elector that installed the newer one would necessarily have discovered the older one and propagated it instead. So "honour the newer version" is strictly safe. Paxos uses proposal numbers for this; Raft uses leadership term numbers; the underlying concept is the same and generalises beyond those two substrates (timestamps, GTID-plus-term-number tuples, leadership identity bits — any total-ordering mechanism works).
The post closes with a production shortcut that real large-scale MySQL systems (YouTube, Vitess) actually rely on: anti-flapping rules make the propagation race rare enough that full per-request versioning becomes operationally optional. "Most large-scale systems have anti-flapping rules that prevent a leadership from changing as soon as one was performed … Serendipitously, anti-flapping rules also mitigate the failure modes described above. Versioning of in-flight requests is less important for such systems." Canonical new concepts/anti-flapping concept. Worked production instance: MySQL's binlog carries GTIDs + timestamps for every transaction; those GTIDs get faithfully propagated to replicas (which violates the strict per-version-on-every-new-decision rule) but Orchestrator's built-in anti-flapping rules compensate — "This is the reason why organizations have been able to avoid split-brain scenarios while running MySQL at a massive scale." Vitess inherits the same safety properties through VTOrc (customised Orchestrator fork).
Key takeaways
- Propagation is the load-bearing final concern of consensus. Sugu canonicalises it as the hardest part of the framework: "Propagate previously completed requests to satisfy the new leader's durability requirements." (Source: sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-7-propagating-requests). Durability across a leadership change = the elector's ability to surface requests that were durable under the old leader onto the new leader's follower set, faithfully enough to meet the new leader's durability criterion. If propagation fails, a completed request can be lost across a failover — the single failure mode the entire framework exists to prevent.
- Two regimes: planned-change vs failure-driven. Lock-based + planned case = "we have the opportunity to request the current leader to demote itself. In this situation, the current leader could ensure that its requests have reached all the necessary followers before demoting itself. Once this is done, the elector performs the leadership change and the system can resume." Failure case = "the elector must indirectly revoke the previous leadership by requesting the followers to stop accepting any more requests from that leader. If enough followers are reached such that the previous leader cannot meet the durability criteria for any more requests, we know that the revocation is successful." Two separate mechanisms for the same invariant; the failure-driven mechanism carries all the protocol weight because the planned-case mechanism cannot be relied on (leaders sometimes crash).
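The planned-change regime can be sketched in a few lines: the demoting leader replicates every pending request to a durable quorum before stepping down. This is a minimal illustrative sketch, not Vitess or Orchestrator code; `Leader`, `Follower`, and `quorum_size` are assumed names.

```python
# Illustrative sketch of graceful demotion: flush pending requests to a
# durable quorum of followers, then allow the elector to change leadership.

class Follower:
    def __init__(self):
        self.log = []

    def append(self, request):
        self.log.append(request)

class Leader:
    def __init__(self, followers, quorum_size):
        self.followers = followers
        self.quorum_size = quorum_size  # durability criterion: acks needed
        self.pending = []

    def graceful_demote(self):
        """Planned change: replicate all pending requests until each meets
        the durability criterion, then step down."""
        for request in self.pending:
            acks = 0
            for f in self.followers:
                f.append(request)
                acks += 1
                if acks >= self.quorum_size:
                    break
            if acks < self.quorum_size:
                raise RuntimeError("cannot demote: durability unmet")
        self.pending = []
        return True  # now safe for the elector to perform the change
```

If the durability criterion cannot be met, demotion aborts and the system must fall back to the failure-driven path.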
- Revocation and propagation share a primitive. The same "reach enough followers to stop the old leader from meeting durability" fencing operation also exposes all previously-completed requests — "This method, apart from guaranteeing that no further requests will be completed by that leader, also allows us to discover all requests that were previously completed." Canonical design datum: the fencing and discovery operations can be co-designed as one protocol step rather than two. This composition is what makes the Raft/Paxos majority-quorum approach work at all — reaching a quorum of followers revokes and discovers at once.
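The shared primitive can be sketched as one round that both fences and discovers. The names are purely illustrative, assuming each follower exposes an `accepting` flag and a `log`:

```python
# Sketch of the shared primitive: one fencing round both revokes the old
# leadership (followers stop accepting its writes) and returns each
# follower's log, which surfaces every previously completed request.

def fence_and_discover(followers, revocation_quorum):
    """Fence followers until the old leader can no longer reach a durable
    quorum; collect their logs in the same round trip."""
    discovered = set()
    fenced = 0
    for f in followers:
        f.accepting = False          # revoke: refuse the old leader
        discovered.update(f.log)     # discover: surface its requests
        fenced += 1
        if fenced >= revocation_quorum:
            break
    if fenced < revocation_quorum:
        raise RuntimeError("elector blocked: cannot revoke leadership")
    return discovered
```

Note the elector stops once the revocation threshold is met, which is exactly why its knowledge of incomplete requests can be partial.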
- Seven failure modes of propagation canonicalised. Sugu enumerates them verbatim: "There may be a request that is incomplete. In this situation, the elector may or may not discover this request … An elector that discovers a tentative request may not be able to determine if that request has become durable … Propagation of a request can fail before completion … An elector that does not discover an incomplete request could elect a new leader that accepts a new request, which may fail before completion … A subsequent elector may discover such multiple incomplete requests … Another elector may discover only one of the incomplete requests, may propagate it as tentative, and fail before marking it as complete … A final elector can discover this durable request, and a newer conflicting incomplete request, and may not have enough information to know which one to honor." The enumeration is exhaustive because each failure mode is a distinct interaction between elector knowledge (complete vs partial discovery) and request lifecycle (incomplete vs tentative vs durable). Canonical concepts/incomplete-request concept + the observation that no single-boolean "is this request durable?" query can resolve the space — the protocol must carry sufficient on-wire state.
- Elector capabilities invariant. "An elector must be able to reach a sufficient number of followers to revoke the previous leadership. If this is not possible, the elector is blocked." Conversely, "An elector need not (and may not be able to) reach all the followers of a leader." The lower bound is the revocation-durability threshold; the upper bound is unbounded (no requirement to contact all followers). This gap is what creates the seven failure modes — the elector's knowledge is structurally incomplete and the protocol has to be correct in spite of that.
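For the common majority-quorum instantiation, the guarantee behind the lower bound is quorum intersection: any revocation majority overlaps any durability majority, so fencing a majority necessarily reaches at least one follower of every durable request. A small brute-force check, illustrative rather than from the post:

```python
# Demonstrates that over n followers, any two majority quorums intersect,
# so a revocation majority always sees at least one follower of every
# request that became durable on a majority.
from itertools import combinations

def majorities(n):
    nodes = range(n)
    k = n // 2 + 1  # majority size
    return [set(c) for c in combinations(nodes, k)]

def always_intersect(n):
    qs = majorities(n)
    return all(a & b for a in qs for b in qs)
```

This holds because two sets of size n//2 + 1 cannot be disjoint inside n nodes; non-majority quorum schemes need an equivalent intersection argument of their own.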
- Four guaranteed inferences. Sugu's load-bearing reasoning base: "An elector is guaranteed to find all requests that have become durable." (The durability threshold and the revocation threshold are linked; by definition, reaching enough followers to revoke does reach at least one follower of every durable request.) "If a request was incomplete, an elector may not find it. If not found, it is free to move forward without that request. When that request is later discovered, it must be canceled." (Not-found-incomplete → move on; later-discovered → cancel.) "If an elector discovers an incomplete request, it may not have sufficient information to know if that request was actually durable or complete. Therefore, it has to assume that it might have completed, and attempt to propagate it." (When in doubt, propagate — the safer direction.) "If an elector discovers an incomplete request and can determine with certainty that it was incomplete, it can choose either option: act as if it was discovered, or not discovered." (Certain-incomplete → either choice safe.)
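The four inferences collapse into a small per-request decision function. This is a hypothetical encoding with made-up state names, not a published API:

```python
# Hypothetical encoding of the four inferences as an elector's decision
# over one discovered (or undiscovered) request.

DURABLE = "durable"                    # known durable
INCOMPLETE_UNKNOWN = "unknown"         # incomplete, durability ambiguous
INCOMPLETE_CERTAIN = "certainly-incomplete"

def decide(discovered, state):
    """Return the elector's action for one request.
    - Durable requests are always discovered and must be propagated.
    - An unfound incomplete request may be skipped; if it surfaces later,
      it must be cancelled.
    - Found but ambiguous: assume it completed and propagate (the safe
      direction).
    - Provably incomplete: either choice is safe; this sketch drops it."""
    if not discovered:
        return "skip-and-cancel-if-seen-later"
    if state in (DURABLE, INCOMPLETE_UNKNOWN):
        return "propagate"
    if state == INCOMPLETE_CERTAIN:
        return "drop"  # equally valid: propagate it anyway
```

The asymmetry is deliberate: wrongly propagating an incomplete request is recoverable under versioning, while wrongly dropping a durable one is the exact loss the framework exists to prevent.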
- The resolution rule: time-based versioning. "It is safe to propagate the latest discovered decision. A decision to propagate a previous decision is a new decision." Canonical concepts/request-versioning concept. Four-step protocol: "(1) Every request has a time-based version. (2) A leader will create its request using a newer version than any previous request. (3) An elector that chooses to propagate an incomplete request will do so under a new version. (4) An elector that discovers multiple conflicting requests must choose to propagate the latest version." The subtle and load-bearing rule is (3) — propagating an old request assigns it a new version rather than preserving the original. Under rule (3), two elector-propagations of the same old request are distinguishable from each other and from the original, which is what keeps the conflict-resolution invariant sound.
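The four-step protocol can be sketched directly. All names are illustrative; `version` stands in for a term, proposal number, or timestamp:

```python
# Sketch of the four-step versioning protocol. The key subtlety is rule (3):
# an elector re-issues a propagated request under a NEW version, so later
# electors can always totally order conflicting discoveries.
from dataclasses import dataclass

@dataclass(frozen=True)
class Versioned:
    version: int   # rule (1): time-based version on every request
    payload: str

def new_leader_request(payload, highest_seen):
    # Rule (2): a leader creates requests above any version it has seen.
    return Versioned(highest_seen + 1, payload)

def propagate(incomplete, highest_seen):
    # Rule (3): propagation re-issues the payload under a new version,
    # never under the original one.
    return Versioned(highest_seen + 1, incomplete.payload)

def resolve(conflicting):
    # Rule (4): among conflicting discoveries, the latest version wins;
    # the newer one's existence proves the older never completed.
    return max(conflicting, key=lambda v: v.version)
```

A later elector that sees both the original and a propagated copy of the same payload honours the copy, which is exactly the "a decision to propagate a previous decision is a new decision" rule.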
- "Completed requests do not need versioning." Structural simplification: once a request is known to be durable + complete, no further elector will conflict with it. Only incomplete requests need the version tag. This is the concrete efficiency payoff of separating the completion and propagation concerns.
- Two corner-case arguments for the versioning rule. "If we discover two conflicting requests, it means that the latest request was created because the previous elector did not discover the old one. This essentially means that the old one definitely did not complete. So, it is safe to honor the new elector's decision." The existence of the newer version is itself evidence that the older one never completed durably — a sound inference because a durable old request would have been discovered by the elector that installed the newer one. "If we propagate an existing request, it is also under a new version. It will therefore need to satisfy durability requirements under the new version without conflating itself with the old version." This is the subtle reason propagation-under-a-new-version works: the propagated request now has to re-earn durability under its new version; if propagation fails, a later elector sees only the original-version's incomplete state, not the propagated-version's incomplete state.
- Paxos proposal numbers and Raft term numbers are the canonical implementations. "Paxos uses proposal numbers to version its decisions, and Raft uses leadership term numbers." Canonical wiki datum: the two protocols' most-discussed primitive — proposal numbers for Paxos, term numbers for Raft — is the versioning mechanism from this exact rule, specialised to their respective race-resolution layers. Reading the series backwards: concepts/proposal-number was introduced in Part 5 at the race-resolution layer (newer-proposal-number elector wins); here it reappears at the request-propagation layer (newer-version request wins). Same mechanism, two different layers.
- Alternative version mechanisms are valid. "But you can use other methods for versioning. For example, one could assign timestamps for the requests instead of using leadership terms or proposal numbers." Canonical flexibility: any total-ordering mechanism works (Lamport timestamps, hybrid logical clocks, leadership-epoch-plus-sequence tuples, true wall-clock with sufficient skew margin). The mechanism is orthogonal to the protocol; the specific choice is a performance/clock-sensitivity trade-off.
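Any totally ordered version type slots into the same protocol. Here is a sketch of one alternative, a leadership-epoch-plus-sequence tuple compared lexicographically; this is an assumed design for illustration, not one the post specifies:

```python
# An alternative version type: (epoch, seq) compared lexicographically,
# so any leadership change dominates all requests of earlier epochs.
from dataclasses import dataclass
from functools import total_ordering

@total_ordering
@dataclass(frozen=True)
class EpochSeq:
    epoch: int   # bumped on every leadership change
    seq: int     # per-leader counter, reset each epoch

    def __lt__(self, other):
        # Lexicographic order gives the total order the protocol needs.
        return (self.epoch, self.seq) < (other.epoch, other.seq)
```

Because the comparison is a total order, the "propagate the latest version" rule works unchanged; the trade-off versus raw timestamps is that no clock-skew margin is needed, at the cost of maintaining the epoch counter.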
- Anti-flapping canonicalised as the production shortcut. "Most large-scale systems have anti-flapping rules that prevent a leadership from changing as soon as one was performed. This is because such an occurrence is usually due to a deeper underlying problem, and performing another leadership change will likely not fix it. And in most cases, it would aggravate the underlying problem." Canonical new concepts/anti-flapping concept. The anti-flapping window serves two purposes: (1) prevents the failure-loop where each leadership change triggers the next; (2) prevents the propagation-race failure modes from firing by serialising leadership changes far enough apart that an in-flight elector has time to finish before a successor one starts. Canonical production-debugging anecdote: "In one of the systems that I knew of, the payload of the request was so big that it was causing the transmission to timeout. This resulted in a failure being detected and caused a leadership change. However, the new leader was also incapable of completing the request due to the same underlying problem. The problem was ultimately remedied by increasing the timeout." The deeper-underlying-problem framing is the reason anti-flapping exists; the propagation-race-mitigation is the serendipitous second-order effect.
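An anti-flapping rule reduces to a cool-down check before any leadership change. A minimal sketch; the window length and API are invented here, and Orchestrator's real rules are richer:

```python
# Minimal anti-flapping sketch: refuse a leadership change inside a
# cool-down window of the previous one.
import time

class AntiFlap:
    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock          # injectable for testing
        self.last_change = None

    def may_change_leadership(self):
        now = self.clock()
        if self.last_change is not None and now - self.last_change < self.cooldown:
            return False  # likely a deeper underlying problem; don't flap
        self.last_change = now
        return True
```

The cool-down is what serialises leadership changes far enough apart that an in-flight elector can finish before a successor starts, which is the second-order effect the post calls serendipitous.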
- MySQL binlogs carry version metadata natively. "The MySQL binlogs contain metadata about all transactions. They carry two pieces of relevant information: (1) A Global Transaction ID (GTID), which includes the identity of the leader that created the transaction. (2) A timestamp. This metadata is faithfully propagated to all replicas. This information is sufficient to resolve most ambiguities if conflicting transactions are found due to failures." Canonical worked instance: MySQL's GTID + timestamp is the per-request version metadata the protocol requires. The MySQL replication substrate already satisfies the version-per-request requirement out of the box — no custom protocol layer needed. Sugu flags the caveat: "However, the faithful propagation of the transaction metadata breaks the versioning rule that the decision of a new elector must be recorded under a new timestamp." MySQL-on-its-own would fail rule (3) because replicated GTIDs carry the original leader's identity + timestamp, not the new leader's. The anti-flapping layer compensates.
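A sketch of how that binlog metadata can arbitrate a conflict: compare the originating leader identity (from the GTID) and the timestamp. The field names and the tie-breaking heuristic are illustrative assumptions, not MySQL's or Orchestrator's actual logic, and per the post's caveat it is the anti-flapping layer that makes such heuristics safe in practice:

```python
# Illustrative conflict arbitration from binlog-style metadata: GTID
# (originating server identity + sequence) plus a commit timestamp.
from dataclasses import dataclass

@dataclass(frozen=True)
class BinlogTxn:
    server_uuid: str   # GTID source: which leader created the transaction
    gtid_seq: int      # GTID transaction number on that leader
    timestamp: float   # commit timestamp

def pick_survivor(a, b):
    """Prefer the later transaction: same-leader conflicts order by GTID
    sequence, cross-leader conflicts by timestamp."""
    if a.server_uuid == b.server_uuid:
        return a if a.gtid_seq > b.gtid_seq else b
    return a if a.timestamp > b.timestamp else b
```

The cross-leader branch is exactly where faithful GTID propagation breaks rule (3): the metadata still names the original leader, so timestamp comparison carries the whole decision, and anti-flapping keeps such comparisons rare.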
- Orchestrator is the canonical external anti-flapping + consensus wrapper for MySQL. "The Orchestrator, which is the most popular leadership management system for MySQL, has built-in anti-flapping rules. These rules mitigate the above failure modes. This is the reason why organizations have been able to avoid split-brain scenarios while running MySQL at a massive scale." Canonical new Orchestrator system page. VTOrc inherits this: "In Vitess, we use VTorc, which is a customized version of the Orchestrator, and we inherit the same safeties. But we also intend to tighten some of these corner cases to minimize the need for humans to intervene if complex failures ever happen to occur." Canonical wiki datum: VTOrc is an Orchestrator fork — an architectural-lineage disclosure the prior parts didn't expose. Vitess's consensus story is therefore a composition: Orchestrator's anti-flapping + MySQL GTID/timestamps + etcd lock + VTOrc's tightened corner cases = a lock-based consensus stack that satisfies the durability-propagation invariant at large scale without running Paxos or Raft directly.
- Architectural summary: propagation is the most complex concern in consensus in theory, but large-scale MySQL systems sidestep its full complexity in practice via anti-flapping rules that make the seven-failure-mode space rare enough that the MySQL-native version metadata (GTID + timestamp) is sufficient. The "pure" protocol (per-request versioning enforced on every elector action) remains valid but is seldom worth paying for — another instance of the series' running theme that real systems make engineering trade-offs the textbook algorithms don't acknowledge.
Canonical new pages created
Concepts
- concepts/request-propagation — the load-bearing final concern of consensus: satisfying durability across a leadership change by surfacing requests durable under the old leader onto the new leader's follower set. Two regimes (planned vs failure). Failure-driven regime composes revocation and discovery into one protocol step.
- concepts/incomplete-request — a request for which the elector cannot determine durability; the central concept around which the seven failure modes are organised. Sugu's four-inference reasoning base (guaranteed-to-find-durable, not-found-incomplete-is-safe-to-drop-then-cancel-later, found-incomplete-assume-durable-and-propagate, certain-incomplete-either-choice-safe) is the canonical handling rule.
- concepts/request-versioning — time-based version attached to every in-flight request. Generalises Paxos proposal numbers + Raft term numbers to a single concept, applied here at the request layer (Part 5's proposal-number sat one level up, at the election layer). Four-step protocol rule; completed requests don't need versioning.
- concepts/anti-flapping — rate-limiting rule on leadership changes. Primary purpose is failure-loop avoidance ("leadership change usually due to a deeper underlying problem"); serendipitous second-order effect is propagation-race mitigation. Sugu's canonical framing: "Versioning of in-flight requests is less important for such systems."
Patterns
- patterns/version-per-request-to-resolve-conflicts — attach a time-based version to every request; later versions supersede earlier; propagation assigns a new version rather than preserving the original. The core resolution pattern for the seven propagation failure modes. Paxos/Raft instances, plus timestamp and MySQL GTID variants.
- patterns/graceful-propagation-before-demotion — planned-change pattern where the current leader ensures all requests have reached the new leader's required followers before stepping down. Extends the Part-4 patterns/graceful-leader-demotion pattern with the propagation-completion invariant. Works only if the current leader is reachable; not a substitute for the failure-driven path.
- patterns/external-metadata-for-conflict-resolution — replicate per-transaction metadata (ID + timestamp + leader identity) alongside the transaction to give later coordinators enough information to resolve ambiguities without protocol round-trips. MySQL GTID + timestamp + Orchestrator-based anti-flapping is the canonical worked instance.
Systems
- Orchestrator — canonical new system page. The most popular MySQL leadership-management system (openark/orchestrator). Ships with built-in anti-flapping rules that compensate for MySQL binlog's faithful-replication-of-GTIDs (which would otherwise break the per-request-new-version rule). The anti-flapping primitive is what allows MySQL at massive scale to avoid split-brain. VTOrc is a customised fork of Orchestrator.
- VTOrc — canonical new system page (previously referenced from the Part 5 source but with no dedicated page). Vitess's elector: a customised fork of Orchestrator with the same anti-flapping safeties plus tightened corner cases. Typically one VTOrc per region; multiple VTOrcs per cluster participate in leadership-change races resolved via an etcd-backed lock.
Relationship to prior instalments
- Part 4 (Establishment and revocation) introduced the revoke/establish split and the graceful-demotion path; Part 7 extends graceful demotion with the pre-demotion-propagation step.
- Part 5 (Handling races) canonicalised the lock-based vs lock-free taxonomy and the proposal-number concept at the election layer; Part 7 reuses the same versioning idea at the request layer.
- Part 6 (Completing requests) (not yet ingested) covered request completion as a precursor to propagation; Part 7 picks up from there.
- Part 8 (Closing thoughts) (not yet ingested) pulls everything together and concludes the series.
Caveats
- No production numbers — no split-brain incident count, no propagation-failure rates, no anti-flapping window tuning guidance in time units, no measured latency for graceful-demotion-with-propagation vs failure-driven path.
- VTOrc "tightened corner cases" are asserted but not disclosed — "we also intend to tighten some of these corner cases" — the intent is canonicalised but the actual tightening is future work as of publication.
- No formal specification of the propagation protocol — failure modes are enumerated in prose; the four guaranteed inferences are stated as facts without formal proof.
- MySQL-GTID-breaks-rule-(3) caveat is noted but not quantified — how often the anti-flapping window fails to compensate, whether there are worked historical incidents, and what the residual-risk envelope looks like, are all left implicit.
- Orchestrator is named but not architecturally disclosed — the post cites the anti-flapping feature without walking through the implementation. The systems/orchestrator wiki page created here is a stub based on this post's framing + external GitHub link; future ingests should enrich.
- Request-completion (Part 6) is referenced as a precursor but not summarised, so the "tentative" request lifecycle state this post depends on is partially canonicalised here and should be re-anchored when Part 6 is ingested.
- Timestamp-based versioning caveat on clock skew (from Part 5's frame) is not re-stated — the post gestures at timestamps as a valid alternative to proposal/term numbers but doesn't re-surface the clock-skew margin discussion.
- Seven failure modes are enumerated but not individually walked through with worked examples — the enumeration is complete but each mode is a one-sentence description; operators debugging a real failure would need to synthesise the reasoning themselves.
Source
- Original: https://planetscale.com/blog/consensus-algorithms-at-scale-part-7
- Raw markdown:
raw/planetscale/2026-04-21-consensus-algorithms-at-scale-part-7-propagating-requests-7ad8f096.md
Related
- systems/vitess
- systems/vtorc
- systems/orchestrator
- systems/mysql
- systems/raft
- concepts/request-propagation
- concepts/incomplete-request
- concepts/request-versioning
- concepts/anti-flapping
- concepts/revoke-and-establish-split
- concepts/elector
- concepts/proposal-number
- concepts/gtid-position
- concepts/split-brain
- patterns/version-per-request-to-resolve-conflicts
- patterns/graceful-propagation-before-demotion
- patterns/external-metadata-for-conflict-resolution
- patterns/lock-based-leader-election
- patterns/lock-free-leader-election
- companies/planetscale