PlanetScale - Consensus algorithms at scale: Part 6 - Completing requests
Summary
Sixth instalment of Sugu Sougoumarane's Consensus algorithms at scale series on the PlanetScale blog (originally 2022-06-21; Sugu is the co-creator of Vitess, ex-YouTube, PlanetScale CTO). Having separated revocation from establishment in part 4 and the lock-based vs lock-free race-resolution axis in part 5, part 6 tackles the per-request completion protocol: once a leader is stable, how does a single client write become durable without ever being forgotten and without being committed before durability is met? Sugu's answer is a two-phase protocol. The leader first transmits the payload as a tentative request to all nodes. Followers responsible for the leader's durability ack receipt. Once the leader gathers enough acks to satisfy the durability requirement, the request is durable — an implicit middle stage that cannot be cancelled. The leader then sends a completion message telling followers to materialise the effect. A request that fails to reach durability may be retried indefinitely by the leader, or cancelled by a subsequent elector during a leadership change — but never both.
The load-bearing insight is the three-stage request lifecycle (incomplete → durable → complete) with an explicit mutual-exclusion invariant between completion and cancellation. The tentative marker gives followers the vocabulary to delete abandoned work cleanly; the durable stage gives the leader a safe point to respond to the client; the completion message gives followers the trigger to materialise effects. A small but important optimisation Sugu names: once a request is durable, the leader may skip the tentative step for followers that have not yet received the message and send complete directly, saving one round per lagging follower. The post also canonicalises the client-ack-timing trade-off: responding on durability saves one round-trip, while waiting until the completion message has been sent costs an extra round-trip but improves the performance of quorum reads. Lock-based systems that read from the current leader avoid the trade-off and can respond at durability. The post closes with a blunt criticism of the MySQL semi-sync protocol: it lacks this two-phase separation ("a replica receives a request, it immediately applies it"), and a crashed primary restarting without verifying durability of in-flight work can produce split-brain. Both behaviours are called out as concrete production hazards that the two-phase shape was designed to avoid.
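The lifecycle and its mutual-exclusion rule can be sketched as a small state machine. This is a minimal illustration in Python, not code from the post: the `Stage` enum, the `Request` class, the explicit `CANCELED` state, and the `required_acks` count are assumptions made for the example, and a single object stands in for state that a real system spreads across leader, followers, and electors.

```python
from enum import Enum


class Stage(Enum):
    INCOMPLETE = "incomplete"  # tentative on followers; may still be completed or canceled
    DURABLE = "durable"        # implicit stage: enough acks gathered, can never be canceled
    COMPLETE = "complete"      # followers may materialise the effect
    CANCELED = "canceled"      # discarded by an elector during a leadership change


class Request:
    """Hypothetical bookkeeping record for one request."""

    def __init__(self, payload, required_acks):
        self.payload = payload
        self.required_acks = required_acks
        self.acked_by = set()
        self.stage = Stage.INCOMPLETE

    def record_ack(self, follower_id):
        self.acked_by.add(follower_id)
        # Durability is a property of the ack count, not a message of its own.
        if self.stage is Stage.INCOMPLETE and len(self.acked_by) >= self.required_acks:
            self.stage = Stage.DURABLE

    def complete(self):
        # Only a durable request may be completed; repeating it is harmless.
        if self.stage not in (Stage.DURABLE, Stage.COMPLETE):
            raise RuntimeError(f"cannot complete a request that is {self.stage.value}")
        self.stage = Stage.COMPLETE

    def cancel(self):
        # Only a request that never became durable may be canceled.
        if self.stage not in (Stage.INCOMPLETE, Stage.CANCELED):
            raise RuntimeError(f"cannot cancel a request that is {self.stage.value}")
        self.stage = Stage.CANCELED


req = Request(payload="x = 1", required_acks=2)
req.record_ack("follower-a")
req.record_ack("follower-b")  # the second ack makes the request durable
req.complete()                # allowed; req.cancel() would now raise
```

Because `cancel` is only reachable from the incomplete stage and `complete` only from the durable stage, no sequence of calls in this sketch can both complete and cancel the same request.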
Key takeaways
- The primary requirement of a consensus system is that it must not forget an acknowledged request. "The primary requirement of a consensus system is that it must not forget a request it has acknowledged as accepted." (Source: sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-6-completing-requests). This single invariant drives the entire two-phase shape: the cancellation path exists to handle requests that haven't yet been acknowledged, not those that have.
- A two-phase protocol separates receipt from materialisation. "The leader first transmits the payload of the request as tentative to all the nodes. A tentative request is one that can later be completed or canceled. A follower that is responsible for a leader's durability should acknowledge receipt of tentative requests. Once the leader receives the necessary acknowledgements from its followers, the request has become durable and cannot be canceled. The leader can then issue messages to complete the tentative request." The tentative state is the affordance that lets followers hold received-but-not-yet-durable work in a form that is cheap to discard.
- A request has three stages, one of them implicit. "1. Incomplete: The request is in-flight and has not met the durability requirements. Such a request is marked as tentative among the followers. It can later be completed or canceled. 2. Durable: The request is in-flight, but has met the durability requirements. This is an implicit, but important stage. We can trust that a durable request will never be canceled. 3. Complete: The request is complete. Followers can mark the request as complete and perform any post-completion materializations as needed." The implicit durable stage is the safety bedrock: it exists as a property of the state of the system even if no single message carries the name.
- Optimisation: once durable, skip tentative for lagging followers. "Once a request becomes durable, the leader is free to transmit that request as complete to followers that have not yet received the message as a single step." Saves one round-trip per lagging follower; safe because once the request is durable, no elector can cancel it, so the tentative-state affordance is no longer needed.
- A leader that can't reach durability must keep retrying, not cancel. "A leader that fails to make a request durable can keep retrying, but it must not attempt to cancel the request. This is because an elector could be performing a leadership change and may attempt to propagate the failed request." Cancellation is the elector's prerogative during a leadership change (part 7 territory), not the leader's during normal operation. Two parties touching cancellation would race.
- Completion and cancellation are mutually exclusive. "Completion and cancellation are mutually exclusive: A request that was completed will never be canceled, and a request that was canceled will never be completed." Canonical wiki invariant: this is the safety property of the two-phase protocol, and it is what gives clients a semantic foothold for their own retry / idempotency logic.
- Follower actions at each transition are symmetric and cheap. "When a follower receives a message to complete a request, it can perform the necessary steps to materialize the effects of the request. For example, if the request was meant to change the value of a variable, this change can now be applied. If a cancellation message is received, the follower can delete the request, as if it never took place." Completion = apply-effect. Cancellation = delete-tentative-record. Both are local and idempotent.
- The client-ack timing trade-off: early ack (durability) vs late ack (completion). "A leader could respond to the client with a success message as soon as it has become durable. However, it has the option of delaying the acknowledgement until it has also sent the completion message to the followers. Waiting until completion costs two round-trips and is therefore slower than an early response. On the other hand, it improves the performance of quorum reads." Early ack = one-RT latency but slower quorum reads. Late ack = two-RT latency but faster quorum reads. This is where the consistent-read strategy of the system feeds back into commit-path design.
- Lock-based systems with leader leases dodge the trade-off. "For systems that use lock-based failovers, reads can be sent to the current leader instead of performing quorum reads. This allows for the leader to respond as soon as it has received the necessary acknowledgements." Under a lock-based election with a leader lease, leader-local reads are consistent without a quorum round-trip, which means the client-ack can happen at the durable stage without paying the quorum-read tax. This is the cross-instalment payoff: the commit-path and the read-path optimisations compose.
- MySQL semi-sync lacks the two-phase shape. "The MySQL semi-sync protocol does not support this two-phase method of completing requests. When a replica receives a request, it immediately applies it." The replica has no tentative state: it applies on receive. This "introduces some corner cases that require mitigation"; Sugu points readers to his older post Distributed durability in MySQL for the catalogue.
- MySQL's restart-after-crash behaviour is a split-brain hazard. "A primary that is restarted after a crash completes all in-flight requests without verifying that they received the necessary acks. This could lead to 'split-brain' scenarios." The two-phase shape would catch this: a restarting primary would see in-flight work as tentative and would need an explicit durability check before completing; MySQL's shape skips that gate and can therefore surface writes that were never durable to reads issued after the restart. Named as a concrete production hazard on the wiki for the first time.
- The next post will cover request propagation during a leadership change. "In the next post, we will look at request propagation, which will tie all of this together." The tentative-marker + mutual-exclusion invariant is the precondition for the elector (part 5) to safely propagate in-flight tentative requests during a leadership change (part 7, not yet ingested).
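The follower-side actions described in the takeaways above (hold tentative work, apply on completion, delete on cancellation) can be sketched in a few lines. This is an illustration under stated assumptions rather than anything from the post: the `Follower` class, its handler names, the `"ack"` return value, and the key/value payload shape are all hypothetical.

```python
class Follower:
    """Hypothetical follower; message and field names are illustrative."""

    def __init__(self):
        self.tentative = {}     # request_id -> (key, value), held but not applied
        self.completed = set()  # request ids whose effects were materialised
        self.data = {}          # the materialised key/value state

    def on_tentative(self, request_id, key, value):
        # Store the payload without applying it, then acknowledge receipt.
        if request_id not in self.completed:
            self.tentative[request_id] = (key, value)
        return "ack"

    def on_complete(self, request_id, payload=None):
        if request_id in self.completed:
            return  # duplicate completion message: nothing left to do
        # A follower that never saw the tentative message may receive the
        # payload together with the completion (the single-step optimisation).
        entry = self.tentative.pop(request_id, payload)
        if entry is None:
            raise KeyError(f"no payload known for request {request_id}")
        key, value = entry
        self.data[key] = value
        self.completed.add(request_id)

    def on_cancel(self, request_id):
        # Delete the tentative record as if the request never took place.
        self.tentative.pop(request_id, None)


follower = Follower()
follower.on_tentative(1, "x", 10)  # held, not yet visible in follower.data
follower.on_complete(1)            # now follower.data contains x
```

Both `on_complete` and `on_cancel` tolerate repeated delivery in this sketch, which is one way to get the idempotence the takeaway attributes to follower actions.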
Systems extracted
- Vitess — the sharded-MySQL substrate behind PlanetScale. The series argues Vitess-style operational systems benefit from the two-phase protocol + leader-lease combination even though Vitess itself runs on top of MySQL semi-sync, whose lack of the two-phase shape is the concrete hazard Sugu is warning about.
- MySQL — extended with a new Seen-in disclosure. The split-brain hazard on MySQL semi-sync is called out explicitly: the replica applies-on-receive (no tentative state), and the primary's restart-after-crash protocol completes in-flight work without re-checking durability.
- Raft — implicitly referenced via the series framing; Raft bundles the two-phase shape inside its log-replication protocol (entries are appended before being committed, and committed entries are applied) but does not expose a separable tentative/complete vocabulary at the protocol API level.
- Paxos — implicitly referenced via the series framing; the prepare/accept split has a similar two-phase shape at the leadership layer, but per-request commit in Multi-Paxos shares the same implicit-durable structure.
Concepts canonicalised
- concepts/tentative-request — new canonical page. A request that has been transmitted by the leader to followers but has not yet met the durability requirement. Can later be completed or cancelled. The affordance that makes safe cancellation of abandoned work possible.
- concepts/durable-request — new canonical page. A request that has met the durability requirement. This is an implicit stage — no single message carries the name, but the system-state property is real and load-bearing: a durable request can never be cancelled.
- concepts/two-phase-completion-protocol — new canonical page. The "tentative → durable → complete" lifecycle with tentative and completion as two explicit follower-visible transitions and durable as the implicit middle. Distinct from two-phase commit (the transactional-atomicity protocol) — similar shape, different invariant.
- concepts/request-cancellation — new canonical page. The elector's prerogative during a leadership change to invalidate a non-durable in-flight request. Mutually exclusive with completion.
- concepts/mysql-semi-sync-split-brain — new canonical page. The specific production hazard Sugu calls out: MySQL semi-sync replicas apply-on-receive (no tentative state), and a restarting primary completes in-flight requests without verifying durability. Both behaviours can surface writes that were never durable to subsequent reads.
- concepts/quorum-read — new canonical page. A read that probes a majority of followers to find the latest durable value. Required for consistent reads in systems without a stable leader / lease.
- concepts/consistent-read — existing page, extended. Part 6 gives the commit-path perspective: consistent reads can be delivered either by waiting for completion on the commit-path (so quorum reads are cheap) or by taking a leader lease (so leader-local reads are cheap). The commit-path and read-path optimisations interact.
- concepts/leader-lease — existing page, extended. Part 6 gives the cross-coupling to commit-path design: a lease holder can respond to clients at the durable stage instead of waiting for completion, saving one round-trip without losing consistency.
- concepts/forward-progress — existing page, extended. Part 6's retry-not-cancel rule for a leader that can't reach durability is an additional forward-progress sub-invariant: retries must be bounded only by the leader's own lifetime, with cancellation deferred to electors during leadership changes.
- concepts/split-brain — existing page, extended. Part 6 adds the commit-path instance: a primary that completes unverified in-flight requests on restart can produce split-brain writes — distinct from the leadership-election split-brain the page already covered.
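The gate whose absence the split-brain concept pages describe can be illustrated with a small sketch. It assumes the restarted node is still the leader and can ask followers how many of them hold each in-flight request; `recover_in_flight` and the `count_acks` callback are hypothetical names, and nothing here is a real MySQL or Vitess API.

```python
def recover_in_flight(in_flight, count_acks, required_acks):
    """Partition in-flight request ids after a restart.

    count_acks is a hypothetical callback reporting how many followers hold
    the request. Requests below the threshold stay tentative: the restarted
    leader may retry them but must not complete or cancel them itself.
    """
    safe_to_complete, still_tentative = [], []
    for request_id in in_flight:
        if count_acks(request_id) >= required_acks:
            safe_to_complete.append(request_id)  # verified durable
        else:
            still_tentative.append(request_id)   # unverified; keep tentative
    return safe_to_complete, still_tentative


acks_seen = {"r1": 2, "r2": 0}
safe, pending = recover_in_flight(["r1", "r2"], lambda r: acks_seen[r], required_acks=2)
```

Completing everything in `in_flight` without the threshold check is the restart behaviour the post warns about.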
Patterns canonicalised
- patterns/two-phase-tentative-then-complete — new canonical page. Leader sends payload as tentative → followers ack → leader waits for durability acks → leader sends complete → followers materialise. The general consensus-commit shape that safely separates receipt from materialisation while preserving the mutual exclusion between completion and cancellation.
- patterns/skip-completion-for-late-followers — new canonical page. Once a request is durable, the leader may skip the tentative step for followers that have not yet received the message and send complete directly. Saves one round-trip per lagging follower.
- patterns/early-ack-on-durability — new canonical page. Responding to the client as soon as durability is met (not waiting for completion) cuts latency by one round-trip but forces quorum reads unless a leader lease is in place.
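The three patterns above compose into a single leader-side round, sketched here as a function that returns the ordered events of one attempt. It is a simplified model, not the post's implementation: `leader_round`, the event tuples, the `reached` set of followers that acked the tentative message, and the `ack_client_on` switch are all assumptions made for illustration.

```python
def leader_round(followers, reached, required_acks, ack_client_on="durability"):
    """Return the ordered events of one attempt at completing a request.

    followers: all follower ids. reached: ids that received and acked the
    tentative message. All names here are hypothetical.
    """
    if ack_client_on not in ("durability", "completion"):
        raise ValueError("ack_client_on must be 'durability' or 'completion'")
    events = [("tentative", f) for f in followers]
    acks = [f for f in followers if f in reached]
    if len(acks) < required_acks:
        # Not durable: the leader retries and must never cancel on its own.
        events.append(("retry", None))
        return events
    if ack_client_on == "durability":
        events.append(("client_ack", None))  # early ack: one round-trip
    for f in followers:
        # Followers that missed the tentative get payload and completion together.
        kind = "complete" if f in reached else "complete_with_payload"
        events.append((kind, f))
    if ack_client_on == "completion":
        events.append(("client_ack", None))  # late ack: after completion is sent
    return events


early = leader_round(["a", "b", "c"], reached={"a", "b"}, required_acks=2)
```

With `ack_client_on="completion"` the same call places the client acknowledgement after the completion messages, which is the slower option the post describes as improving quorum reads.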
Operational numbers
No operational numbers in this post. The series is a framework / pedagogy piece at the protocol-design altitude — round-trip counts and failure-mode classifications rather than QPS / latency / durability SLOs. The 2022 publication date and Sugu's framing are both consistent with "here is the mental model, here is why it is load-bearing" rather than "here are our production numbers."
Caveats
- Pedagogy, not production disclosure. The post is part of a structured series aimed at re-deriving consensus primitives from first principles; it does not describe a specific Vitess deployment, a specific customer outage, or specific production numbers. Cite it for the mental model, not for deployment specifics.
- MySQL semi-sync critique is unreferenced here. Sugu points readers to his older Distributed durability in MySQL for the corner-case catalogue; that post is not in the wiki. The split-brain hazard is named but not worked in detail.
- Part 7 (request propagation) not yet ingested. The two-phase shape's interaction with leadership changes — specifically, how a new elector safely propagates tentative requests left by the old leader — is deferred to the next instalment.
- Publication date ambiguity. Raw frontmatter `published:` is 2026-04-21 (re-fetch date); body byline reads June 21, 2022. Architectural content is still current because the two-phase shape is a foundational design claim, not a perishable implementation detail. Treated analogously to prior consensus-series re-fetches (part 4 2022-04-06, part 5 2022-04-28).
Source
- Original: https://planetscale.com/blog/consensus-algorithms-at-scale-part-6
- Raw markdown: `raw/planetscale/2026-04-21-consensus-algorithms-at-scale-part-6-completing-requests-839ff364.md`
Related
- sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-4-establishment-and-revocation — part 4 of the series, canonicalising the revoke/establish split.
- sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-5-handling-races — part 5 of the series, canonicalising lock-based vs lock-free race resolution; precondition for why part 6 can assume a stable leader when describing the commit path.
- systems/vitess — the production context.
- systems/mysql — the target of the explicit critique of semi-sync's apply-on-receive behaviour.
- concepts/tentative-request, concepts/durable-request, concepts/two-phase-completion-protocol, concepts/request-cancellation, concepts/mysql-semi-sync-split-brain, concepts/quorum-read — concepts introduced by this ingest.
- patterns/two-phase-tentative-then-complete, patterns/skip-completion-for-late-followers, patterns/early-ack-on-durability — patterns introduced by this ingest.
- concepts/consistent-read, concepts/leader-lease, concepts/forward-progress, concepts/split-brain — existing concepts extended by this ingest.
- companies/planetscale — the publishing company.