PATTERN Cited by 1 source
Single-ack completion with wider election reach¶
Pattern¶
Configure a consensus-backed system so that the request path acknowledges the client after a single replica ack (k = 1), while the election path scans every node that could have acknowledged the last transaction. The two paths sit at opposite ends of the durability-predicate axis: the request path minimises tail latency by waiting only for the fastest replica; the election path absorbs the correctness cost by scanning widely, with the scan deliberately bounded so that it does not turn into an unbounded search.
The pattern is the extreme application of common-case-frequency-asymmetry to consensus: the common path (requests, many per second) is optimised at every turn; the rare path (elections, roughly daily) pays the compensating cost.
Canonical statement — YouTube production¶
Sugu Sougoumarane's Part 3 disclosure of the YouTube production shape:
"At YouTube, although the quorum size was big, a single ack from a replica was sufficient for a request to be deemed completed. On the other hand, the leader election process had to chase down all possible nodes that could have acknowledged the last transaction. We did consciously trade off on the number of ackers to avoid going on a total wild goose chase." (Source: sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-3-use-cases)
Four decisions bundled into one disclosure:
- Single-ack completion was the operational default on the request path, despite the replica set being "big".
- Election ran wider than the ack predicate — "chase down all possible nodes that could have acknowledged."
- Election-scan width was bounded by design — the "conscious trade-off on the number of ackers" clause acknowledges that unbounded election search (every node that was ever in the cluster) would have been a "total wild goose chase".
- The trade-off is explicit: request throughput + tail latency vs election-time completeness. YouTube optimised for request-side and accepted the wider-election cost.
How it works under intersecting quorums¶
k = 1 durability + "scan every node" election is a valid intersecting-quorum pair:
- Write set W: any single node that acks (size 1).
- Read/election set R: every node in the cluster (size N).
- Intersection: W ⊆ R trivially, so every possible write set shares its one node with the election set. An intersection of ≥ 1 node is guaranteed.
This is correct by construction as long as the election actually reaches every potential ack-recipient. The bounded-search knob is where the engineering gets interesting: in a very-long-lived cluster with node churn, "every node that could have ack'd" includes nodes that have since left. Those nodes have to be enumerated — from a Vitess/MySQL topology service, a replica list, a historical GTID-source table — and each one probed during election.
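The safety condition can be checked mechanically. The following is a toy model rather than Vitess code: the node names and both function names are hypothetical, and the set of potential ackers is assumed to be known exactly.

```python
from typing import Iterable, Set


def election_scan_is_safe(potential_ackers: Iterable[str], scanned: Iterable[str]) -> bool:
    """With k = 1, each potential acker is a possible write set on its own,
    so the scan intersects every write set only if it covers all of them."""
    return set(potential_ackers) <= set(scanned)


def unscanned_ackers(potential_ackers: Iterable[str], scanned: Iterable[str]) -> Set[str]:
    """Nodes that might hold the only copy of the last acked transaction."""
    return set(potential_ackers) - set(scanned)


# Hypothetical shard: three live replicas plus one node that recently left.
live = {"replica-a", "replica-b", "replica-c"}
departed = {"replica-old"}
potential = live | departed

print(election_scan_is_safe(potential, live | departed))  # True
print(election_scan_is_safe(potential, live))             # False
print(unscanned_ackers(potential, live))                  # {'replica-old'}
```

Scanning only the live replicas fails the check because the departed node may have been the single acker of the last transaction.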
The "conscious trade-off" on scan width¶
YouTube's "consciously trade off on the number of ackers to avoid going on a total wild goose chase" is the load-bearing operational clause. Three constraints on scan width:
- Correctness floor: must reach every node that could hold a durable write. This is the intersecting-quorum safety condition.
- Bounded operational cost: must complete in bounded time. If the scan sweeps thousands of node entries on every election, elections become operationally infeasible.
- Node-churn aware: must correctly handle nodes that have left the cluster. Typically via a topology service (etcd, Vitess topo) that maintains a current "potential ack-recipients" set.
The "bound" is not specified numerically in Part 3. In practice at YouTube scale, it would plausibly be the shard's current replica set (not every node ever) plus any recently departed nodes whose state has not yet been garbage-collected from the topology.
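One way to picture such a bound is a candidate list built from topology records. This is a minimal sketch under stated assumptions: `TopoEntry`, `scan_candidates`, and the `retention_s` window are all hypothetical and are not taken from Part 3 or from the Vitess topology API. A time-based cut-off is only safe if nodes that departed earlier can no longer hold an unreplicated ack.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TopoEntry:
    """Hypothetical topology record for one node of a shard."""
    name: str
    departed_at: Optional[float] = None  # None means still in the replica set


def scan_candidates(entries: List[TopoEntry], now: float, retention_s: float) -> List[str]:
    """Current replicas plus departed nodes whose records are still retained."""
    keep = []
    for entry in entries:
        still_member = entry.departed_at is None
        recently_departed = (not still_member) and (now - entry.departed_at <= retention_s)
        if still_member or recently_departed:
            keep.append(entry.name)
    return sorted(keep)


entries = [
    TopoEntry("a"),
    TopoEntry("b"),
    TopoEntry("old-1", departed_at=900.0),  # left 100 s ago
    TopoEntry("old-2", departed_at=100.0),  # left 900 s ago
]
print(scan_candidates(entries, now=1000.0, retention_s=300.0))  # ['a', 'b', 'old-1']
```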
When to use this pattern¶
Common case — high-throughput OLTP sharded database:
- Thousands of requests per second per shard.
- Single-ack durability is enough for the deployment's failure-tolerance envelope (e.g., "one node at a time" + cross-cell ackers).
- Elections are daily or less (common in sharded production).
- Tail latency matters: with k > 1 every request waits for the slowest of its k acks, and across hundreds of shards that effect compounds.
YouTube's original MySQL-backed video-metadata shard fits this shape. So does the general sharded-vttablet production topology Vitess inherits.
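The tail-latency argument can be illustrated with a small simulation. The log-normal latency model, the replica count, and all function names below are assumptions chosen for illustration, not measurements from YouTube or Vitess.

```python
import random
import statistics


def ack_wait_ms(replica_latencies_ms, k):
    """Time until the k-th fastest replica has acked."""
    return sorted(replica_latencies_ms)[k - 1]


def simulate(num_requests, num_replicas, k, seed=0):
    """Per-request wait times under a hypothetical latency distribution."""
    rng = random.Random(seed)
    waits = []
    for _ in range(num_requests):
        # Assumed model: log-normal replica latency, median about 2.7 ms.
        latencies = [rng.lognormvariate(1.0, 0.8) for _ in range(num_replicas)]
        waits.append(ack_wait_ms(latencies, k))
    return waits


def p99(values):
    """99th percentile via the standard library's quantile cut points."""
    return statistics.quantiles(values, n=100)[98]


for k in (1, 2, 3):
    waits = simulate(num_requests=2000, num_replicas=5, k=k)
    print(f"k={k}: p99 wait = {p99(waits):.2f} ms")
```

Because every run uses the same seed, each request sees identical replica latencies across the three values of k, so the k = 1 wait is never longer than the k = 2 or k = 3 wait for the same request.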
Anti-use case — strict-durability workloads:
- Financial ledgers, audit trails, regulatory-recorded transactions. Single-ack durability is not enough: the business cost of losing a request to the single-ack in-flight failure mode outweighs the tail-latency benefit. Use k = 2 or higher with cross-boundary predicate tightening instead.
Trade-offs explicitly accepted¶
- Single-ack failure mode: the leader crashes, and the single ack-recipient also crashes before it can propagate the write. The request is lost. This is the failure sequence Part 3 enumerates; YouTube's deployment accepted its rarity.
- Wider election cost: an election takes longer than a majority-quorum election. Most elections are daily or rarer, so the cost is amortised, but a burst of elections during a regional incident could become costly.
- Scan completeness depends on the topology service: if the topology service loses state about which nodes could have been ackers, an election may violate safety. This is a strong operational dependency on topology-service durability.
Relationship to Aurora 4-of-6¶
Aurora's 4-of-6 write / 3-of-6 read is a different intersecting-quorum point — higher durability (k = 4), lower election reach (3 of 6 suffices because 4 + 3 > 6). YouTube's k = 1 + scan-everything is at the opposite extreme of the same design space. Both are valid intersecting-quorum instances; they optimise for different points.
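Both design points can be checked against the same size-based intersection rule. In this small sketch, `quorums_intersect` is a hypothetical helper, and the rule assumes that any set of the stated size counts as a valid quorum.

```python
def quorums_intersect(n: int, write_size: int, read_size: int) -> bool:
    """Any write set and any read/election set share a node iff w + r > n."""
    return write_size + read_size > n


# Aurora: 4-of-6 write, 3-of-6 read.
print(quorums_intersect(6, 4, 3))  # True

# Single-ack write with a full scan, for a few cluster sizes.
for n in (3, 5, 15):
    print(n, quorums_intersect(n, 1, n))  # True for each n

# Single-ack write with only a majority election scan does not intersect.
print(quorums_intersect(6, 1, 4))  # False
```

The last case shows why k = 1 forces the election to reach all N nodes: with a single-node write set, any election set smaller than N can miss the one node that holds the write.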
Relationship to MySQL semi-sync¶
MySQL semi-sync with rpl_semi_sync_master_wait_for_slave_count = 1 is structurally similar on the request side (single-ack durability). The difference is on the election side: native MySQL failover does not automatically scan every potential ack-recipient before promoting a new primary, so semi-sync on a partition-prone topology admits minority-quorum writeability and split-brain. The pattern here requires the election path to scan actively; semi-sync on its own does not, which is why Noach's critique of semi-sync holds. YouTube's system composed single-ack durability with a custom wide-scan election path; MySQL semi-sync alone does not.
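In outline, the active scan that the pattern requires, and that semi-sync on its own lacks, might look like the following. This is a hedged sketch, not how Vitess or YouTube implemented it: `elect` is hypothetical, an integer position stands in for a real GTID-set comparison, and refusing to promote when any probe fails is one conservative policy among several.

```python
from typing import Dict, Optional


def elect(positions: Dict[str, Optional[int]]) -> str:
    """Pick the most advanced node among all potential ackers.

    positions maps each potential acker to its last applied transaction
    number, or None if the probe failed (assumed representation).
    """
    unreachable = sorted(name for name, pos in positions.items() if pos is None)
    if unreachable:
        # An unprobed node might hold the only copy of the last acked write.
        raise RuntimeError(f"cannot promote safely; unreachable: {unreachable}")
    # Iterate names in sorted order so ties resolve deterministically.
    return max(sorted(positions), key=lambda name: positions[name])


print(elect({"replica-a": 10, "replica-b": 12, "replica-c": 11}))  # replica-b

try:
    elect({"replica-a": 10, "replica-b": None})
except RuntimeError as err:
    print(err)  # cannot promote safely; unreachable: ['replica-b']
```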
Seen in¶
- sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-3-use-cases — canonical wiki introduction of the YouTube production instance. The "conscious trade-off" clause is the rare first-person disclosure that election-time scan width was bounded by design rather than unbounded.
Related¶
- concepts/durability-as-use-case-dependent — the framing that legitimises k = 1 durability as a design choice.
- concepts/intersecting-quorums — the arithmetic that makes k = 1 write + full-scan election safe.
- concepts/durable-request — what a single-ack request becomes after the ack arrives.
- concepts/failure-tolerance-envelope — the envelope within which single-ack durability is sufficient.
- patterns/pluggable-durability-rules — the broader architectural pattern this is an extreme instance of.
- patterns/optimize-for-common-case-frequency-asymmetry — the general principle; this pattern is its maximum-asymmetry application.
- systems/vitess — canonical wiki instance; Vitess inherits the YouTube lineage.
- systems/mysql — replication substrate; MySQL semi-sync alone does not implement the pattern (needs a custom wide-scan election path).