
CONCEPT Cited by 1 source

Semi-sync timeout fallback

Definition

Semi-sync timeout fallback is the MySQL behaviour in which a primary that has waited rpl_semi_sync_master_timeout milliseconds without receiving acks from enough semi-sync replicas silently falls back to asynchronous replication: it commits the transaction anyway and reports success to the client, as if semi-sync had never been configured.

Shlomi Noach's framing:

"The primary waits up to rpl_semi_sync_master_timeout, after which it falls back to asynchronous replication mode, committing and responding to the user even if not all expected replicas have acknowledged receipt. In the scope of this post, we are interested in an 'infinite' timeout, which we will accept to be a very large number." (Source: sources/2026-04-21-planetscale-mysql-semi-sync-replication-durability-consistency-and-split-brains)

Why it is a hazard

The fallback is silent from the client's side. The primary does not reject the write and returns no error or warning on the connection; the only traces are server-side (status variables and the error log). From the user's perspective, the commit looks identical to a healthy semi-sync commit. But the durability guarantee, "if you told me it succeeded, the data is durable elsewhere", has evaporated. At the moment the client is told the commit succeeded, the data may exist only on the primary, and if the primary is lost in that window the transaction is lost with it.

Every minute the system runs in fallback is a minute with semi-sync's contract silently broken.

When it fires

The timeout fires whenever the number of ack'ing semi-sync replicas fails to reach rpl_semi_sync_master_wait_for_slave_count within the window. Common triggers:

  • Replica overload: if the replica is too busy even to write incoming events to its relay log, acks lag. (Acks are sent on receipt, not on apply, so apply lag alone does not delay them.)
  • Network packet loss or partition: acks are delayed or dropped.
  • All semi-sync replicas offline: restart, maintenance, or failure takes every ack'ing candidate out.
  • Primary I/O stall: if the primary is slow to send binary-log events to its replicas, acks return slowly.
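
The firing condition can be modelled in a few lines. The sketch below is a toy model, not MySQL code: `commit_mode` and its parameters are hypothetical names chosen for illustration, and ack arrival times are supplied as plain numbers rather than measured from a network.

```python
from typing import Optional, Sequence


def commit_mode(ack_delays_ms: Sequence[Optional[float]],
                timeout_ms: float,
                wait_for_count: int = 1) -> str:
    """Toy model of the primary's decision for a single transaction.

    ack_delays_ms:  per-replica delay until its ack arrives, or None if the
                    replica never acks (offline, partitioned).
    timeout_ms:     stands in for rpl_semi_sync_master_timeout.
    wait_for_count: stands in for rpl_semi_sync_master_wait_for_slave_count.
    """
    # Count the acks that arrive inside the timeout window.
    in_time = sum(1 for d in ack_delays_ms if d is not None and d <= timeout_ms)
    return "semi-sync" if in_time >= wait_for_count else "async-fallback"


# One healthy replica is enough when wait_for_count is 1.
print(commit_mode([5.0, None], timeout_ms=10_000))
# Every replica offline: the timeout fires and the commit falls back.
print(commit_mode([None, None], timeout_ms=10_000))
# Two acks required, but the second arrives after the window closes.
print(commit_mode([5.0, 12_000.0], timeout_ms=10_000, wait_for_count=2))
```

Each trigger in the list above is just a different reason for a delay to be large or missing; the primary only sees whether enough acks arrived in time.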

The two postures

Operators face a binary choice expressed as a tunable knob:

| Timeout setting | Posture | Trade-off |
|---|---|---|
| Finite (default, usually 10s) | Availability-first: commits keep returning even when acks don't arrive. | Durability guarantee is best-effort; silent failure mode exists. |
| Effectively infinite (very large number) | Durability-first: commits block indefinitely if acks don't arrive. | Primary availability is gated on replica health; a sick replica can hang all writes. |

Noach's discussion assumes the second posture, which is the only way to preserve the durability guarantee. That posture is canonicalised as patterns/infinite-semi-sync-timeout.
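
To make the trade-off concrete, here is a minimal sketch of what the client observes under each posture. It is an illustrative model only: `Outcome`, `commit_outcome`, `FINITE` and `INFINITE` are hypothetical names, and an infinite float stands in for the "very large number" an operator would actually configure.

```python
import math
from typing import NamedTuple, Optional


class Outcome(NamedTuple):
    state: str         # "acked", "fallback" or "blocked"
    latency_ms: float  # time until the client gets an answer; inf = never


def commit_outcome(kth_ack_ms: Optional[float], timeout_ms: float) -> Outcome:
    """kth_ack_ms: when the last required ack arrives, or None if it never does."""
    if kth_ack_ms is not None and kth_ack_ms <= timeout_ms:
        # Enough acks arrived inside the window: a normal semi-sync commit.
        return Outcome("acked", kth_ack_ms)
    if math.isinf(timeout_ms):
        # Durability-first: the commit waits for acks that are not coming.
        return Outcome("blocked", math.inf)
    # Availability-first: the primary answers at the timeout, without the acks.
    return Outcome("fallback", timeout_ms)


FINITE = 10_000.0    # the 10 s default
INFINITE = math.inf  # idealised stand-in for a very large configured value

for ack in (5.0, 30_000.0, None):
    print(ack, commit_outcome(ack, FINITE).state, commit_outcome(ack, INFINITE).state)
```

A slow ack (30 s in the loop above) is where the postures diverge without any replica being down: the finite setting answers early and unreplicated, the infinite one answers late but acked.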

Operational signals

If you run with a finite timeout, the relevant observability signals are:

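  • rpl_semi_sync_master_no_tx / rpl_semi_sync_master_yes_tx status variables: cumulative counts of commits that were not, and were, acknowledged by a replica. Any increase in no_tx means commits are being returned without acks.
  • rpl_semi_sync_master_status: whether semi-sync is currently operational on the primary. It reads OFF when the plugin is disabled and also while the primary has fallen back to asynchronous replication.
  • rpl_semi_sync_master_wait_sessions: the number of sessions currently blocked waiting for acks; spikes before fallback.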
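
Because the two transaction counters are cumulative, a monitor has to compare snapshots rather than read absolute values. The following is a rough sketch of that comparison, assuming each snapshot is a name-to-string mapping such as one built from SHOW GLOBAL STATUS output (where the names appear with a capital R). `fallback_report` is a hypothetical helper, fetching the snapshots is left out, and counter resets after a server restart are not handled.

```python
from typing import Dict, Mapping


def fallback_report(prev: Mapping[str, str], curr: Mapping[str, str]) -> Dict[str, object]:
    """Compare two snapshots of the semi-sync status variables.

    Assumption: values are strings, as a status query would return them,
    and the counters did not reset between the two snapshots.
    """
    no_tx = int(curr["Rpl_semi_sync_master_no_tx"]) - int(prev["Rpl_semi_sync_master_no_tx"])
    yes_tx = int(curr["Rpl_semi_sync_master_yes_tx"]) - int(prev["Rpl_semi_sync_master_yes_tx"])
    total = no_tx + yes_tx
    status_on = curr["Rpl_semi_sync_master_status"] == "ON"
    return {
        "unacked_commits": no_tx,
        "unacked_ratio": no_tx / total if total else 0.0,
        "semi_sync_on": status_on,
        "waiting_sessions": int(curr["Rpl_semi_sync_master_wait_sessions"]),
        # Any unacked commit in the interval, or status OFF, means fallback.
        "alert": no_tx > 0 or not status_on,
    }


earlier = {"Rpl_semi_sync_master_no_tx": "0", "Rpl_semi_sync_master_yes_tx": "100",
           "Rpl_semi_sync_master_status": "ON", "Rpl_semi_sync_master_wait_sessions": "0"}
later = {"Rpl_semi_sync_master_no_tx": "5", "Rpl_semi_sync_master_yes_tx": "195",
         "Rpl_semi_sync_master_status": "OFF", "Rpl_semi_sync_master_wait_sessions": "3"}
print(fallback_report(earlier, later))
```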

The architectural alternative

The underlying problem — that durability is a hard-coded "wait for k acks" rule with no structured response to replica unavailability — is what patterns/pluggable-durability-rules is designed to generalise. A plugin could express: "if the primary's own DC is intact but a remote DC is lagging, preserve durability without blocking; if the primary is in a minority partition, block writes entirely." MySQL's semi-sync does not offer this expressiveness.
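
As a purely hypothetical illustration of what such a rule could look like (this is not an API of MySQL or of any existing plugin framework; `ClusterView`, `Rule`, `wait_for_k` and `dc_aware` are invented for the sketch), a durability rule can be treated as a function from the primary's view of the cluster to a commit-or-block decision:

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class ClusterView:
    primary: str
    dc_of: Dict[str, str]      # every node, mapped to its data centre
    reachable: FrozenSet[str]  # nodes the primary can reach, itself included
    acked: FrozenSet[str]      # replicas that have acked this transaction


Rule = Callable[[ClusterView], bool]  # True = safe to commit, False = keep blocking


def wait_for_k(k: int) -> Rule:
    """The fixed rule semi-sync hard-codes: any k acks will do."""
    return lambda view: len(view.acked) >= k


def dc_aware(view: ClusterView) -> bool:
    """Block in a minority partition; otherwise accept one ack from the local DC."""
    if 2 * len(view.reachable) <= len(view.dc_of):
        return False  # primary sees no majority: block writes entirely
    local_dc = view.dc_of[view.primary]
    return any(view.dc_of[r] == local_dc for r in view.acked)


dcs = {"p": "dc1", "r1": "dc1", "r2": "dc2", "r3": "dc2", "r4": "dc2"}
remote_lagging = ClusterView("p", dcs, frozenset({"p", "r1", "r2"}), frozenset({"r1"}))
minority = ClusterView("p", dcs, frozenset({"p", "r1"}), frozenset({"r1"}))

print(wait_for_k(2)(remote_lagging), dc_aware(remote_lagging))
print(wait_for_k(1)(minority), dc_aware(minority))
```

In the first view the fixed two-ack rule keeps blocking while the DC-aware rule commits on the local ack; in the second, a one-ack rule would commit inside a minority partition while the DC-aware rule refuses.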

Seen in
