CONCEPT Cited by 2 sources
Split-brain¶
Split-brain is the failure mode in which two or more nodes each believe they have authoritative ownership of the same resource, typically a key, partition, or leadership role. Its dual is the case where no node believes it owns the resource, so requests are silently dropped. Both shapes arise under network partitions, crashes, and intermittent failures when ownership is decided without a coordinator.
Where it shows up in sharding¶
Static-sharding schemes compute key → node independently in each client (concepts/static-sharding). When pods crash or become intermittently unresponsive, clients can arrive at inconsistent views of membership:
- Two pods own the same key → writes may conflict or be silently lost; cache coherence breaks.
- No pod owns the key → customer traffic is dropped entirely.
Static consistent-hashing has no pathway to prevent either outcome because there's no authoritative record of "who owns what right now", only whichever pods each client sees as alive.
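The divergence can be reproduced in a few lines. The following is a minimal sketch, not any particular system's routing code: it assumes rendezvous (highest-random-weight) hashing as a stand-in for the static scheme, and the pod names, client names, and the `disputed_keys` helper are all hypothetical.

```python
import hashlib


def score(pod: str, key: str) -> int:
    """Deterministic weight for a (pod, key) pair."""
    digest = hashlib.sha256(f"{pod}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(key: str, live_pods: list) -> str:
    """Rendezvous hashing: the live pod with the highest weight owns the key.

    Every client runs this locally against its own view of live pods,
    so there is no shared record of the result.
    """
    return max(live_pods, key=lambda pod: score(pod, key))


def disputed_keys(keys: list, views: dict) -> dict:
    """Return the keys for which clients compute different owners."""
    disputed = {}
    for key in keys:
        owners = {client: owner(key, pods) for client, pods in views.items()}
        if len(set(owners.values())) > 1:
            disputed[key] = owners
    return disputed


# Hypothetical membership views: client-b has (perhaps wrongly) marked pod-1 dead.
views = {
    "client-a": ["pod-0", "pod-1", "pod-2"],
    "client-b": ["pod-0", "pod-2"],
}
keys = [f"user-{i}" for i in range(100)]
disputed = disputed_keys(keys, views)
print(f"{len(disputed)} of {len(keys)} keys are routed to two different pods")
```

Every disputed key is one that client-a sends to pod-1 while client-b sends it elsewhere, so two pods receive writes for the same key and each behaves as its owner. Nothing in the scheme lets either pod detect the conflict.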
Mitigations¶
- Central coordinator (the dynamic-sharding posture of systems/dicer / systems/slicer / systems/shard-manager): assignment is state published by one authority; clients converge on it.
- Leases (systems/centrifuge / systems/slicer): ownership is a time-bounded lease issued by the coordinator; a pod without a valid lease does not serve.
- Consensus-based leadership (Raft / Paxos / ZooKeeper): the strongest form, at a coordination cost that "soft" leader election (concepts/soft-leader-election) deliberately avoids.
Trade-off with eventual consistency¶
systems/dicer chose concepts/eventual-consistency for publishing its Assignment: pods and clients may briefly hold different views of ownership during transitions. The system is designed so that such transient disagreement doesn't corrupt state, but it explicitly does not provide the exclusive-ownership guarantee that leases or consensus would. Applications that need that guarantee layer it on top (or use a different system).
Seen in¶
- sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-7-propagating-requests — canonical wiki framing that anti-flapping rules are the operational mitigation that makes MySQL-at-massive-scale split-brain-free. Sugu Sougoumarane: "This is the reason why organizations have been able to avoid split-brain scenarios while running MySQL at a massive scale." The substrate is Orchestrator (or Vitess's VTOrc fork) + MySQL binlog GTID + timestamp metadata + anti-flapping rate-limit on leadership changes — an empirical rather than formal correctness argument, but validated at large scale.
- sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder — split-brain named as one of the three structural failure modes of static sharding that motivated Dicer.
- sources/2026-04-21-planetscale-consensus-algorithms-at-scale-part-6-completing-requests — commit-path instance: a MySQL primary that completes unverified in-flight requests on restart can produce semi-sync split-brain — distinct from the leadership-election split-brain. The two-phase completion protocol is the generic commit-path shape that prevents this by requiring explicit durability re-check before completing tentative records on restart.