CONCEPT Cited by 1 source

Partial network partition

Definition

A partial network partition is a connectivity failure in which a node can reach some of its network peers but not others, producing a split fabric rather than a clean on/off break. Unlike a classical Jepsen-style partition (the pathological two-subset split), partial partitions leave nodes individually reachable along one path while silently blocking a different path.

The canonical signature is a single node showing multiple different reachability verdicts for different peers, or a single peer-pair showing different verdicts along different transports.
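The signature above can be checked mechanically once per-path probe results are available. The following is a minimal sketch, assuming a hypothetical probe table keyed by (source, destination, transport); the node and transport names are illustrative and do not come from the incident report.

```python
from collections import defaultdict

# Hypothetical probe result key: (source, destination, transport).
Probe = tuple[str, str, str]


def mixed_verdicts(probes: dict[Probe, bool]) -> tuple[set[str], set[tuple[str, str]]]:
    """Return (nodes, pairs) that show both reachable and unreachable verdicts.

    nodes: sources whose probes disagree across peers or transports.
    pairs: (source, destination) pairs whose probes disagree across transports.
    """
    per_node = defaultdict(set)  # source -> set of verdicts seen
    per_pair = defaultdict(set)  # (source, destination) -> set of verdicts seen
    for (src, dst, _transport), reachable in probes.items():
        per_node[src].add(reachable)
        per_pair[(src, dst)].add(reachable)
    nodes = {node for node, verdicts in per_node.items() if len(verdicts) == 2}
    pairs = {pair for pair, verdicts in per_pair.items() if len(verdicts) == 2}
    return nodes, pairs


# Illustrative data only: db-a reaches db-b over one transport but not another.
probes = {
    ("db-a", "db-b", "tcp-replication"): False,
    ("db-a", "db-b", "https-public"): True,
    ("db-a", "registry", "https-private"): True,
    ("db-c", "db-b", "tcp-replication"): True,
}
nodes, pairs = mixed_verdicts(probes)
```

A node or pair that appears in either result is a candidate for a partial partition; a node whose probes all fail looks like an ordinary outage instead.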

Three manifestations from PlanetScale 2025-10-20

The 2025-10-20 incident documents three partial-partition shapes in a single paragraph:

  1. Internet reachable + cross-AZ unreachable: "some database servers were reachable from the Internet but couldn't communicate across availability zones for query routing, replication, or both." The public-internet fabric works, the inter-AZ private fabric doesn't.
  2. Container registry reachable + primary unreachable: "some replicas could reach container registries when they started up but could not replicate from their primary MySQL or Postgres." Two private-network destinations, two different verdicts.
  3. Internal DNS split: "some servers had trouble resolving internal DNS names and others had trouble connecting to the internal services those DNS names resolved." DNS works for some paths, not others; even where DNS returns a valid answer, the downstream connection to the resolved target may still fail.

Verbatim framing:

The network partitions caused a significant percentage of some customers' queries to fail. Not all database branches were affected as the impact depended heavily on which availability zones were in use and whether traffic was crossing between zones.

Why they are harder to reason about than total partitions

  • Quorum protocols assume symmetric reachability. A quorum configured as a minimum of 2 replicas across 3 AZs tolerates one AZ going fully offline because the remaining two AZs can still form a 2-of-3 majority. If those two AZs cannot reach each other, neither can assemble a majority and the tolerance guarantee breaks. PlanetScale's closing observation: "the use of three availability zones allows us to tolerate the failure of one but only if network connectivity between the other two remains reliable."
  • Health checks are transport-specific. An HTTP health check over the public fabric can report "up" while the cross-AZ replication path is completely broken. Operators need per-path health signals to diagnose.
  • Stuck connections persist after healing. Surviving a partition is not the same as recovering from one — some processes acquire state during a partition (stale DNS caches, broken TCP connections, confused leader-election state) that only a restart clears. PlanetScale: "Once the network partitions healed, we found a small number of processes (PlanetScale's edge load balancer as well as vtgate) which were not able to recover on their own."
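The quorum point can be made concrete with a small model. This is a sketch under simplifying assumptions, not PlanetScale's or Vitess's actual logic: one replica per AZ, links are symmetric, and a replica can lead only if it can directly reach a majority of replicas (counting itself). The function name `quorum_leaders` and the AZ labels are hypothetical.

```python
def quorum_leaders(azs: list[str], down: set[str], broken_links: set[frozenset]) -> list[str]:
    """Return the AZs whose replica can still reach a majority of replicas.

    azs: all AZs hosting one replica each (assumption).
    down: AZs that are fully offline.
    broken_links: unordered AZ pairs that cannot communicate.
    """
    majority = len(azs) // 2 + 1
    leaders = []
    for az in azs:
        if az in down:
            continue
        # Count this replica plus every live peer it still has a working link to.
        reachable = 1 + sum(
            1
            for other in azs
            if other != az
            and other not in down
            and frozenset((az, other)) not in broken_links
        )
        if reachable >= majority:
            leaders.append(az)
    return leaders


azs = ["az1", "az2", "az3"]

# One AZ fully offline: the other two still form a 2-of-3 majority.
one_az_down = quorum_leaders(azs, {"az3"}, set())

# Same outage, but the two surviving AZs cannot reach each other: no majority.
down_plus_split = quorum_leaders(azs, {"az3"}, {frozenset(("az1", "az2"))})
```

In this model the second case leaves no AZ able to lead, which is the condition the PlanetScale quote describes: tolerating one AZ failure depends on the link between the other two staying healthy.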

Operational response

The response patterns that worked in the 2025-10-20 incident generalise to any partial-partition event:

  • Manual reparenting to healthier AZs (zonal reparenting): "we manually sent reparent requests to move primary databases to availability zones known to be healthier or known to be colocated with the customer's application." Operator-driven because automatic detection of which AZ is healthier requires per-path visibility most health systems don't have.
  • Restart stuck processes after partition heals. Processes that don't self-recover need explicit restart; this is a stuck-connection problem distinct from a split-brain disagreement problem.
  • AZ-topology redesign — use more AZs when available so the probability of a majority-side network partition shrinks. us-east-1's six-AZ topology is named explicitly in the 2025-10-20 post as a future-direction lever.

Contrast with classical network partition

  • Classical partition (Jepsen / CAP-theorem framing): two subsets, no cross-subset traffic. Each subset sees the other as down. Quorum protocols on the majority side continue.
  • Partial partition: per-path asymmetry. A node can be up from one vantage point and down from another simultaneously. Quorum protocols get ambiguous votes. Leader election sees conflicting evidence.
  • Gray failure (the closely related literature term): any failure where the system's view of health disagrees with the actual health of a component. Partial partition is one concrete realisation of gray failure.
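The first two cases can be told apart on a reachability graph. The sketch below assumes symmetric, link-level reachability (it does not model per-transport differences or gray failure in general) and uses a hypothetical `classify` helper: if every connected group of nodes is fully meshed, the split is clean; if some group contains two nodes that cannot talk directly, reachability is not transitive and the partition is partial.

```python
from itertools import combinations


def classify(nodes: list[str], links: set[frozenset]) -> str:
    """Classify a symmetric reachability graph as healthy, classical, or partial.

    links: unordered node pairs that can currently communicate.
    """
    # Build adjacency sets from the working links.
    adjacency = {node: set() for node in nodes}
    for pair in links:
        a, b = tuple(pair)
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Find connected components with an iterative depth-first traversal.
    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)

    # A clean split means every component is fully meshed internally.
    fully_meshed = all(
        frozenset(pair) in links
        for component in components
        for pair in combinations(sorted(component), 2)
    )
    if not fully_meshed:
        return "partial"
    return "healthy" if len(components) == 1 else "classical"


nodes = ["a", "b", "c", "d"]
all_links = {frozenset(pair) for pair in combinations(nodes, 2)}

healthy = classify(nodes, all_links)
classical = classify(nodes, {frozenset(("a", "b")), frozenset(("c", "d"))})
partial = classify(nodes, {frozenset(("a", "b")), frozenset(("b", "c"))})
```

In the last case `b` can reach both `a` and `c`, but `a` and `c` cannot reach each other, so different nodes hold conflicting views of who is up.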

Seen in

  • sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20: PlanetScale, Richard Crowley, 2025-11-03. Canonical wiki entry. Phase 2 of the 2025-10-20 AWS us-east-1 incident, ~14:30–19:30 UTC (~5 hours); three manifestations in a single paragraph (Internet-vs-cross-AZ, container-registry-vs-primary, internal-DNS split). Operator response: manual zonal reparenting + post-heal restart of stuck edge LB / vtgate processes. Closing framing: "Network partitions are one of the hardest failure modes to reason about, test, and tolerate."