
CONCEPT

Mesh latency health-check

Definition

Mesh latency health-check is the pattern in which every edge node in a POP mesh continuously pings every other edge node as part of routine health-checking, records the peer-to-peer latency measurements in a locally held table, and uses that table to sort each routing record's candidate list by measured latency, so that the next-hop decision always takes the lowest-latency peer.

Health-checking and latency measurement are fused into a single mechanism: the existing warm connections kept open for query backhaul double as measurement probes, and the routing table's ordering is a live function of the measurement table.

Canonical framing from PlanetScale's 2024-04-17 Global Network launch post: "we maintain warm connections between all of our regions ready to go, we utilize these to measure latency continuously as a part of regular health checking. So, for example, the us-east-1 edge node is continuously pinging its peers, similar to a mesh network and measuring their latency. Once a Route is seen over the etcd watcher, before it's accessible to being used, we are able to simply sort the list of clusters based on their latency times we already are tracking. We periodically re-sort every Route if/when latency values change. This keeps the 'next hop' decision always clusters[0] in practice." (Source: sources/2026-04-21-planetscale-introducing-global-replica-credentials.)
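The sort-on-arrival step described in the quote can be sketched as follows. This is an illustrative Python sketch, not PlanetScale's code: the function name `on_route_event`, the route key, and the latency values are all assumptions.

```python
import math

# Hypothetical latency table this edge node already maintains (ms).
latency = {"us-east-1": 2.0, "eu-west-1": 80.0, "ap-south-1": 190.0}

routes = {}  # route key -> latency-sorted cluster list

def on_route_event(key, clusters):
    # Fired by the control-plane watcher (e.g. an etcd watch) when a Route
    # is created or mutated. The cluster list is sorted against the latency
    # table we already track before the record becomes available for routing.
    routes[key] = sorted(clusters, key=lambda c: latency.get(c, math.inf))

on_route_event("customer-db", ["ap-south-1", "us-east-1", "eu-west-1"])
# The "next hop" is always routes[key][0] in practice.
```

Because the table is pre-populated by the health-check loop, the sort costs only a few dictionary lookups; no measurement happens on the route-event path.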

Structural properties

  • Continuous, not on-demand. Latency is measured continuously in the background. A routing decision never pays for a measurement; it pays for a lookup in the pre-computed table.
  • Fused with backhaul. The warm connection used to measure latency is the same connection used to tunnel queries. No separate measurement fleet, no separate probe protocol. See patterns/warm-mesh-connection-pool.
  • Latency-sorted data structure. Routing records (e.g. PlanetScale's Route.cluster list) are sorted in-place by measured latency. Next-hop = clusters[0]. Failover to clusters[1], clusters[2] is an implicit property of the sort: an unreachable peer reports infinite latency and drops to the bottom.
  • Re-sorted on two triggers. (a) New routing record arrives via control-plane watch (e.g. etcd mutation); sort before marking the record available. (b) Measured latency values drift; re-sort existing records periodically.
  • Unifies failover and latency-optimization. Traditional systems have separate "is peer up?" health-checks and "which peer is fastest?" ranking. Mesh latency health-check collapses them: the same measurement drives both decisions.
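The properties above can be condensed into one small structure. The following is a minimal Python sketch under assumed names (`EdgeNode`, `record_ping`, and the latency figures are illustrative, not PlanetScale's implementation):

```python
import math

class EdgeNode:
    """Illustrative edge node: the health-check loop feeds a latency
    table, and routing is a pure lookup-and-sort against that table."""

    def __init__(self):
        self.latency = {}  # peer -> last measured round-trip time (ms)

    def record_ping(self, peer, rtt_ms):
        # Called by the background health-check loop on every ping reply.
        self.latency[peer] = rtt_ms

    def mark_unreachable(self, peer):
        # A timed-out ping reports "infinite" latency, sinking the peer
        # to the bottom of every sorted candidate list.
        self.latency[peer] = math.inf

    def sort_route(self, clusters):
        # A routing decision never pays for a measurement, only a lookup.
        # Peers with no measurement yet sort last.
        return sorted(clusters, key=lambda c: self.latency.get(c, math.inf))

node = EdgeNode()
node.record_ping("us-east-1", 2.1)
node.record_ping("eu-west-1", 78.4)
node.record_ping("ap-south-1", 190.0)
route = node.sort_route(["ap-south-1", "eu-west-1", "us-east-1"])
# route[0] is the next hop; failover candidates follow in latency order.
```

Note how the same table drives both decisions the last bullet describes: "is the peer up?" (finite vs. infinite latency) and "which peer is fastest?" (sort order).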

Scaling

Continuous all-pairs pinging is O(N²) in edge-node count; for small-to-mid-sized POP meshes (tens of POPs) this is tractable. Larger meshes typically need:

  • Sampled pings rather than full pings.
  • Hierarchical measurement (region-to-region rather than node-to-node, then intra-region short probes).
  • Piggyback on existing traffic (use query-reply timings as latency samples rather than dedicated pings).

PlanetScale's published architecture doesn't specify which of these scaling techniques they apply.
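For concreteness, the first mitigation (sampled pings) might look like the sketch below. The function name, sample size k, and seeding are illustrative assumptions, not a description of any production system:

```python
import random

def sampled_probe_set(nodes, me, k=3, rng=None):
    # Instead of probing all N-1 peers every round (O(N^2) probes
    # mesh-wide), each node probes k random peers, for O(N * k) total.
    rng = rng or random.Random()
    peers = [n for n in nodes if n != me]
    return rng.sample(peers, min(k, len(peers)))

pops = [f"pop-{i}" for i in range(50)]
probes = sampled_probe_set(pops, "pop-0", k=3, rng=random.Random(1))
# Three peers probed this round; a full mesh would probe all 49.
```

Over successive rounds the random samples cover the whole peer set, at the cost of staler per-peer measurements between samples.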

Failover semantics

When clusters[0] fails:

  • The next ping times out → peer's measurement is set to "infinite" (or the health-check-timeout ceiling).
  • Next re-sort moves the dead peer to the end of the list.
  • Next query from the edge session goes to the new clusters[0].

No separate failover protocol is required; the sort handles it.
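The three-step flow above can be shown end to end in a few lines. A self-contained Python sketch, with hypothetical latency values:

```python
import math

# Latency table at one edge node just before the failure (ms, hypothetical).
latency = {"us-east-1": 2.1, "eu-west-1": 78.4, "ap-south-1": 190.0}
clusters = sorted(latency, key=latency.get)   # clusters[0] == "us-east-1"

# Step 1: the next ping to clusters[0] times out -> record "infinite" latency.
latency["us-east-1"] = math.inf

# Step 2: the next re-sort demotes the dead peer to the end of the list.
clusters = sorted(latency, key=latency.get)

# Step 3: the next query simply goes to the new clusters[0] ("eu-west-1").
# No failover protocol ran; the sort handled it.
```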

Seen in

  • sources/2026-04-21-planetscale-introducing-global-replica-credentials — canonical wiki disclosure. PlanetScale Global Network uses continuous peer-to-peer pinging across warmed mesh connections, feeding an always-sorted Route.cluster list at every edge POP. Adding or removing a read-only region from a customer's cluster triggers an etcd mutation → watcher fires → the edge node sorts the new cluster list against the already-tracked latency table before making the Route available. Failover for a dead region is framed as the natural consequence of the sort rather than as a separate protocol: "we could go over to the next option if there were multiple choices."