
PATTERN Cited by 1 source

Two-level regional/global state

Problem

A single global state-distribution cluster covering every region at once has two failure modes that get worse with scale:

  1. Unbounded blast radius — a bug in the producer, consumer, or schema propagates to every region simultaneously.
  2. Cross-region coupling — per-region changes trigger fleet-wide reconciliation even though they only matter locally.

At Fly.io, the 2024-09-01 global Anycast outage (contagious deadlock in fly-proxy) was the forcing function: a single bad update reached every proxy everywhere in milliseconds.

Pattern

Split the state into two tiers:

  • Per-region clusters — each region runs its own state-distribution cluster with fine-grained data about the local fleet. This is the high-cardinality / high-churn tier.
  • Global cluster — a smaller cluster that only carries cross-region-required coarse-grained data. For Fly.io's Anycast edge: "which regions run this app?" is enough to make edge forwarding decisions.

Consumers read their regional cluster for local details and the global cluster for cross-region routing.
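The read path can be sketched as follows. This is a toy model, not Corrosion's actual API: `Cluster`, `lookup`, and the table shapes are all hypothetical stand-ins for "coarse global mapping" and "fine regional state".

```python
class Cluster:
    """Toy stand-in for a state-distribution cluster: key -> value."""
    def __init__(self, data):
        self._data = data

    def lookup(self, key):
        return self._data.get(key)


def route_request(app_id, local_region, global_cluster, regional_clusters):
    """Edge-proxy logic: pick a region via the global tier, then
    resolve fine-grained machine state via that region's cluster."""
    # Global tier answers only the coarse question: which regions run this app?
    regions = global_cluster.lookup(app_id) or set()
    if not regions:
        return None, None
    # Prefer staying local; otherwise pick any region running the app.
    target = local_region if local_region in regions else sorted(regions)[0]
    # Regional tier holds the high-cardinality detail, consulted in-region only.
    machines = regional_clusters[target].lookup(app_id)
    return target, machines
```

Note that the global tier never sees individual machines; a proxy in a region that doesn't run the app only learns *where* to forward, and the destination region resolves the rest.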

Why it works

  • Blast radius bounded to a region for any bug in region-local code or data.
  • Cross-region traffic shrinks — most state changes stay in their region, reducing wire volume on inter-region links.
  • Rollouts are region-scoped — deploy a change to one region at a time; worst-case is one region down.
  • No coupling between regions — an incident in Tokyo doesn't touch Sydney's replica clusters.

The trade-off is coordination complexity between the two tiers: the regional clusters must periodically publish coarse-grained state to the global cluster, and the global cluster's schema becomes a cross-region contract.
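The regional-to-global publish step is essentially a rollup: derive a few coarse rows from many fine-grained ones. A minimal sketch, with hypothetical row shapes (the source doesn't specify what Corrosion publishes beyond "applications to regions"):

```python
def rollup(region, machine_rows):
    """Derive the coarse-grained rows a region publishes to the global tier.

    machine_rows: iterable of (app_id, machine_id, state) tuples --
    the fine-grained, high-churn regional table (hypothetical shape).
    """
    apps_present = {app for app, _mid, state in machine_rows
                    if state == "running"}
    # The published set is tiny and slow-changing: one (app, region) pair
    # per app, regardless of how many machines churn underneath it.
    return [(app, region) for app in sorted(apps_present)]
```

Most machine-level churn (restarts, migrations within a region) leaves the rollup output unchanged, which is why cross-region wire volume shrinks.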

Canonical wiki instance — Fly.io regionalization

From sources/2025-10-22-flyio-corrosion:

"After the contagious deadlock bug, we concluded we need to evolve past a single cluster. So we took on a project we call 'regionalization', which creates a two-level database scheme. Each region we operate in runs a Corrosion cluster with fine-grained data about every Fly Machine in the region. The global cluster then maps applications to regions, which is sufficient to make forwarding decisions at our edge proxies."

Crucial detail: "Nothing about Corrosion's design required us to [run a single global cluster]." The single domain was an operational default, not a protocol requirement — regionalization is entirely a deployment-shape change, not a new system.

Caveats

  • Schema discipline across tiers — any field that might ever be needed cross-region must be routed through the global tier; getting this wrong leaks regional state into the global cluster or forces costly migrations later.
  • Consistency seam — the regional-to-global publish path is eventually consistent. Systems depending on strict cross-region invariants need additional coordination.
  • Operational complexity — more clusters to monitor, upgrade, patch.
  • Routing / discovery — consumers need to know which regional cluster to consult; this adds a layer to the discovery mechanism.
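The consistency seam can be handled defensively at read time: because the global tier lags, it may still list a region after that region's last machine stopped. One sketch, assuming the reader is free to treat the regional view as authoritative (names and data shapes are illustrative):

```python
def resolve_with_fallback(app_id, candidate_regions, regional_data):
    """Try candidate regions from the (possibly stale) global mapping
    in order, skipping any whose regional view is already empty.

    regional_data: region -> {app_id: [machine_ids]} (toy stand-in).
    """
    for region in candidate_regions:
        machines = regional_data.get(region, {}).get(app_id)
        if machines:
            # Regional truth wins over a stale global row.
            return region, machines
    return None, None
```

This tolerates the lag for routing decisions; invariants that must hold strictly across regions still need coordination outside the publish path.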

Generalisation

The pattern applies broadly beyond Fly.io:

  • CDNs — per-PoP config stores with a global app → PoP mapping.
  • Service meshes — regional xDS control planes federated through a global gateway.
  • Feature flags / config — regional stores for high-churn flags, global store for cross-region-sensitive ones.
  • Observability backends — regional trace / metric stores with cross-region query federation.

Seen in
