CONCEPT Cited by 1 source
Regionalization blast-radius reduction¶
What it is¶
Regionalization is the structural response to a single global state-distribution cluster whose blast radius has proven unacceptable: split the state into per-region clusters (each holding fine-grained data local to its region) plus a smaller global cluster that propagates only the state needed for cross-region decisions. Most state changes then matter only inside their own region and are never propagated to other regions, so the blast radius of a bug in region-local state is bounded to that region.
Mechanism¶
- Identify what's actually cross-region. For Fly.io's Anycast edge, only "which regions run this app?" is globally required. Fine-grained Machine-level state is region-local: an edge proxy in Tokyo doesn't need to know the exact health of a Machine in Amsterdam, just that Amsterdam is a valid region for the app.
- Run one state-distribution cluster per region. Each region has its own Corrosion cluster with fine-grained Machine data.
- Run a smaller global cluster that maps applications → regions. Cross-region decisions (e.g. anycast routing) consult the global cluster; the regional clusters feed the global cluster with coarse-grained state.
- Most code changes are region-local. Because they touch only one region's state, new code can be rolled into a single region with bounded worst-case impact: a bug takes down one regional cluster, not the entire fleet.
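The two-level split described in the list above can be sketched in a few lines of Python. This is an illustrative model only, not Corrosion's API: the class names `RegionalCluster` and `GlobalCluster`, their methods, and the region codes `ams` and `nrt` are all hypothetical, and gossip-based replication is replaced by plain in-memory dictionaries.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RegionalCluster:
    """Hypothetical stand-in for one region's cluster (fine-grained state)."""
    region: str
    # app name -> {machine id -> healthy?}; this detail never leaves the region
    machines: dict[str, dict[str, bool]] = field(default_factory=dict)

    def set_machine(self, app: str, machine_id: str, healthy: bool) -> None:
        self.machines.setdefault(app, {})[machine_id] = healthy

    def apps_served(self) -> set[str]:
        """Coarse summary for the global cluster: apps with a healthy Machine."""
        return {app for app, ms in self.machines.items() if any(ms.values())}

    def healthy_machines(self, app: str) -> list[str]:
        return sorted(m for m, ok in self.machines.get(app, {}).items() if ok)


@dataclass
class GlobalCluster:
    """Hypothetical stand-in for the small global cluster: app -> regions."""
    app_regions: dict[str, set[str]] = field(default_factory=dict)

    def publish(self, region: str, apps: set[str]) -> None:
        # Replace this region's previous contribution with its latest summary.
        for regions in self.app_regions.values():
            regions.discard(region)
        for app in apps:
            self.app_regions.setdefault(app, set()).add(region)

    def regions_for(self, app: str) -> list[str]:
        return sorted(self.app_regions.get(app, set()))


ams = RegionalCluster("ams")
nrt = RegionalCluster("nrt")
ams.set_machine("web", "m1", True)
ams.set_machine("web", "m2", False)
nrt.set_machine("api", "m3", True)

glob = GlobalCluster()
for cluster in (ams, nrt):
    glob.publish(cluster.region, cluster.apps_served())

# A proxy in another region learns only that "web" runs in "ams".
print(glob.regions_for("web"))       # ['ams']
# Machine-level detail is answered by the regional cluster alone.
print(ams.healthy_machines("web"))   # ['m1']
```

The point of the sketch is the narrow interface between the two levels: only the output of `apps_served()` crosses a region boundary, so a malformed Machine record has no path into another region's state.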
Why it's a payoff¶
The 2024-09-01 global Anycast outage (contagious deadlock) was caused by a bug in the consumer of global Corrosion state. Because Corrosion was one cluster covering the whole fleet, every proxy received the deadlock-triggering update and every proxy deadlocked.
With regionalization:
- The buggy update would only propagate within its region.
- Other regions remain healthy and continue serving traffic (Anycast naturally fails over to healthy regions).
- The incident's scope is "one region down for a few minutes" rather than "global outage for tens of minutes."
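The difference in reach can be made concrete with a toy model. The sketch below assumes the bad update is written to fine-grained, region-local state and that every proxy sharing a cluster with the origin region receives it; the fleet layout, proxy names, and the `affected_proxies` helper are hypothetical. It does not model a bug in the global cluster's own data, which would still reach every region.

```python
def affected_proxies(fleet, clusters, origin_region):
    """Return the proxies that receive an update first written in origin_region.

    fleet:    dict mapping region -> list of proxy names
    clusters: list of sets of regions; each set shares one state cluster
    """
    for members in clusters:
        if origin_region in members:
            # Everything in the same cluster as the origin sees the update.
            return sorted(p for region in members for p in fleet[region])
    return []


fleet = {
    "ams": ["ams-1", "ams-2"],
    "nrt": ["nrt-1"],
    "iad": ["iad-1", "iad-2"],
}

single_global = [set(fleet)]                  # one cluster spans every region
regionalized = [{region} for region in fleet]  # one cluster per region

print(affected_proxies(fleet, single_global, "ams"))
# ['ams-1', 'ams-2', 'iad-1', 'iad-2', 'nrt-1']
print(affected_proxies(fleet, regionalized, "ams"))
# ['ams-1', 'ams-2']
```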
Fly.io's explicit framing¶
From sources/2025-10-22-flyio-corrosion:
"After the contagious deadlock bug, we concluded we need to evolve past a single cluster. So we took on a project we call 'regionalization', which creates a two-level database scheme. Each region we operate in runs a Corrosion cluster with fine-grained data about every Fly Machine in the region. The global cluster then maps applications to regions, which is sufficient to make forwarding decisions at our edge proxies. Regionalization reduces the blast radius of state bugs."
Crucially: "Nothing about Corrosion's design required us to do this." The single global domain was an operational default, not a protocol constraint. Regionalization is entirely a deployment-shape change.
Generalisation¶
The pattern generalises beyond Fly.io: any state-distribution system that started as a single global cluster and has grown large enough that one bad update can cause a fleet-wide incident should consider regionalization as an architectural lever:
- Service meshes — regional control planes with a thin global federation.
- CDN configuration distribution — per-PoP config stores with a global name → PoP-list mapping.
- Feature flag stores — regional stores, with only the flags that must apply across regions federated globally.
See patterns/two-level-regional-global-state for the canonical pattern page.
Seen in¶
- sources/2025-10-22-flyio-corrosion — canonical primary source. "Regionalization reduces the blast radius of state bugs."
Related¶
- concepts/blast-radius — the concept regionalization operationalizes.
- concepts/contagious-deadlock — the 2024-09-01 failure mode that motivated the project.
- patterns/two-level-regional-global-state — the canonical pattern page.
- systems/corrosion-swim — the state-distribution system being regionalized.