Corrosion / corrosion2 (Fly SWIM-gossip CRDT-SQLite database)¶
Corrosion (and its successor corrosion2) is Fly.io's
state-distribution system — a Rust service that does
SWIM gossip to propagate Machine / routing state across
Fly's global worker fleet, persisting each node's view in a
CRDT-structured SQLite database. Every
component on the Fly fleet (fly-proxy, among
others) can run local SQLite queries to get near-real-time
information about any Fly Machine
around the world.
Open-source repo: superfly/corrosion.
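To make the "local SQLite queries" read path concrete, here is a minimal sketch of what a consumer-side lookup could look like. The `machines` table, its columns, and the sample rows are hypothetical (the sources do not disclose Corrosion's actual schema), and an in-memory database stands in for the node's local replica.

```python
import sqlite3

# Hypothetical schema: the real Corrosion tables are not described in the sources.
SCHEMA = """
CREATE TABLE machines (
    machine_id TEXT PRIMARY KEY,
    app        TEXT NOT NULL,
    worker     TEXT NOT NULL,
    region     TEXT NOT NULL,
    state      TEXT NOT NULL
)
"""

# Made-up rows standing in for state gossiped from other workers.
SAMPLE_ROWS = [
    ("m1", "web", "worker-ams-1", "ams", "started"),
    ("m2", "web", "worker-ord-1", "ord", "stopped"),
    ("m3", "web", "worker-syd-1", "syd", "started"),
    ("m4", "api", "worker-ams-1", "ams", "started"),
]


def open_local_replica() -> sqlite3.Connection:
    """Stand-in for a node's local replica (in-memory for this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO machines VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def started_machines(conn: sqlite3.Connection, app: str) -> list:
    """Return (machine_id, worker, region) for every started Machine of an app."""
    cursor = conn.execute(
        "SELECT machine_id, worker, region FROM machines "
        "WHERE app = ? AND state = 'started' ORDER BY machine_id",
        (app,),
    )
    return cursor.fetchall()
```

The point of the sketch is only the shape of the interface: a consumer such as a router asks an ordinary SQL question of a local file and never talks to a remote coordinator on the read path.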
Invariant: worker = source of truth¶
Pre-migration, Corrosion relied on an invariant: "workers are the source of truth for information about the Fly Machines running on them." Each worker's Corrosion node writes Machine records for the Machines it hosts; the rest of the fleet subscribes.
Migration breaks the invariant¶
From Making Machines Move:
"Migration knocks the legs out from under that constraint, which we were relying on in Corrosion, the SWIM-gossip SQLite database we use to connect Fly Machines to our request routing. Race conditions. Debugging. Design changes."
Fly flags this as a teaser: "Corrosion deserves its own post." Before that post appeared (see sources/2025-10-22-flyio-corrosion under Seen in), the only public detail was that migration forces Corrosion to cope with a Machine whose location is transiently ambiguous (the source worker is still alive while the original Machine is being killed, and the target worker is booting the clone) and that design changes were required.
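The transient ambiguity can be illustrated with a small query. This is a sketch of the problem shape only, not of how Corrosion resolves it (the sources do not say); the `machine_claims` table, its columns, and the worker names are hypothetical.

```python
import sqlite3


def ambiguous_machines(conn: sqlite3.Connection) -> list:
    """Machine IDs that more than one worker currently claims to host."""
    cursor = conn.execute(
        "SELECT machine_id FROM machine_claims "
        "GROUP BY machine_id "
        "HAVING COUNT(DISTINCT worker) > 1 "
        "ORDER BY machine_id"
    )
    return [row[0] for row in cursor]


# Hypothetical table: one row per (Machine, worker that claims to host it).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE machine_claims ("
    " machine_id TEXT, worker TEXT, state TEXT,"
    " PRIMARY KEY (machine_id, worker))"
)
conn.executemany(
    "INSERT INTO machine_claims VALUES (?, ?, ?)",
    [
        ("m1", "worker-src", "stopping"),  # original, being killed
        ("m1", "worker-dst", "starting"),  # clone, booting on the target
        ("m2", "worker-src", "started"),   # not migrating: a single claimant
    ],
)
```

Under the pre-migration invariant every Machine has exactly one claimant, so the query returns nothing; during a migration window the migrating Machine shows up, which is the state a router reading this data has to tolerate.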
Seen in¶
- sources/2024-07-30-flyio-making-machines-move — Named as a "gnarlier example" of migration complications. Anchor source pending a dedicated Fly blog post on Corrosion.
- sources/2025-02-12-flyio-the-exit-interview-jp-phillips —
JP Phillips's "most impressive thing someone else built
here", naming
corrosion2 as the v2 redesign ("we deployed corrosion, learned from it, and were able to make significant and valuable improvements — and then migrate to the new system in a short period of time") and giving the first wiki-quality one-paragraph architectural sketch: "corrosion2 is our state distribution system. While flyd runs individual Fly Machines for users, each instance is solely responsible for its own state; there's no global scheduler. But we have platform components, most obviously fly-proxy, our Anycast router, that need to know what's running where. corrosion2 is a Rust service that does SWIM gossip to propagate information from each worker into a CRDT-structured SQLite database. corrosion2 essentially means any component on our fleet can do SQLite queries to get near-real-time information about any Fly Machine around the world."
JP's framing also surfaces the external-adoption gap:
"If we invested in Antithesis or TLA+ testing, I think there's potential for other companies to get value out of corrosion2."
That's a canonical wiki callout for
formal-methods /
deterministic-simulation validation as the gate between
"works at Fly.io scale" and "safe for external
production." JP's engineering framing: "Having a 'just
SQLite' interface, for async replicated changes around the
world in seconds, it's pretty powerful."
- sources/2025-05-28-flyio-parking-lot-ffffffffffffffff —
Architectural role clarified as the RIB in a RIB/FIB
pairing with fly-proxy's in-memory
Catalog (the FIB): "In somewhat the same sense as a
router works both with a RIB and a FIB, there is in
fly-proxy a system of record for routing information
(Corrosion), and then an in-memory aggregation of that
information used to make fast decisions." Update
propagation latency bound: "millisecond intervals of
time" host-to-host. The 2024 global Anycast outage was
triggered when a Corrosion update about an app nobody
used propagated fleet-wide and hit an
if let lock-scope bug
in fly-proxy's Catalog reader — canonical blast-radius
instance motivating fly-proxy's regionalization. Corrosion
itself is not at fault in that outage (the bug was in the
consumer); the post uses Corrosion as context for the
anycast-state-distribution problem shape.
- sources/2025-10-22-flyio-corrosion — canonical primary
source, the dedicated Corrosion deep-dive Fly.io had been
promising for over a year ("Corrosion deserves its own
post"). Fills in the previously-missing architectural
detail + disclosure of the outage catalogue + the
regionalization response. New disclosure summary:
- Protocol inspiration: OSPF
(concepts/link-state-routing-protocol) — "routers are
sources of truth for their own links and responsible for
quickly communicating changes to every other router, so
the network can make forwarding decisions." Fly's
global fully-connected WireGuard mesh means OSPF's
connectivity-bootstrap is free; "all we need to do is
gossip efficiently."
- Stack: SWIM membership +
QUIC for broadcast/reconciliation + systems/cr-sqlite
("the CRDT SQLite extension") for conflict resolution.
Changes logged to crsql_changes table, applied
last-write-wins by logical
timestamp (causal not wall-clock). cr-sqlite CRDT
design: superfly/corrosion — crdts.md.
- No distributed consensus, deliberately. Canonical
wiki anchor for
the "face-seeking rake" framing: "truly global
distributed consensus promises deliciousness while
yielding only immolation. Consensus protocols like Raft
break down over long distances." Rejected alternatives
named: Consul ("don't build a global
routing system on it"), Zookeeper, etcd, Raft,
rqlite ("came very close to
using"), FoundationDB, S3-backed stores.
- Outage catalogue disclosed (three):
1. 2024-09-01 contagious deadlock — the parking_lot
RwLock double-free wake-up bug in
fly-proxy's Catalog, triggered
by a Corrosion update about an app nobody used. "The
worst outage we've experienced." Canonical
concepts/contagious-deadlock instance. Corrosion
was a bystander.
2. Nullable-column DDL apocalypse — a trivial schema
change to a CRDT table forced cr-sqlite to backfill
every row, gossiping fleet-wide simultaneously.
Canonical concepts/nullable-column-backfill-amplification.
3. Consul mTLS cert expiry — severed Consul; every
worker's backoff loop retried Consul, each retry
re-invoked a Machine-state code path that wrote to
Corrosion. Fleet-wide
uplink saturation. Fly.io "apologizes to our uplink
providers."
- Mitigations / iteration: (i) Tokio watchdogs on every
service (patterns/watchdog-bounce-on-deadlock);
(ii) production adoption of
Antithesis
— "killer for distributed systems" — first-person
confirmation of the investment JP Phillips flagged as
the external-adoption gate; (iii) checkpoint backups on
object storage (patterns/checkpoint-backup-to-object-storage)
— used "ultimately" to reboot the cluster from
snapshot when diagnosis took longer than restore;
(iv) eliminated partial updates —
whole-object republish
with no-op filtering at the CRDT layer — "we should have
done it this way to begin with";
(v) regionalization project
(patterns/two-level-regional-global-state,
concepts/regionalization-blast-radius-reduction) —
per-region clusters + small global cluster mapping apps
to regions, in-progress at time of publication.
- Scope discipline: "not every piece of state we
manage needs gossip propagation" — systems/tkdb
(Macaroon authority) + systems/petsem (Vault
replacement) run on systems/litefs / systems/litestream,
not Corrosion. Cross-reference with
sources/2025-03-27-flyio-operationalizing-macaroons.
- Open source: github.com/superfly/corrosion.
Authored by Jérôme Gravel-Niquet; iteration led by
Somtochi Onyekwere and Peter Cai.
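The "last-write-wins by logical timestamp" rule from the Stack bullet above can be sketched as a tiny merge function. This is an illustration of the general technique, not cr-sqlite's code: the `Change` and `Replica` names, the field layout, and the tie-break on `site_id` are assumptions made for the example, and real `crsql_changes` rows carry more metadata than shown here.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

CellKey = Tuple[str, str, str]  # (table, primary key, column); hypothetical layout


@dataclass(frozen=True)
class Change:
    key: CellKey
    value: object
    version: int   # logical clock, bumped on every write; not wall-clock time
    site_id: str   # originating node; used here only to break ties (assumption)


class Replica:
    """One node's view: the winning change for each cell."""

    def __init__(self) -> None:
        self.cells: Dict[CellKey, Change] = {}

    def apply(self, change: Change) -> bool:
        """Merge one change; return True only if local state actually changed."""
        current = self.cells.get(change.key)
        if current is not None and (change.version, change.site_id) <= (
            current.version,
            current.site_id,
        ):
            return False  # stale or duplicate: a no-op, nothing to re-gossip
        self.cells[change.key] = change
        return True
```

Because the comparison is a total order over `(version, site_id)`, two replicas that receive the same changes in different orders end up with the same cells, and re-delivering a change is harmless. The boolean return value is also the hook where a no-op filter, like the one described under Mitigations, could decide not to re-broadcast.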
Complementary to flyd's storage choice¶
flyd uses BoltDB (key-value, no SQL) for authoritative per-worker state; corrosion2 uses CRDT-SQLite for read-side state distribution to platform consumers. Same Fly.io stack, deliberately different query surfaces. See concepts/bolt-vs-sqlite-storage-choice for the decomposition — corrosion2 is the canonical wiki instance of the SQL-on-the-read-side pick.
Caveats / open questions¶
- SWIM parameters (fanout, heartbeat interval, indirect-ping neighbor count) + QUIC layering + cr-sqlite replication topology still undisclosed at mechanism level — referenced by name, not specified.
- How Corrosion integrates with flyd's BoltDB FSMs is not covered in any source.
- Regionalization is in-progress — no region count, no rollout timeline, no pre/post blast-radius numbers disclosed.
- Corrosion vs corrosion2 version mechanics remain fuzzy across sources (2025-02-12 named corrosion2 as a separate v2 redesign; 2025-10-22 uses just "Corrosion").
- CRDT-family limitations on unique constraints, referential integrity, and schema migration primitives are not systematically discussed.
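For readers unfamiliar with what the undisclosed SWIM parameters control, the following sketch shows where each one would appear in a generic SWIM-style probe round. It follows the textbook protocol shape (direct ping, then indirect pings through helpers), not Corrosion's implementation; every name and every numeric value is hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class SwimParams:
    heartbeat_interval_s: float  # time between probe rounds (not exercised below)
    indirect_ping_count: int     # helpers asked to ping the target on our behalf
    gossip_fanout: int           # peers each update is forwarded to per round


def probe(
    prober: str,
    target: str,
    members: Sequence[str],
    can_reach: Callable[[str, str], bool],
    params: SwimParams,
    rng: random.Random,
) -> str:
    """One probe of `target`: returns "alive" or "suspect"."""
    if can_reach(prober, target):
        return "alive"
    # The direct ping failed: ask a few other members to try instead.
    candidates = [m for m in members if m not in (prober, target)]
    helpers = rng.sample(candidates, min(params.indirect_ping_count, len(candidates)))
    for helper in helpers:
        if can_reach(prober, helper) and can_reach(helper, target):
            return "alive"
    return "suspect"


def pick_gossip_targets(
    sender: str, members: Sequence[str], params: SwimParams, rng: random.Random
) -> List[str]:
    """Choose up to `gossip_fanout` random peers to forward an update to."""
    peers = [m for m in members if m != sender]
    return rng.sample(peers, min(params.gossip_fanout, len(peers)))
```

Larger indirect-ping counts reduce false suspicion when a single link is flaky, and larger fanout spreads updates in fewer rounds at the cost of more traffic; which trade-offs Corrosion actually makes is exactly what the sources leave unspecified.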
Related¶
- systems/flyd — The orchestrator whose FSMs write Machine state; Corrosion presumably consumes flyd's Machine events.
- systems/fly-proxy — Primary consumer of Corrosion-state for request routing; FIB to Corrosion's RIB.
- systems/cr-sqlite — The CRDT SQLite extension under Corrosion.
- systems/consul — The rejected predecessor.
- concepts/gossip-protocol
- concepts/crdt
- concepts/last-write-wins
- concepts/link-state-routing-protocol
- concepts/rib-fib-routing
- concepts/no-distributed-consensus
- concepts/contagious-deadlock
- concepts/nullable-column-backfill-amplification
- concepts/uplink-saturation-from-backoff
- concepts/regionalization-blast-radius-reduction
- patterns/watchdog-bounce-on-deadlock
- patterns/antithesis-multiverse-debugging
- patterns/checkpoint-backup-to-object-storage
- patterns/eliminate-partial-updates
- patterns/two-level-regional-global-state
- patterns/crdt-over-raft-for-wan-state-distribution