SYSTEM Cited by 4 sources

Corrosion / corrosion2 (Fly SWIM-gossip CRDT-SQLite database)

Corrosion (and its successor corrosion2) is Fly.io's state-distribution system — a Rust service that does SWIM gossip to propagate Machine / routing state across Fly's global worker fleet, persisting each node's view in a CRDT-structured SQLite database. Every component on the Fly fleet (fly-proxy, others) can run local SQLite queries to get near-real-time information about any Fly Machine around the world.
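As a rough illustration of what "local SQLite queries" could look like for a consumer such as fly-proxy, here is a minimal sketch using Python's standard sqlite3 module. The machines table, its columns, and the sample rows are hypothetical; the sources do not disclose Corrosion's actual schema, and an in-memory database stands in for the node-local file.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration,
# not Corrosion's real layout.
conn = sqlite3.connect(":memory:")  # stands in for the node-local database
conn.execute(
    "CREATE TABLE machines ("
    "id TEXT PRIMARY KEY, app TEXT, worker TEXT, region TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO machines VALUES (?, ?, ?, ?, ?)",
    [
        ("m1", "web", "worker-a", "ord", "started"),
        ("m2", "web", "worker-b", "fra", "started"),
        ("m3", "web", "worker-c", "syd", "stopped"),
    ],
)


def running_machines(conn, app):
    """Return (machine id, worker, region) for every started Machine of one app."""
    rows = conn.execute(
        "SELECT id, worker, region FROM machines "
        "WHERE app = ? AND state = 'started' ORDER BY id",
        (app,),
    )
    return rows.fetchall()


targets = running_machines(conn, "web")
```

The point of the sketch is only that the consumer issues an ordinary read against a local database; replication happens out of band through gossip.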

Open-source repo: superfly/corrosion.

Invariant: worker = source of truth

Pre-migration, Corrosion relied on an invariant: "workers are the source of truth for information about the Fly Machines running on them." Each worker's Corrosion node writes Machine records for the Machines it hosts; the rest of the fleet subscribes.

Migration breaks the invariant

From Making Machines Move:

Migration knocks the legs out from under that constraint, which we were relying on in Corrosion, the SWIM-gossip SQLite database we use to connect Fly Machines to our request routing. Race conditions. Debugging. Design changes.

Fly flags this as a teaser: "Corrosion deserves its own post." Without that post, we know only that migration forces Corrosion to cope with a Machine whose location is transiently ambiguous (the source worker is still alive while the original Machine is being killed, and the target worker is booting the clone) and that design changes were required.
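The problem shape can be sketched in a few lines. This is an illustration of the ambiguity only, not Fly's fix: the MachineRecord type, the state names, and the locations helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineRecord:
    machine_id: str
    worker: str   # the worker that wrote this record
    state: str    # hypothetical state names, chosen for the example


def locations(records, machine_id):
    """Workers that currently claim a live copy of the given Machine."""
    return sorted(
        r.worker
        for r in records
        if r.machine_id == machine_id and r.state != "destroyed"
    )


# Before migration: one worker is the only writer for m1, so the answer is unique.
view = [MachineRecord("m1", "worker-a", "started")]
before = locations(view, "m1")

# Mid-migration: the source is still winding down the original while the
# target is already booting the clone, so two workers claim the same Machine.
view = [
    MachineRecord("m1", "worker-a", "stopping"),
    MachineRecord("m1", "worker-b", "starting"),
]
during = locations(view, "m1")
```

Any consumer that assumed "one Machine, one worker" has to decide what to do when the second query returns two workers.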

Seen in

  • sources/2024-07-30-flyio-making-machines-move — Named as a "gnarlier example" of migration complications. Anchor source pending a dedicated Fly blog post on Corrosion.
  • sources/2025-02-12-flyio-the-exit-interview-jp-phillips — JP Phillips's "most impressive thing someone else built here", naming corrosion2 as the v2 redesign ("we deployed corrosion, learned from it, and were able to make significant and valuable improvements — and then migrate to the new system in a short period of time") and giving the first wiki-quality one-paragraph architectural sketch:

    corrosion2 is our state distribution system. While flyd runs individual Fly Machines for users, each instance is solely responsible for its own state; there's no global scheduler. But we have platform components, most obviously fly-proxy, our Anycast router, that need to know what's running where. corrosion2 is a Rust service that does SWIM gossip to propagate information from each worker into a CRDT-structured SQLite database. corrosion2 essentially means any component on our fleet can do SQLite queries to get near-real-time information about any Fly Machine around the world.

JP's framing also surfaces the external-adoption gap:

If we invested in Antithesis or TLA+ testing, I think there's potential for other companies to get value out of corrosion2.

That's a canonical wiki callout for formal-methods / deterministic-simulation validation as the gate between "works at Fly.io scale" and "safe for external production." JP's engineering framing: "Having a 'just SQLite' interface, for async replicated changes around the world in seconds, it's pretty powerful."

  • sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Architectural role clarified as the RIB in a RIB/FIB pairing with fly-proxy's in-memory Catalog (the FIB): "In somewhat the same sense as a router works both with a RIB and a FIB, there is in fly-proxy a system of record for routing information (Corrosion), and then an in-memory aggregation of that information used to make fast decisions." Update propagation latency bound: "millisecond intervals of time" host-to-host. The 2024 global Anycast outage was triggered when a Corrosion update about an app nobody used propagated fleet-wide and hit an if let lock-scope bug in fly-proxy's Catalog reader — canonical blast-radius instance motivating fly-proxy's regionalization. Corrosion itself is not at fault in that outage (the bug was in the consumer); the post uses Corrosion as context for the anycast-state-distribution problem shape.
  • sources/2025-10-22-flyio-corrosion — canonical primary source, the dedicated Corrosion deep-dive Fly.io had been promising for over a year ("Corrosion deserves its own post"). Fills in the previously-missing architectural detail + disclosure of the outage catalogue + the regionalization response. New disclosure summary:
      - Protocol inspiration: OSPF (concepts/link-state-routing-protocol) — "routers are sources of truth for their own links and responsible for quickly communicating changes to every other router, so the network can make forwarding decisions." Fly's global fully-connected WireGuard mesh means OSPF's connectivity-bootstrap is free; "all we need to do is gossip efficiently."
      - Stack: SWIM membership + QUIC for broadcast/reconciliation + systems/cr-sqlite ("the CRDT SQLite extension") for conflict resolution. Changes logged to crsql_changes table, applied last-write-wins by logical timestamp (causal not wall-clock). cr-sqlite CRDT design: superfly/corrosion — crdts.md.
      - No distributed consensus, deliberately. Canonical wiki anchor for the "face-seeking rake" framing: "truly global distributed consensus promises deliciousness while yielding only immolation. Consensus protocols like Raft break down over long distances." Rejected alternatives named: Consul ("don't build a global routing system on it"), Zookeeper, etcd, Raft, rqlite ("came very close to using"), FoundationDB, S3-backed stores.
      - Outage catalogue disclosed (three):
          1. 2024-09-01 contagious deadlock — the parking_lot RwLock double-free wake-up bug in fly-proxy's Catalog, triggered by a Corrosion update about an app nobody used. "The worst outage we've experienced." Canonical concepts/contagious-deadlock instance. Corrosion was a bystander.
          2. Nullable-column DDL apocalypse — a trivial schema change to a CRDT table forced cr-sqlite to backfill every row, gossiping fleet-wide simultaneously. Canonical concepts/nullable-column-backfill-amplification.
          3. Consul mTLS cert expiry — severed Consul; every worker's backoff loop retried Consul, each retry re-invoked a Machine-state code path that wrote to Corrosion. Fleet-wide uplink saturation. Fly.io "apologizes to our uplink providers."
      - Mitigations / iteration:
          (i) Tokio watchdogs on every service (patterns/watchdog-bounce-on-deadlock);
          (ii) production adoption of Antithesis — "killer for distributed systems" — first-person confirmation of the investment JP Phillips flagged as the external-adoption gate;
          (iii) checkpoint backups on object storage (patterns/checkpoint-backup-to-object-storage) — used "ultimately" to reboot the cluster from snapshot when diagnosis took longer than restore;
          (iv) eliminated partial updates — whole-object republish with no-op filtering at the CRDT layer — "we should have done it this way to begin with";
          (v) regionalization project (patterns/two-level-regional-global-state, concepts/regionalization-blast-radius-reduction) — per-region clusters + small global cluster mapping apps to regions, in-progress at time of publication.
      - Scope discipline: "not every piece of state we manage needs gossip propagation" — systems/tkdb (Macaroon authority) + systems/petsem (Vault replacement) run on systems/litefs/systems/litestream, not Corrosion. Cross-reference with sources/2025-03-27-flyio-operationalizing-macaroons.
      - Open source: github.com/superfly/corrosion. Authored by Jérôme Gravel-Niquet; iteration led by Somtochi Onyekwere and Peter Cai.
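The "last-write-wins by logical timestamp" rule named in the Stack item above can be sketched as a tiny in-memory replica. This is a generic LWW register per cell, not cr-sqlite's code: the Change and Replica types are hypothetical, and breaking version ties on the originating site id is an assumption made here only to keep the merge deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    key: tuple      # (table, primary key, column)
    value: object
    version: int    # logical (causal) counter, not wall-clock time
    site: str       # originating node id; tie-break is an assumption of this sketch


class Replica:
    """One node's view: the winning Change for each cell."""

    def __init__(self):
        self.cells = {}

    def apply(self, change):
        """Apply a change last-write-wins; return True if local state changed."""
        current = self.cells.get(change.key)
        if current is not None and (change.version, change.site) <= (
            current.version,
            current.site,
        ):
            return False  # stale or already-seen change: ignore it
        self.cells[change.key] = change
        return True

    def value(self, key):
        return self.cells[key].value


key = ("machines", "m1", "state")
old = Change(key, "starting", 1, "worker-a")
new = Change(key, "started", 2, "worker-a")

# Two replicas receive the same changes in opposite orders and still converge.
a, b = Replica(), Replica()
for change in (old, new):
    a.apply(change)
for change in (new, old):
    b.apply(change)

converged = a.value(key) == b.value(key) == "started"
redelivered = a.apply(new)  # gossip may deliver the same change twice
```

Because the merge depends only on the logical version and never on arrival order, gossip can deliver changes late, out of order, or more than once without any coordination between nodes, which is the property that lets the design avoid consensus.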

Complementary to flyd's storage choice

flyd uses BoltDB (key-value, no SQL) for authoritative per-worker state; corrosion2 uses CRDT-SQLite for read-side state distribution to platform consumers. Same Fly.io stack, deliberately different query surfaces. See concepts/bolt-vs-sqlite-storage-choice for the decomposition — corrosion2 is the canonical wiki instance of the SQL-on-the-read-side pick.
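To make the "different query surfaces" contrast concrete, here is a small sketch under stated assumptions. A Python dict of byte keys stands in for a key-value bucket (BoltDB itself is a Go library and is not used here), and the key layout, JSON encoding, and machines table are all hypothetical.

```python
import json
import sqlite3

# Key-value surface: fast point lookups by key, values are opaque bytes.
kv = {
    b"machine/m1": json.dumps({"app": "web", "state": "started"}).encode(),
    b"machine/m2": json.dumps({"app": "api", "state": "started"}).encode(),
}
m1 = json.loads(kv[b"machine/m1"])

# Asking "which Machines belong to app web?" means scanning and decoding
# every value in application code.
web_kv = sorted(
    key.decode().split("/")[1]
    for key, raw in kv.items()
    if json.loads(raw)["app"] == "web"
)

# SQL surface: the same question is a declarative query any consumer can write.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE machines (id TEXT PRIMARY KEY, app TEXT, state TEXT)")
db.executemany(
    "INSERT INTO machines VALUES (?, ?, ?)",
    [("m1", "web", "started"), ("m2", "api", "started")],
)
web_sql = [
    row[0]
    for row in db.execute(
        "SELECT id FROM machines WHERE app = ? ORDER BY id", ("web",)
    )
]
```

The key-value side suits a single owner that already knows its keys; the SQL side suits many readers asking questions the writer did not anticipate.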

Caveats / open questions

  • SWIM parameters (fanout, heartbeat interval, indirect-ping neighbor count) + QUIC layering + cr-sqlite replication topology still undisclosed at mechanism level — referenced by name, not specified.
  • How Corrosion integrates with flyd's BoltDB FSMs is not covered in any source.
  • Regionalization is in-progress — no region count, no rollout timeline, no pre/post blast-radius numbers disclosed.
  • Corrosion vs corrosion2 version mechanics remain fuzzy across sources (2025-02-12 named corrosion2 as a separate v2 redesign; 2025-10-22 uses just "Corrosion").
  • CRDT-family limitations on unique constraints, referential integrity, schema migration primitives are not systematically discussed.