
FLYIO 2025-10-22 Tier 3


Fly.io — Corrosion

Summary

Fly.io's canonical introduction post for Corrosion — the state-distribution system, written in Rust, that propagates a SQLite database across Fly.io's global worker fleet via a gossip protocol. Open-sourced at superfly/corrosion. The post is framed around three hard-earned outage stories: the 2024-09-01 contagious deadlock that took down Anycast globally; a nullable-column DDL on a large CRDT table that melted the cluster by triggering fleet-wide backfill; and a Consul mTLS certificate expiry whose backoff-loop side effects saturated Fly.io's uplinks. Core architectural bet: no distributed consensus (explicitly rejects Raft-family designs over WAN); instead take cues from link-state routing protocols like OSPF, where each router (worker) is the source of truth for its own links (Fly Machines) and responsible for flooding changes. Stack: SWIM membership + QUIC for broadcast/reconciliation + cr-sqlite (CRDT SQLite extension) for last-write-wins-by-logical-timestamp conflict resolution. Convergence in seconds. Consumers read the local SQLite with ordinary SQL queries. Regionalization project now underway: each region runs its own Corrosion cluster with fine-grained Machine data; a global cluster only maps applications to regions. Goal: reduce the blast radius of state bugs to a single region.

Key takeaways

  1. State synchronization is the hardest problem in running Fly.io's platform. "The hardest part of running this platform isn't managing the servers, and it isn't operating the network; it's gluing those two things together." Several times per second, as customer CI/CD pipelines bring Machines up/down, Corrosion blasts updates across the internal mesh so edge proxies from Tokyo to Amsterdam keep an accurate routing table.
  2. Distributed systems are blast amplifiers. "By propagating data across a network, they also propagate bugs in the systems that depend on that data. In the case of Corrosion, our state distribution system, those bugs propagate quickly." Canonical framing for concepts/contagious-deadlock + concepts/blast-radius.
  3. Why no distributed consensus. Fly.io's orchestration model inverts the Kubernetes norm — "individual servers are the source of truth for their workloads." The central API bids work to a "global market of competing 'worker' physical servers." This scales across dozens of regions without bottlenecking on a database that demands both responsiveness and consistency between São Paulo, Virginia, and Sydney. The bidding model is elegant for placement but insufficient for request routing — an HTTP request in Tokyo still needs a global map to find the nearest instance in Sydney. Corrosion fills that gap without demanding global consensus.
  4. Truly global distributed consensus is a face-seeking rake. "Like an unattended turkey deep frying on the patio, truly global distributed consensus promises deliciousness while yielding only immolation. Consensus protocols like Raft break down over long distances." Previous attempt: HashiCorp Consul as a global routing system ("Consul is fantastic software. Don't build a global routing system on it"), then SQLite caches of Consul ("SQLite: also fantastic. But don't do this either"). Fly's Consul cluster "running on the biggest iron we could buy, wasted time guaranteeing consensus for updates that couldn't conflict in the first place." Canonical concepts/no-distributed-consensus.
  5. OSPF is the right mental model, not a database. "A protocol like OSPF has the same operating model and many of the same constraints we do. OSPF is a 'link-state routing protocol', which, conveniently for us, means that routers are sources of truth for their own links and responsible for quickly communicating changes to every other router, so the network can make forwarding decisions." Fly.io runs a global, fully connected WireGuard mesh between all servers, so the OSPF flooding-algorithm problem (connectivity between arbitrary routers) is already solved. "All we need to do is gossip efficiently."
  6. Stack: SWIM + QUIC + cr-sqlite. Gossip protocol is built on SWIM ("Start with the simplest, dumbest group membership protocol you can imagine: every node spams every node it learns about with heartbeats. Now, just two tweaks: first, each step of the protocol, spam a random subset of nodes, not the whole set. Then, instead of freaking out when a heartbeat fails, mark it 'suspect' and ask another random subset of neighbors to ping it for you. SWIM converges on global membership very quickly."). Post-membership, QUIC between cluster nodes broadcasts changes and reconciles state for new nodes. Conflict resolution via cr-sqlite — changes logged to a crsql_changes table, applied last-write-wins using logical timestamps (causal not wall-clock).
  7. What Corrosion does not do. "No locking, no central servers, and no distributed consensus. Instead, we exploit our orchestration model: workers own their own state, so updates from different workers almost never conflict." Consumers can open Corrosion with stock SQLite and read tables directly — "many customers of Corrosion's data don't even need to know it exists, just where the database is." No leader elections, no update-backlog metrics to chew nails over. "Fast as all get-out."
  8. Outage 1 (worst): 2024-09-01 contagious deadlock. "On September 1, 2024, at 3:30PM EST, a new Fly Machine came up with a new 'virtual service' configuration option a developer had just shipped. Within a few seconds every proxy in our fleet had locked up hard. It was the worst outage we've experienced: a period during which no end-user requests could reach our customer apps at all." The fly-proxy code that handled the Corrosion update had an if let over an RwLock — a notorious Rust concurrency footgun — that assumed in its else branch that the lock had been released. "Instant and virulently contagious deadlock." Corrosion was "just a bystander" — the bug was in the consumer. But gossip made the blast amplifier perfect.
  9. Outage 2: nullable-column DDL apocalypse. "You made a trivial-seeming schema change to a CRDT table hooked up to a global gossip system. Now, when the deploy runs, thousands of high-powered servers around the world join a chorus of database reconciliation messages that melts down the entire cluster." A team member added a nullable column to a large Corrosion table. "New nullable columns are kryptonite to large Corrosion tables: cr-sqlite needs to backfill values for every row in the table. It played out as if every Fly Machine on our platform had suddenly changed state simultaneously, just to fuck us." Canonical concepts/nullable-column-backfill-amplification.
  10. Outage 3: Consul cert expiry → saturated uplinks. Fly.io ran Corrosion and Consul side-by-side for resiliency. A Consul mTLS certificate expired; every worker in the fleet severed Consul, then entered backoff loops retrying Consul. Each retry re-invoked a code path that updated Fly Machine state — each such attempt incurred a Corrosion write. "By the time we've figured out what the hell is happening, we're literally saturating our uplinks almost everywhere in our fleet." Canonical concepts/uplink-saturation-from-backoff — a distributed system's worst failure mode is when a passive dependency accidentally becomes write-amplifying via retry feedback.
  11. Mitigation 1: Tokio watchdogs everywhere. "Our Tokio programs all have built-in watchdogs; an event-loop stall will bounce the service and make a king-hell alerting racket. Watchdogs have cancelled multiple outages. Minimal code, easy win. Do this in your own systems." patterns/watchdog-bounce-on-deadlock applied fleet-wide post-2024-09-01. (The parking_lot post covers the watchdog in fly-proxy specifically; this post generalises to every Fly.io Tokio program.)
  12. Mitigation 2: Antithesis / multiverse debugging. "We spent months looking for similar bugs with Antithesis. Again: do recommend. It retraced our steps on the parking_lot bug easily; the bug wouldn't have been worth the blog post if we'd been using Antithesis at the time. Multiverse debugging is killer for distributed systems." First production endorsement of Antithesis on this wiki, complementary to JP Phillips's 2025-02-12 "if we invested in Antithesis or TLA+" framing as the gate to external adoption of corrosion2. patterns/antithesis-multiverse-debugging.
  13. Mitigation 3: checkpoint backups on object storage. "We keep checkpoint backups of the Corrosion database on object storage. That was smart of us. When shit truly went haywire last year, we had the option to reboot the cluster, which is ultimately what we did. That eats some time (the database is large and propagating is expensive), but diagnosing and repairing distributed systems mishaps takes even longer." Canonical patterns/checkpoint-backup-to-object-storage as the break-glass option for a gossip-cluster meltdown.
  14. Mitigation 4: eliminate partial updates. "Until recently, any time a worker updated its local database, we published the same incremental update to Corrosion. But now we've eliminated partial updates. Instead, when a Fly Machine changes, we re-publish the entire data set for the Machine. Because of how Corrosion resolves changes to its own rows, the node receiving the re-published Fly Machine automatically filters out the no-op changes before gossiping them. Eliminating partial updates forecloses a bunch of bugs (and, we think, kills off a couple sneaky ones we've been chasing). We should have done it this way to begin with." Canonical patterns/eliminate-partial-updates — a CRDT-specific pattern where whole-object republish is safer than delta republish because no-op filtering happens at the CRDT layer. Community link: "Self-healing Machine state synchronization and service discovery".
  15. Mitigation 5: regionalization (two-level state). "After the contagious deadlock bug, we concluded we need to evolve past a single cluster. So we took on a project we call 'regionalization', which creates a two-level database scheme. Each region we operate in runs a Corrosion cluster with fine-grained data about every Fly Machine in the region. The global cluster then maps applications to regions, which is sufficient to make forwarding decisions at our edge proxies. Regionalization reduces the blast radius of state bugs. Most things we track don't have to matter outside their region (importantly, most of the code changes to what we track are also region-local). We can roll out changes to this kind of code in ways that, worst case, threaten only a single region." Canonical patterns/two-level-regional-global-state — the architectural response to a global-state single-cluster having proven to be an unacceptable blast radius. Importantly, nothing about Corrosion's design required the single global state domain — it was an operational default, not a protocol constraint.
  16. What Corrosion is not. "It doesn't rely on distributed consensus, like Consul, Zookeeper, Etcd, Raft, or rqlite (which we came very close to using). It doesn't rely on a large-scale centralized data store, like FoundationDB or databases backed by S3-style object storage. It's nevertheless highly distributed (each of thousands of workers run nodes), converges quickly (in seconds), and presents as a simple SQLite database." Three-shape comparison: consensus-protocol-family (Consul/ZK/etcd/Raft/rqlite), centralized-store-family (FoundationDB / S3-backed), and Corrosion's own SWIM+QUIC+CRDT-SQLite shape. "rqlite (which we came very close to using)" is explicit disclosure of the rejected nearest-neighbor.
  17. Scope discipline — not everything goes into Corrosion. "Part of what's making Corrosion work is that we're careful about what we put into it. Not every piece of state we manage needs gossip propagation. tkdb, the backend for our Macaroon tokens, is a much simpler SQLite service backed by Litestream. So is Pet Sematary, the secret store we built to replace HashiCorp Vault." Cross-references Operationalizing Macaroons's tkdb stack — not everything needs gossip.
  18. External applicability. "There are probably lots of distributed state problems that want something more like a link-state routing protocol and less like a distributed database. If you think you might have one of those, feel free to take Corrosion for a spin." Open-source invitation; much iteration has been led by Jérôme Gravel-Niquet (original author), Somtochi Onyekwere, and Peter Cai.
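The two-tweak SWIM recipe quoted in takeaway 6 can be sketched as a toy simulation. Everything here is invented for illustration (node names, the fanout constant, the bootstrap seeding, merging views on every heartbeat); the suspect/indirect-ping failure path the post describes is deliberately elided so the convergence behavior stays visible:

```python
import random

FANOUT = 3  # peers gossiped to per round (illustrative; the post discloses no constants)

class Node:
    """Toy SWIM-style member: each round it heartbeats a random subset of the
    peers it knows about, piggybacking its membership view on the heartbeat.
    (Real SWIM also marks unresponsive peers 'suspect' and asks other random
    neighbors to ping them indirectly before declaring them dead.)"""
    def __init__(self, name):
        self.name = name
        self.view = {name}  # who this node believes is in the cluster

    def gossip_round(self, cluster):
        peers = [p for p in self.view if p != self.name]
        for peer in random.sample(peers, min(FANOUT, len(peers))):
            # Heartbeat with piggybacked membership: merge views both ways.
            merged = self.view | cluster[peer].view
            self.view = merged
            cluster[peer].view = set(merged)

random.seed(7)
cluster = {f"worker-{i}": Node(f"worker-{i}") for i in range(20)}
for node in cluster.values():          # bootstrap: everyone knows one seed node
    node.view.add("worker-0")

rounds = 0
while any(len(n.view) < len(cluster) for n in cluster.values()):
    for node in cluster.values():
        node.gossip_round(cluster)
    rounds += 1

print(f"converged on full membership in {rounds} rounds")
```

Even with a single bootstrap peer and a fanout of 3, full membership converges in a handful of rounds — the "SWIM converges on global membership very quickly" claim in miniature.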
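The watchdog idea in takeaway 11 can be transposed from Tokio to Python's asyncio as a minimal sketch (threshold and intervals invented; the real version alerts and bounces the service rather than setting a flag): a heartbeat task pats the watchdog from inside the event loop, and an OS thread watches from outside, so a stalled or deadlocked loop is caught by code the stall cannot freeze.

```python
import asyncio, threading, time

STALL_LIMIT = 0.5  # seconds; illustrative -- the post discloses no thresholds

class Watchdog:
    def __init__(self):
        self.last_pat = time.monotonic()
        self.tripped = threading.Event()

    async def heartbeat(self):
        while True:
            self.last_pat = time.monotonic()  # can only run if the loop is live
            await asyncio.sleep(0.1)

    def monitor(self):
        # Runs on a plain OS thread, outside the event loop.
        while not self.tripped.is_set():
            if time.monotonic() - self.last_pat > STALL_LIMIT:
                self.tripped.set()            # real version: page + restart service
            time.sleep(0.05)

async def main():
    dog = Watchdog()
    threading.Thread(target=dog.monitor, daemon=True).start()
    hb = asyncio.ensure_future(dog.heartbeat())
    await asyncio.sleep(0.3)       # healthy: heartbeats flow, nothing trips
    assert not dog.tripped.is_set()
    time.sleep(1.0)                # simulated bug: a blocking call stalls the loop
    assert dog.tripped.is_set()    # caught from outside the frozen event loop
    hb.cancel()
    return dog

dog = asyncio.run(main())
```

The design point is the cross-thread observation: an in-loop watchdog would stall along with the deadlock it is supposed to detect.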
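The whole-record republish pattern in takeaway 14 can be sketched with a hand-rolled last-write-wins map standing in for cr-sqlite's crsql_changes machinery (all names, the record shape, and the logical-clock scheme here are invented): the worker re-publishes the entire Machine record, and the merge filters out no-op columns before anything would be gossiped.

```python
def merge_record(table, key, incoming, clock):
    """Apply a full re-published record; return only the real (gossip-worthy)
    changes, resolved last-write-wins by logical timestamp per column."""
    row = table.setdefault(key, {})
    changes = []
    for column, value in incoming.items():
        old = row.get(column)
        if old is not None and old[0] == value:
            continue                      # no-op column: filtered, never gossiped
        if old is not None and old[1] >= clock:
            continue                      # stale write loses LWW by logical time
        row[column] = (value, clock)
        changes.append((key, column, value, clock))
    return changes

machines = {}
# Initial full publish of a Machine record.
first = merge_record(machines, "machine-1",
                     {"state": "started", "region": "nrt", "image": "app:v1"},
                     clock=1)
# The Machine stops: the worker re-publishes the ENTIRE record even though
# only one column changed -- no delta bookkeeping on the publishing side.
second = merge_record(machines, "machine-1",
                      {"state": "stopped", "region": "nrt", "image": "app:v1"},
                      clock=2)
assert len(first) == 3                                    # everything is new at first
assert second == [("machine-1", "state", "stopped", 2)]   # no-ops filtered out
```

This is why whole-object republish forecloses delta-tracking bugs: correctness no longer depends on the publisher computing the right diff, because the CRDT layer discards the no-ops anyway.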

Architectural shape

  • Protocol inspiration: OSPF (link-state routing). Each worker owns its own links (Machines) and floods changes. Fly's existing global WireGuard mesh already provides the arbitrary router-to-router connectivity that OSPF's flooding algorithm has to establish for itself.
  • Membership substrate: SWIM (Scalable Weakly-consistent Infection-style process group Membership). Same substrate as HashiCorp Serf / Consul's membership layer. Gossip heartbeats to random subsets; indirect pings on suspect timeout; fast global-membership convergence. Previously documented at Fly in Building clusters with Serf.
  • Reconciliation transport: QUIC. Broadcast updates; reconcile state for new nodes joining.
  • Conflict resolution: cr-sqlite, the CRDT SQLite extension from vlcn-io. Row-level changes to CRDT-managed tables logged in crsql_changes. Updates applied last-write-wins using logical timestamps (causal ordering, not wall-clock). cr-sqlite CRDT details: crdts.md in superfly/corrosion.
  • Presentation: stock SQLite database. Consumers read with ordinary SQL. No API server, no RPC, no leader election. "Many customers of Corrosion's data don't even need to know it exists, just where the database is."
  • Pre-Corrosion ancestors at Fly.io:
  • HashiCorp Consul as global routing system ("don't build a global routing system on it").
  • SQLite caches of Consul (fly.io/blog/a-foolish-consistency: "don't do this either").
  • Corrosion (2022-ish through 2024-09-01 single global cluster).
  • corrosion2 + regionalization (2025 two-level regional + global clusters).
  • Authored by: Jérôme Gravel-Niquet; iteration led by Somtochi Onyekwere and Peter Cai.
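The "presents as a stock SQLite database" bullet can be made concrete with Python's stdlib sqlite3 and an invented miniature schema (the real tables are not disclosed). A consumer such as an edge proxy just runs ordinary SQL against the local database; the final step shows the kind of online checkpoint that a backup to object storage would start from.

```python
import sqlite3

# Invented miniature of the routing data; in-memory to keep the sketch self-contained.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE machines (
        machine_id TEXT PRIMARY KEY,
        app        TEXT,
        region     TEXT,
        state      TEXT
    );
    INSERT INTO machines VALUES
        ('m1', 'acme-web', 'nrt', 'started'),
        ('m2', 'acme-web', 'syd', 'started'),
        ('m3', 'acme-web', 'syd', 'stopped');
""")
db.commit()

# A consumer reads with plain SQL -- no API server, no RPC, no leader, and no
# knowledge that a gossip layer keeps this file up to date.
rows = db.execute(
    "SELECT machine_id, region FROM machines "
    "WHERE app = ? AND state = 'started' ORDER BY machine_id", ("acme-web",)
).fetchall()
assert rows == [("m1", "nrt"), ("m2", "syd")]

# Checkpoint backup (mitigation 3, sketched): SQLite's online backup API
# snapshots the live database; the resulting copy is what you would ship to
# object storage as the break-glass restore point.
checkpoint = sqlite3.connect(":memory:")
db.backup(checkpoint)
assert checkpoint.execute("SELECT COUNT(*) FROM machines").fetchone()[0] == 3
```

The same shape explains the "many customers of Corrosion's data don't even need to know it exists" quote: the consumer contract is a file path plus SQL, nothing more.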

Numbers + operational datapoints disclosed

  • Propagation convergence: "seconds" (same as the 2024-07-30 Making-Machines-Move post's framing).
  • Gossip cadence: "several times a second" as customer CI/CD pipelines bring Machines up/down.
  • Host-to-host update latency: the 2025-05-28 parking_lot post disclosed "millisecond intervals of time"; this post reaffirms the "within a few seconds every proxy in our fleet had locked up hard" figure on 2024-09-01.
  • Cluster size: "thousands of workers" run Corrosion nodes.
  • Node language: Rust ("Corrosion is a large part of what every engineer at Fly.io who writes Rust works on").
  • Single-cluster size was not numerically disclosed, but "the database is large and propagating is expensive" (checkpoint restore eats "some time").
  • Number of outages Corrosion has been involved in: 3+ discussed (2024-09-01 deadlock, nullable-column DDL, Consul cert expiry).

Numbers + operational details not disclosed

  • SWIM fanout parameter, heartbeat interval, suspect timeout, indirect-ping neighbor count — none of the membership-layer constants.
  • QUIC stream / connection strategy; application-layer framing; congestion-control choice.
  • cr-sqlite replication topology (full-mesh gossip? fan-out tree? hybrid?).
  • Schema or per-table conflict resolution behavior.
  • Regionalization cutover mechanics, region-sizing heuristics, or the exact global-cluster schema ("maps applications to regions" is the only description).
  • Pre/post regionalization blast-radius comparison numbers.
  • Antithesis-harness scope, bugs-caught count, or cycle time.
  • Checkpoint backup cadence or object-storage backend (presumably Tigris or S3, not stated).
  • Tokio watchdog internals (threshold, bounce policy, alert-channel — specific to fly-proxy's case in the parking_lot post but not generalised here).
  • Partial-update-elimination rollout date; no measurement of bug count reduction post-cutover.
  • corrosion vs corrosion2 version mechanics (2025-02-12 exit interview named corrosion2 as the v2 redesign; this post uses "Corrosion" without versioning).

Caveats

  • Voice is blog-retrospective + open-source-invitation, not RFC-or-paper-level technical depth. Many mechanism details (SWIM parameters, QUIC layering, cr-sqlite topology) are referenced by name rather than specified.
  • The 2024-09-01 contagious-deadlock narrative is already covered at greater depth in sources/2025-05-28-flyio-parking-lot-ffffffffffffffff (the parking_lot RwLock double-free wake-up bug). This post re-uses the narrative as motivation; does not add new bug-level detail.
  • The nullable-column backfill outage is newly disclosed here but not dated or quantified.
  • The Consul cert expiry → uplink saturation outage is newly disclosed here but not dated or quantified.
  • Regionalization is described as "unwinding that decision now" — in-progress, not complete. No region count, no rollout timeline.
  • Post explicitly declines to enumerate what cr-sqlite can't do beyond the nullable-column case; CRDT-family limitations (unique constraints, referential integrity, schema migration primitives) not discussed.
  • "Corrosion is a large part of what every engineer at Fly.io who writes Rust works on" is an oblique staffing-cost signal — "it wasn't easy getting here" — but no headcount or person-year figures given.

Relationship to existing wiki

Source
