FLYIO 2024-07-30

Fly.io — Making Machines Move

Summary

Fly.io's 2024-07-30 engineering post on the year-long rebuild of their fleet-drain capability for stateful Fly Machines — i.e. machines with Fly Volumes (locally-attached NVMe). Before local storage, a drain was "de-bird that edge server, tell Nomad to drain that worker, and go back to sleep": stateless apps relocate trivially because Fly can start a new instance elsewhere and kill the old one. Local NVMe, attached directly to the worker physical, broke this — a Fly App with an attached Volume is anchored to a particular worker physical, one bus hop from the data but now undrainable without either data loss (copy, boot, kill lets the original keep writing) or unacceptable downtime (kill, copy, boot blocks for minutes on multi-GB volumes). Customer HA doesn't help — Fly must live in the same world as its customers, many of whom run single-instance.

The fix is a new primitive: clone. Asynchronous, block-level, uses Linux's pre-existing dm-clone device-mapper target — reads of unhydrated blocks fall through to the source device over the network; writes to the clone don't touch the network; a kcopyd thread rehydrates in the background. The migration sequence becomes kill + clone + boot: clone returns immediately, the new Fly Machine boots attached to a clone volume with mostly empty blocks, and reads against unhydrated blocks transparently fetch from the original worker over a network block protocol. "kill, clone, boot is fast; it can be made asymptotically as fast as stateless migration."
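
The read/write semantics dm-clone provides can be caricatured in a few lines. This is a toy model, not the kernel implementation — region granularity, the hydration loop, and all names are illustrative only:

```python
# Toy model of dm-clone semantics: reads of unhydrated regions fall
# through to the (remote) source; writes land on the clone and mark the
# region hydrated; a background pass (kcopyd's role) copies the rest.
class CloneDevice:
    def __init__(self, source):
        self.source = source                  # stands in for the network-mounted source
        self.clone = [None] * len(source)     # local destination device
        self.hydrated = [False] * len(source) # the metadata bitmap

    def read(self, region):
        if not self.hydrated[region]:         # miss: fetch over the network
            self.clone[region] = self.source[region]
            self.hydrated[region] = True
        return self.clone[region]

    def write(self, region, data):
        self.clone[region] = data             # writes never touch the network
        self.hydrated[region] = True

    def discard(self, region):
        # DISCARD short-circuit: mark hydrated without copying anything
        self.hydrated[region] = True

    def hydrate_step(self):
        # one unit of background rehydration; None once fully hydrated
        for r, done in enumerate(self.hydrated):
            if not done:
                self.read(r)
                return r
        return None
```

In the real system the "device" is a device-mapper table entry and the source sits behind iSCSI over the WireGuard mesh; the model only shows why boot can precede transfer completion.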

Three moving pieces: (1) dm-clone on the target worker — already in Linux, takes a source device + metadata-bitmap device; (2) a network block protocol to mount the source volume remotely over Fly's WireGuard mesh — tried NBD first, got stuck kernel threads on network disruption, spiked iSCSI and switched; (3) orchestration logic in flyd — conceptually "worker-physicals become temporary SANs serving volumes to fresh-booted replica Fly Machines on target physicals", tracking state in flyd's BoltDB-backed FSMs.

The post is equally about the gnarly complications that the simple architectural story papers over:

  • Per-volume LUKS2 encryption with skew: Fly encrypts Volumes with per-volume keys. dm-clone needs plaintext access on the target worker to run fstrim and short-circuit hydration of unused blocks. "Two different workers, for cursed reasons, might be running different versions of cryptsetup … default to different LUKS2 header sizes — 4MiB and 16MiB." Different header size → different plaintext size. Fix: add an RPC to the migration FSM that carries target LUKS2 header metadata.
  • Corrosion (SWIM-gossip SQLite) assumed worker = source of truth for Machine location. Migration breaks that invariant — race conditions, debugging, design changes. "Corrosion deserves its own post."
  • 6PN (IPv6 Private Network) addresses embed routing to specific worker servers. Fly avoids a global routing protocol by baking the routing destination into the IPv6 address itself (IPv6 + WireGuard = peering). Migration changes the worker → requires a new 6PN address → would be solved by DNS except Fly Postgres cluster configs hardcoded literal IPv6 addresses. Fly first shipped an address-mapping feature in init to keep old addresses reachable, then "burned several weeks doing the direct configuration fix fleet-wide."
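
The LUKS2 header-size skew above is just subtraction, but it breaks the clone: source and target must agree on plaintext geometry. A sketch — the raw device size and the RPC field name are assumptions, the 4 MiB/16 MiB defaults are from the post:

```python
MIB = 1024**2
raw_size = 10 * 1024 * MIB        # 10 GiB raw LUKS2 device (size assumed)

def plaintext_size(raw, header):
    # usable plaintext = raw device minus the LUKS2 header/metadata area
    return raw - header

source = plaintext_size(raw_size, 16 * MIB)  # source worker's cryptsetup default
target = plaintext_size(raw_size, 4 * MIB)   # target worker's cryptsetup default
assert source != target           # naive clone: plaintext geometries disagree

# The fix: the migration FSM's RPC carries the source header geometry,
# so the target creates the clone with the source's plaintext size.
rpc = {"luks2_header_bytes": 16 * MIB}       # field name hypothetical
assert plaintext_size(raw_size, rpc["luks2_header_bytes"]) == source
```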

Post ends with a nod toward log-structured virtual disks (LSVD) as the medium-term storage direction — NVMe as local cache, S3-grade durability via object-storage persistence — and Fly's regional partner Tigris Data provides the local S3 backend. "By summer of 2024, we got to where our infra team can pull 'drain this host' out of their toolbelt without much ceremony." Automated rebalancing migrations are still gated off — "you're probably not getting migrated" unless there's a reason. Closes with the post's self-assessed engineering scale: "This is the biggest thing our team has done since we replaced Nomad with flyd. Only the new billing system comes close."

Key takeaways

  1. Local-NVMe storage trades operational simplicity for a bus-hop latency win. Every architectural complication in the post — the inability to drain, the need for dm-clone, the temporary-SAN shape, iSCSI, the LUKS2 header skew, the 6PN address-embedding problem — is a direct consequence of choosing local NVMe over an EBS-style SAN fabric three years earlier. "A benefit: a Fly App accessing a file on a Fly Volume is never more than a bus hop away from the data. A cost: a Fly App with an attached Volume is anchored to a particular worker physical." Canonical concepts/bus-hop-storage-tradeoff instance.

  2. dm-clone turns a synchronous copy into an asynchronous pull-on-miss. The architectural claim: there is no kill + copy + boot sequence fast enough for large volumes, and no copy + boot + kill sequence that doesn't lose data (the original keeps writing). clone decouples "availability of the destination" from "completion of the data transfer" — the destination is booted immediately with mostly-empty blocks, reads fall through to the source over the network, writes don't hit the network at all, and a background kcopyd thread rehydrates in parallel with user reads. Canonical concepts/block-level-async-clone instance at the kernel-device-mapper tier. Shape-parallel to Cloudflare Artifacts' async clone + hydration on Git repositories and Cloudflare's patterns/blobless-clone-lazy-hydrate — same architectural move at a different storage layer.

  3. NBD is not a fit for globally-distributed network block devices; iSCSI is. "We started out using nbd. But we kept getting stuck nbd kernel threads when there was any kind of network disruption. We're a global public cloud; network disruption happens. Honestly, we could have debugged our way through this. But it was simpler just to spike out an iSCSI implementation, observe that didn't get jammed up when the network hiccuped, and move on." Pragmatic call: trade kernel-module simplicity (NBD is easy to write a userspace server for) for production robustness under adversarial network conditions. Canonical wiki example of "you could have debugged it, but switching was cheaper" — a form of upstream-the-fix inversion where the fix is to stop using the upstream.

  4. DISCARD / TRIM is load-bearing for sparse-volume migrations. "Most people use just a small fraction of the volumes they allocate. A 100GiB volume with just 5MiB used wouldn't be at all weird. You don't want to spend minutes copying a volume that could have been fully hydrated in seconds." The target worker mounts the decrypted source volume, runs fstrim to identify unused blocks, issues DISCARD on the clone device → dm-clone marks those blocks as hydrated in the metadata bitmap without fetching them. Architecturally: use a filesystem-layer signal (TRIM) to short-circuit a block-layer copy operation. Canonical concepts/trim-discard-integration instance.

  5. Fleet-wide configuration skew is a migration tax. "Two different workers, for cursed reasons, might be running different versions of cryptsetup … default to different LUKS2 header sizes — 4MiB and 16MiB. Implying two different plaintext volume sizes." Migration across heterogeneous workers becomes impossible without metadata transfer. Fly's fix: an RPC in the migration FSM that carries the source's LUKS2 header configuration to the target, so the target worker creates the clone device with the right plaintext size. Canonical patterns/fsm-rpc-for-config-metadata-transfer instance — and a specific instance of the general concepts/heterogeneous-fleet-config-skew problem that any long-lived multi-host system accretes.

  6. Embedded-routing-in-address is fragile at migration time. Fly's 6PN — "IPv6 + routing information baked into the address so we can route diverse private networks with constantly changing membership across a global fleet without running a distributed routing protocol" — is load-bearing for Fly's operational simplicity but assumes the routing destination is stable. "Problem: the embedded routing information in a 6PN address refers in part to specific worker servers." Migration breaks the invariant: the new Machine gets a new 6PN address. DNS is the intended escape hatch, but "somebody did use literal IPv6 addresses. It was us. In the configurations for Fly Postgres clusters." Fly first shipped an in-init address-mapping compatibility feature, then bit the bullet and did a fleet-wide config rewrite. Canonical concepts/embedded-routing-in-ip-address instance + concepts/hardcoded-literal-address-antipattern — DNS is an integration contract, not a suggestion.

  7. Migration breaks "worker = source of truth for local Machines." Fly's Corrosion — a SWIM-gossip SQLite database used to connect Fly Machines to the request-routing tier — relied on the invariant that a worker knew definitively which Machines lived on it. "Migration knocks the legs out from under that constraint. Race conditions. Debugging. Design changes." Post-migration, a Machine's identity outlives its worker, and Corrosion had to catch up. (Fly flags this as deserving its own post.)

  8. LSVD (log-structured virtual disk) is the next architectural direction. "We're a lot more interested in log-structured virtual disks (LSVD). LSVD uses NVMe as a local cache, but durably persists writes in object storage." Fly launched LSVD experimentally in 2023; Tigris Data providing regional S3-compatible object storage in every Fly region means LSVD can write through to a local object store rather than backhauling to us-east-1. Shape shift: local-NVMe-as-durable → local-NVMe-as-cache-in-front-of-S3.

  9. "If you see Raft consensus in a design, we've done something wrong." The post's stated design heuristic. "When your problem domain is hard, anything you build whose design you can't fit completely in your head is going to be a fiasco … The virtue of this migration system is that, for as many moving pieces as it has, it fits in your head." The complexity lives in strategic investments Fly has already internalised (flyd FSMs), not in the migration protocol itself. Canonical concepts/simplicity-vs-velocity restatement.

  10. Scale-of-effort datum. "This is the biggest thing our team has done since we replaced Nomad with flyd. Only the new billing system comes close. We did this thing not because it was easy, but because we thought it would be easy." Roughly one year of engineering effort to reach "pull 'drain this host' out of their toolbelt without much ceremony" (by summer 2024). Still not to the point of automated rebalancing migrations.
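
The DISCARD arithmetic in takeaway 4 is worth making concrete. A sketch using the post's sparsity example (100 GiB allocated, 5 MiB used); the 4 MiB region size is illustrative (dm-clone's region size is configurable), and contiguous used data is assumed — fragmentation would touch more regions:

```python
# Hydration work with and without the fstrim/DISCARD short-circuit.
GIB, MIB = 1024**3, 1024**2
region = 4 * MIB                    # assumed hydration granularity

allocated = 100 * GIB               # volume size
used = 5 * MIB                      # live data

without_trim = allocated // region  # every region gets copied
with_trim = -(-used // region)      # ceil division: only live regions
print(without_trim, with_trim)      # 25600 regions vs 2
```

Same network, same kcopyd — the difference is purely that TRIM lets dm-clone mark 25,598 of 25,600 regions hydrated in the metadata bitmap without fetching them.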
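
Takeaway 6's shape — a routing destination packed into address bits — can also be sketched. The bit layout below is invented for illustration and is not Fly's actual 6PN format; only the idea (address implies worker, no routing protocol needed) carries over:

```python
import ipaddress

# Illustrative only: pack a worker id into bits 64..79 of a ULA-style
# /48 prefix so the data plane can route on the address alone.
PREFIX = ipaddress.ip_network("fdaa:0:1::/48").network_address

def addr_for(worker_id: int, machine_id: int) -> ipaddress.IPv6Address:
    return ipaddress.IPv6Address(int(PREFIX) | (worker_id << 64) | machine_id)

def worker_of(addr: ipaddress.IPv6Address) -> int:
    return (int(addr) >> 64) & 0xFFFF

a = addr_for(worker_id=7, machine_id=0x42)
assert worker_of(a) == 7   # routing falls out of the address bits

# The migration problem in one line: the Machine moves, but its old
# address still routes to worker 7 — hence the new 6PN address (or the
# init-side remapping bridge) after migration.
```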

Systems / concepts / patterns introduced

New systems

  • systems/fly-volumes — Fly.io's locally-attached NVMe volume primitive; the anchor point that forced this rebuild.
  • systems/dm-clone — Linux kernel device-mapper target; takes a source block device + metadata-bitmap device; reads of unhydrated blocks fall through to the source, writes go to the clone, kcopyd rehydrates in background.
  • systems/iscsi — Internet SCSI; the network block protocol Fly settled on after NBD stuck kernel threads under network disruption.
  • systems/nbd — Network Block Device; Fly's first attempt, abandoned.
  • systems/dm-crypt-luks2 — Linux dm-crypt + LUKS2 on-disk format; Fly uses per-volume encryption keys; heterogeneous cryptsetup defaults across workers meant heterogeneous LUKS2 header sizes.
  • systems/cryptsetup — userland bridge to dm-crypt + LUKS2; version skew across Fly's fleet is the direct cause of the header-size heterogeneity.
  • systems/linux-device-mapper — The Linux block-layer proxy mechanism; dm-linear, dm-stripe, dm-raid1, dm-snap, dm-verity, dm-clone, dm-crypt are all DM targets.
  • systems/corrosion-swim — Fly's SWIM-gossip SQLite database for Machine-to-router routing state; migration broke its worker-as-source-of-truth invariant.
  • systems/lsvd — Log-structured virtual disk; Fly's stated medium-term direction for storage — NVMe local cache + S3 persistent store.
  • systems/nomad — Fly's prior orchestrator, referenced as the substrate from which flyd was carved; its drain operation is what Fly spent a year getting back for stateful Machines.

New concepts

New patterns

  • patterns/async-block-clone-for-stateful-migration — The end-to-end migration recipe: kill + clone + boot with network block protocol + dm-clone + kcopyd background hydration.
  • patterns/temporary-san-for-fleet-drain — The fleet-level shape: "turn workers into temporary SANs serving the volumes we need to drain to fresh-booted replica Fly Machines"; viable under network disruption because of iSCSI's resilience.
  • patterns/embedded-routing-header-as-address — 6PN pattern: embed routing information in IPv6 addresses to avoid running a distributed routing protocol, at the cost of requiring address rewrites on migration.
  • patterns/fsm-rpc-for-config-metadata-transfer — The migration-FSM RPC that carries LUKS2 header metadata from source to target worker so the target can create a clone device with the correct plaintext size; generalises to any fleet-heterogeneous per-device configuration that must travel with the workload.
  • patterns/feature-gate-pre-migration-network-rewrite — The 6PN address-rewrite dance Fly did for Fly Postgres: ship an address-mapping compatibility feature in the guest (init) first, then do the fleet-wide config rewrite; the init-side mapping is the bridge that makes the config rewrite non-disruptive.

Operational numbers

  • Before local storage: drain took "a handful of minutes" (stateless workers, 2020 scale).
  • With NVMe before clone: drain interruption hits "minutes, especially if you're moving lots of volumes simultaneously".
  • Typical sparsity: "A 100GiB volume with just 5MiB used wouldn't be at all weird" — motivates DISCARD short-circuiting.
  • LUKS2 header size skew: 4 MiB vs. 16 MiB across two cryptsetup versions on the fleet.
  • Effort scale: ~1 year of engineering from drain-is-broken-for-stateful (post-NVMe-launch) to "pull 'drain this host' out of their toolbelt without much ceremony" (summer 2024).
  • Positioning: "This is the biggest thing our team has done since we replaced Nomad with flyd. Only the new billing system comes close."

Caveats / gaps

  • No quantitative comparison of kill + clone + boot interruption time vs. stateless migration baseline (post claims "asymptotically as fast" but gives no measurement).
  • No hydration-rate numbers (bytes/sec, per-volume hydration-completion distribution, worker-level concurrent-migration ceilings).
  • No discussion of write performance on a partially-hydrated clone device — dm-clone writes go only to the clone, but reads on non-hydrated blocks still fall through to the network; the p99 read latency profile during hydration is not disclosed.
  • Corrosion's redesign is deferred to a separate post ("Corrosion deserves its own post.").
  • LSVD is described as a direction, not an announcement. Fly launched LSVD experimentally in 2023; this post does not disclose production uptake, performance, or the eventual migration story off local-NVMe-as-durable.
  • No cross-region migration discussion — the worked example is worker-xx-cdg1-1 → worker-xx-cdg1-2 (rack buddies in the same region). Whether the same mechanism works across regions, and what the failure modes look like when the network block protocol runs over longer paths, is not covered.
  • No discussion of eviction / cleanup if a migration fails mid-hydration (the source worker still holds the original volume; the target has a partially-hydrated clone; flyd's FSM presumably handles this but the post doesn't spell it out).
  • Scheduling policy for the temporary-SAN shape is undiscussed — when many workers are draining simultaneously, which source workers serve which target workers, how load is balanced across the iSCSI fabric, and what the hydration prioritisation logic looks like.
