
PATTERN

Temporary SAN for fleet drain

Intent

Turn a draining worker's locally-attached storage into a network-accessible block device for the duration of the drain, so that target workers elsewhere in the fleet can pull blocks from it on demand to satisfy reads on freshly-booted replica workloads. In short: spin up a SAN you didn't have the rest of the year, just while you're draining.

When to use

  • Your normal storage tier is local-attached (not a SAN).
  • You need to drain workers occasionally but not continuously.
  • Full permanent SAN infrastructure in every region is not affordable or not yet justified.
  • You have a mesh network between workers that can carry block-device traffic (WireGuard, VPC peering, cross-AZ bandwidth).

Structure

The post's canonical phrasing:

To drain a worker with minimal downtime and no lost data, we turn workers into temporary SANs, serving the volumes we need to drain to fresh-booted replica Fly Machines on a bunch of "target" physicals. Those SANs — combinations of dm-clone, iSCSI, and our flyd orchestrator — track the blocks copied from the origin, copying each one exactly once and cleaning up when the original volume has been fully copied.

Components:

  • Source workers become iSCSI targets for the Volumes being drained.
  • Destination workers become iSCSI initiators and stack dm-clone on top of the remote device.
  • Orchestrator (flyd) tracks which Volumes are being drained from where to where.
  • The SAN exists only during the drain; when hydration completes the target no longer needs the source, and the temporary SAN shape evaporates.
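The component roles above can be sketched with standard Linux tooling — LIO/targetcli for the iSCSI target, open-iscsi for the initiator, and dmsetup for the dm-clone stack. All IQNs, device paths, volume names, and addresses below are invented for illustration; Fly's actual implementation drives a userspace iSCSI target from flyd rather than running these commands by hand.

```shell
# --- Source worker: export the draining Volume's local device as an iSCSI LUN ---
SRC_DEV=/dev/nvme0n1p3                        # local device backing the Volume (illustrative)
TARGET_IQN=iqn.2024-07.example:drain-vol1     # hypothetical IQN

targetcli /backstores/block create name=drain-vol1 dev=$SRC_DEV
targetcli /iscsi create $TARGET_IQN
targetcli /iscsi/$TARGET_IQN/tpg1/luns create /backstores/block/drain-vol1
targetcli /iscsi/$TARGET_IQN/tpg1/acls create iqn.2024-07.example:dest-worker

# --- Destination worker: log in over the mesh, stack dm-clone on top ---
iscsiadm -m discovery -t sendtargets -p 10.0.0.5    # source's mesh address
iscsiadm -m node -T $TARGET_IQN -p 10.0.0.5 --login
REMOTE=/dev/sdb                                     # device node the initiator created (illustrative)

SECTORS=$(blockdev --getsz $REMOTE)
# dm-clone table: clone <metadata dev> <destination dev> <source dev> <region size (sectors)>
# Reads of unhydrated regions go to the remote source; background hydration
# copies each region to the local destination exactly once.
dmsetup create vol1-clone --table \
  "0 $SECTORS clone /dev/vg0/vol1-meta /dev/vg0/vol1 $REMOTE 8"
```

The replica Machine then mounts `/dev/mapper/vol1-clone` and serves traffic immediately, while hydration proceeds underneath it.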

Consequences

Upsides:

  • You get drain without running a SAN full-time. Hardware cost stays at local-NVMe levels; SAN-like capability appears on demand.
  • The fabric is regional-or-less — iSCSI traffic rides the same mesh (Fly's 6PN) that already connects workers.
  • Scales naturally with fleet size — any worker can be an iSCSI source or target when needed.

Downsides:

  • During drain, the source worker is busier than it was before (serving both its remaining workloads and an iSCSI target stream).
  • Network disruption during drain is more impactful — reads from partially-hydrated clones depend on the network block protocol staying up. (This is why Fly's switch from NBD to iSCSI mattered: iSCSI recovers from interrupted sessions more gracefully.)
  • Orchestrator state-tracking gets more complex — flyd's FSMs have to cope with partial-hydration / failed-migration / cleanup scenarios.
  • You still need a long-term plan. The post gestures at LSVD as Fly's medium-term evolution away from local-NVMe-as-durable.

Relation to classical SAN

A classical SAN (EBS, FlashArray, Ceph) is:

  • Always on. Every compute host sees the SAN all the time.
  • Durable. The SAN is the authoritative copy of the data.
  • Expensive. Fabric + controllers + replication across AZs.

Temporary-SAN-for-drain is:

  • On demand. Only exists when a drain is happening.
  • Not durable in the SAN layer. Durability is still the local NVMe's responsibility; the SAN is a transport for migration, not a store.
  • Cheap. Reuses existing mesh + kernel DM + userspace iSCSI.
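The "on demand" and "transport, not a store" properties come out in the teardown: once dm-clone reports every region hydrated, the remote source is dead weight and can be dropped. A sketch, continuing the invented names above (`vol1-clone`, `/dev/vg0/vol1`, the example IQN):

```shell
# Progress check: dm-clone's status line includes a hydrated/total region count.
dmsetup status vol1-clone

# Once fully hydrated, swap the clone table for a plain linear mapping onto
# the now-complete local destination device, then drop the iSCSI session.
# The temporary SAN evaporates; durability is back to local NVMe alone.
SECTORS=$(blockdev --getsz /dev/vg0/vol1)   # same length as the clone table
dmsetup suspend vol1-clone
dmsetup load vol1-clone --table "0 $SECTORS linear /dev/vg0/vol1 0"
dmsetup resume vol1-clone
iscsiadm -m node -T iqn.2024-07.example:drain-vol1 --logout
```

The suspend/load/resume dance swaps tables without the workload above noticing; this is the standard decommission flow from the kernel's dm-clone documentation.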

Known uses

  • Fly.io (2024) — the canonical instance of this pattern; the entire shape the 2024-07-30 post describes. "Worker physicals become temporary SANs serving volumes to fresh-booted replica Fly Machines."
