
PATTERN

Temporary SAN for fleet drain

Intent

Turn a draining worker's locally-attached storage into a network-accessible block device for the duration of the drain, so that target workers elsewhere in the fleet can pull blocks from it on demand to satisfy reads on freshly-booted replica workloads. In short: spin up a SAN you didn't have the rest of the year, just while you're draining.

When to use

  • Your normal storage tier is local-attached (not a SAN).
  • You need to drain workers occasionally but not continuously.
  • Full permanent SAN infrastructure in every region is not affordable or not yet justified.
  • You have a mesh network between workers that can carry block-device traffic (WireGuard, VPC peering, cross-AZ bandwidth).

Structure

The post's canonical phrasing:

To drain a worker with minimal downtime and no lost data, we turn workers into temporary SANs, serving the volumes we need to drain to fresh-booted replica Fly Machines on a bunch of "target" physicals. Those SANs — combinations of dm-clone, iSCSI, and our flyd orchestrator — track the blocks copied from the origin, copying each one exactly once and cleaning up when the original volume has been fully copied.

Components:

  • Source workers become iSCSI targets for the Volumes being drained.
  • Destination workers become iSCSI initiators and stack dm-clone on top of the remote device.
  • Orchestrator (flyd) tracks which Volumes are being drained from where to where.
  • The SAN exists only during the drain; when hydration completes the target no longer needs the source, and the temporary SAN shape evaporates.
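The component roles above can be sketched with standard Linux tooling — LIO/targetcli for the iSCSI target, open-iscsi for the initiator, and dmsetup for the dm-clone stack. All IQNs, device paths, volume names, and addresses below are invented for illustration; Fly's actual implementation drives a userspace iSCSI target from flyd rather than running these commands by hand.

```shell
# --- Source worker: export the draining Volume's local device as an iSCSI LUN ---
SRC_DEV=/dev/nvme0n1p3                        # local device backing the Volume (illustrative)
TARGET_IQN=iqn.2024-07.example:drain-vol1     # hypothetical IQN

targetcli /backstores/block create name=drain-vol1 dev=$SRC_DEV
targetcli /iscsi create $TARGET_IQN
targetcli /iscsi/$TARGET_IQN/tpg1/luns create /backstores/block/drain-vol1
targetcli /iscsi/$TARGET_IQN/tpg1/acls create iqn.2024-07.example:dest-worker

# --- Destination worker: log in over the mesh, stack dm-clone on top ---
iscsiadm -m discovery -t sendtargets -p 10.0.0.5    # source's mesh address
iscsiadm -m node -T $TARGET_IQN -p 10.0.0.5 --login
REMOTE=/dev/sdb                                     # device node the initiator created (illustrative)

SECTORS=$(blockdev --getsz $REMOTE)
# dm-clone table: clone <metadata dev> <destination dev> <source dev> <region size (sectors)>
# Reads of unhydrated regions go to the remote source; background hydration
# copies each region to the local destination exactly once.
dmsetup create vol1-clone --table \
  "0 $SECTORS clone /dev/vg0/vol1-meta /dev/vg0/vol1 $REMOTE 8"
```

The replica Machine then mounts `/dev/mapper/vol1-clone` and serves traffic immediately, while hydration proceeds underneath it.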

Consequences

Upsides:

  • You get drain without running a SAN full-time. Hardware cost stays at local-NVMe levels; SAN-like capability appears on demand.
  • The fabric is regional-or-less — iSCSI traffic rides the same mesh (Fly's 6PN) that already connects workers.
  • Scales naturally with fleet size — any worker can be an iSCSI source or target when needed.

Downsides:

  • During drain, the source worker is busier than it was before (serving both its remaining workloads and an iSCSI target stream).
  • Network disruption during drain is more impactful — reads from partially-hydrated clones depend on the network block protocol staying up. (This is why Fly's switch from NBD to iSCSI mattered: iSCSI recovers from interrupted sessions more gracefully.)
  • Orchestrator state-tracking gets more complex — flyd's FSMs have to cope with partial-hydration / failed-migration / cleanup scenarios.
  • You still need a long-term plan. The post gestures at LSVD as Fly's medium-term evolution away from local-NVMe-as-durable.

Relation to classical SAN

A classical SAN (EBS, FlashArray, Ceph) is:

  • Always on. Every compute host sees the SAN all the time.
  • Durable. The SAN is the authoritative copy of the data.
  • Expensive. Fabric + controllers + replication across AZs.

Temporary-SAN-for-drain is:

  • On demand. Only exists when a drain is happening.
  • Not durable in the SAN layer. Durability is still the local NVMe's responsibility; the SAN is a transport for migration, not a store.
  • Cheap. Reuses existing mesh + kernel DM + userspace iSCSI.
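The "on demand" and "transport, not a store" properties come out in the teardown: once dm-clone reports every region hydrated, the remote source is dead weight and can be dropped. A sketch, continuing the invented names above (`vol1-clone`, `/dev/vg0/vol1`, the example IQN):

```shell
# Progress check: dm-clone's status line includes a hydrated/total region count.
dmsetup status vol1-clone

# Once fully hydrated, swap the clone table for a plain linear mapping onto
# the now-complete local destination device, then drop the iSCSI session.
# The temporary SAN evaporates; durability is back to local NVMe alone.
SECTORS=$(blockdev --getsz /dev/vg0/vol1)   # same length as the clone table
dmsetup suspend vol1-clone
dmsetup load vol1-clone --table "0 $SECTORS linear /dev/vg0/vol1 0"
dmsetup resume vol1-clone
iscsiadm -m node -T iqn.2024-07.example:drain-vol1 --logout
```

The suspend/load/resume dance swaps tables without the workload above noticing; this is the standard decommission flow from the kernel's dm-clone documentation.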

Known uses

  • Fly.io (2024) — the canonical instance of this pattern; the entire shape the 2024-07-30 post describes. "Worker physicals become temporary SANs serving volumes to fresh-booted replica Fly Machines."
