CONCEPT Cited by 2 sources
Fleet drain operation¶
Definition¶
A fleet-ops capability: mark a physical worker as "no more new placements, and evacuate the workloads you currently host somewhere else". A drained worker can be taken out of service for maintenance, firmware upgrade, hardware swap, or decommissioning without customer-visible disruption. Drain is the load-bearing primitive for large-fleet operations — a cloud provider operates thousands-to-millions of hosts and needs each to be replaceable without an ad-hoc evacuation plan.
Canonical wiki statement¶
Fly.io Sprites, 2026-01-14:
"Worse, from our perspective, is that attached storage anchors workloads to specific physicals. We have lots of reasons to want to move Fly Machines around. Before we did Fly Volumes, that was as simple as pushing a 'drain' button on a server. Imagine losing a capability like that. It took 3 years to get workload migration right with attached storage, and it's still not 'easy'."
(Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])
Why drain is so valuable¶
Drain's value comes from three operational properties:
- Decoupled-from-workload maintenance windows. Worker health and customer workloads no longer synchronise on a shared uptime — the worker can go down while the workload stays up.
- Predictable hardware lifecycle. Rack-a-new-worker-and-drain-the-old is a repeatable playbook. Without drain, each hardware swap becomes a per-workload migration plan.
- Correlated-failure avoidance. A struggling worker (flapping disk, noisy neighbour, DRAM ECC errors) can be drained before it fails, converting a would-be outage into a scheduled migration.
The stateful-VM drain problem¶
Stateless VMs drain trivially — cordon the worker, let existing VMs finish, schedule new VMs elsewhere. Stateful VMs break this: a Machine with a Fly Volume cannot relocate without either copying the volume's bytes or plugging into a shared-storage substrate.
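The stateless case can be sketched in a few lines. This is a minimal illustration, not Fly.io's scheduler: `Worker`, `place`, and `on_vm_exit` are hypothetical names, and the least-loaded placement rule is an assumption made only to keep the example concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    cordoned: bool = False                      # True means "no new placements"
    vms: set[str] = field(default_factory=set)  # VMs currently hosted here


def place(vm: str, workers: list[Worker]) -> Worker:
    """Place a VM on the least-loaded worker that is not cordoned (assumed policy)."""
    candidates = [w for w in workers if not w.cordoned]
    if not candidates:
        raise RuntimeError("no schedulable workers")
    target = min(candidates, key=lambda w: len(w.vms))
    target.vms.add(vm)
    return target


def on_vm_exit(vm: str, worker: Worker) -> bool:
    """Record a VM finishing; return True once a cordoned worker is empty."""
    worker.vms.discard(vm)
    return worker.cordoned and not worker.vms


# Usage: cordon worker "a", send new work elsewhere, wait for "a" to empty.
a = Worker("a", vms={"vm-1"})
b = Worker("b")
a.cordoned = True
placed_on = place("vm-2", [a, b]).name
drained = on_vm_exit("vm-1", a)
```

Nothing here touches workload state, which is why the stateless case is trivial: the only bookkeeping is which worker accepts placements.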
Fly.io's three-year journey, as described in the 2024 Making Machines Move post:
- Pre-Volumes (up to 2021): drain was a button press.
- Post-Volumes (2021-2024): drain regresses. Fly invents snapshot-then-reschedule, but the restored data is only as fresh as the last snapshot, so this falls short of a true drain.
- Mid-2024 onwards: [[patterns/async-block-clone-for-stateful-migration|async block-clone migration]] ships. Workloads relocate live, but "it's still not 'easy'."
Sprites restore drain-as-button-press¶
The object-store-rooted disk design makes drain a pointer-move:
- Mark the worker draining.
- For each Sprite on the worker: stop the VM; start a fresh VM on another worker pointing at the same storage URL.
- Cache on the old worker is irrelevant — a cache, not a truth store. See concepts/read-through-nvme-cache.
- No byte copy. No migration plan. No async clone bookkeeping.
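The steps above can be modelled as follows. This is a sketch under stated assumptions rather than the Sprites implementation: the `Sprite` and `Worker` types, the `drain` function, the least-loaded target choice, and the `s3://example-bucket/...` URLs are all hypothetical. The point it illustrates is that the `Sprite` record, including its storage URL, moves unchanged and no volume bytes are copied.

```python
from dataclasses import dataclass, field


@dataclass
class Sprite:
    name: str
    storage_url: str  # durable state lives behind this URL, not on the worker


@dataclass
class Worker:
    name: str
    draining: bool = False
    sprites: dict[str, Sprite] = field(default_factory=dict)
    cache: dict[str, bytes] = field(default_factory=dict)  # disposable local cache


def drain(worker: Worker, fleet: list[Worker]) -> dict[str, str]:
    """Empty `worker`; return a mapping of sprite name to its new worker name."""
    worker.draining = True  # step 1: no more new placements
    moves: dict[str, str] = {}
    for sprite in list(worker.sprites.values()):
        targets = [w for w in fleet if not w.draining]
        if not targets:
            raise RuntimeError("no worker available to receive sprites")
        target = min(targets, key=lambda w: len(w.sprites))  # assumed policy
        del worker.sprites[sprite.name]        # stop the VM on the old worker
        target.sprites[sprite.name] = sprite   # fresh VM, same storage URL
        moves[sprite.name] = target.name
    worker.cache.clear()  # the cache is not a truth store, so dropping it is safe
    return moves


# Usage with hypothetical names and URLs.
old = Worker("w1")
old.sprites = {
    "s1": Sprite("s1", "s3://example-bucket/sprites/s1"),
    "s2": Sprite("s2", "s3://example-bucket/sprites/s2"),
}
old.cache = {"block-0": b"cached bytes"}
new = Worker("w2")
moves = drain(old, [old, new])
```

The new worker starts with a cold cache in this model; that costs read latency until the cache warms, not correctness.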
Ptacek frames this as a regained capability: object-storage disks don't just enable migration, they restore the trivial drain that Volumes broke.
Operator-visible consequences¶
Drain as a cheap primitive changes ops posture:
- Aggressive hardware rotation becomes possible — e.g., drain-and-reinstall every worker monthly.
- Zone / rack / cabinet maintenance doesn't require per-customer coordination.
- Correlated-failure response becomes an automated playbook: alarm fires → drain-and-reassign.
- Worker fleet heterogeneity becomes easier — mix old and new hardware; drain older hosts on demand.
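The "alarm fires, then drain-and-reassign" playbook could be driven by a rule as simple as the one below. This is an illustrative sketch only: the `HealthSample` fields and both thresholds are assumptions chosen for the example, not values from the sources. The signals themselves (DRAM ECC errors, a flapping disk) are the ones named earlier on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSample:
    worker: str
    ecc_errors_per_hour: int
    disk_flaps_per_hour: int


# Hypothetical thresholds; real values would come from fleet telemetry.
ECC_LIMIT = 10
DISK_FLAP_LIMIT = 3


def workers_to_drain(samples: list[HealthSample]) -> list[str]:
    """Return workers whose health signals exceed a threshold, in first-seen order."""
    flagged: list[str] = []
    for s in samples:
        unhealthy = (
            s.ecc_errors_per_hour > ECC_LIMIT
            or s.disk_flaps_per_hour > DISK_FLAP_LIMIT
        )
        if unhealthy and s.worker not in flagged:
            flagged.append(s.worker)
    return flagged


samples = [
    HealthSample("w1", ecc_errors_per_hour=0, disk_flaps_per_hour=0),
    HealthSample("w2", ecc_errors_per_hour=42, disk_flaps_per_hour=0),
    HealthSample("w3", ecc_errors_per_hour=2, disk_flaps_per_hour=1),
    HealthSample("w2", ecc_errors_per_hour=0, disk_flaps_per_hour=5),
]
flagged = workers_to_drain(samples)
```

A rule this blunt is only reasonable when drain is cheap: a false positive costs one unnecessary drain rather than a per-workload migration plan.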
Drain is distinct from "migration"¶
The two often get used interchangeably but aren't the same:
| Operation | Definition | Orientation |
|---|---|---|
| Drain | Clear a specific worker of workloads | Worker-centric |
| Migration | Move a specific workload to a different worker | Workload-centric |
Drain requires migration as a sub-operation, but is a fleet-operator-level ask ("empty this host"), not a per-workload ask ("move this Machine"). Both are cheap in the Sprites model; both were expensive in the Fly-Volumes model.
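The relationship between the two operations can be made concrete with a small sketch. The `placements` mapping, the function signatures, and the `pick_dest` callback are hypothetical. `migrate` takes a workload as its argument and `drain` takes a worker, which mirrors the orientation column in the table.

```python
from typing import Callable

# Hypothetical scheduler state: workload name -> worker name.
Placements = dict[str, str]


def migrate(placements: Placements, workload: str, dest: str) -> None:
    """Workload-centric: move one named workload to `dest`."""
    placements[workload] = dest


def drain(
    placements: Placements,
    worker: str,
    pick_dest: Callable[[str], str],
) -> list[str]:
    """Worker-centric: migrate every workload currently placed on `worker`."""
    to_move = [w for w, host in placements.items() if host == worker]
    for workload in to_move:
        migrate(placements, workload, pick_dest(workload))
    return to_move


placements = {"m1": "w1", "m2": "w1", "m3": "w2"}
moved = drain(placements, "w1", pick_dest=lambda _workload: "w3")
```

Because `drain` is a loop over `migrate`, its cost is the sum of the per-workload migration costs, so drain is only as cheap as the migration underneath it.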
Seen in¶
- [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]] — canonical wiki statement on drain as regained capability.
- sources/2024-07-30-flyio-making-machines-move — retrospective on losing and partially regaining drain for Fly Volumes.
Related¶
- systems/fly-sprites
- systems/fly-volumes — the feature that broke drain.
- systems/fly-machines — the substrate migration was eventually engineered for.
- concepts/object-storage-as-disk-root — Sprites' architectural basis for trivial drain.
- concepts/durable-state-as-url — the property that reduces drain to a pointer-move.
- patterns/async-block-clone-for-stateful-migration — Fly-Volumes-era workaround.
- patterns/read-through-object-store-volume
- companies/flyio