CONCEPT Cited by 2 sources

Fleet drain operation

Definition

A fleet-ops capability: mark a physical worker as "no more new placements, and evacuate the workloads you currently host somewhere else". A drained worker can be taken out of service for maintenance, firmware upgrade, hardware swap, or decommissioning without customer-visible disruption. Drain is the load-bearing primitive for large-fleet operations — a cloud provider operates thousands-to-millions of hosts and needs each to be replaceable without an ad-hoc evacuation plan.

Canonical wiki statement

Fly.io Sprites, 2026-01-14:

"Worse, from our perspective, is that attached storage anchors workloads to specific physicals. We have lots of reasons to want to move Fly Machines around. Before we did Fly Volumes, that was as simple as pushing a 'drain' button on a server. Imagine losing a capability like that. It took 3 years to get workload migration right with attached storage, and it's still not 'easy'."

(Source: [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]])

Why drain is so valuable

Drain's value comes from three operational properties:

  1. Decoupled-from-workload maintenance windows. Worker health and customer workloads no longer synchronise on a shared uptime — the worker can go down while the workload stays up.
  2. Predictable hardware lifecycle. Rack-a-new-worker-and-drain-the-old is a repeatable playbook. Without drain, each hardware swap becomes a per-workload migration plan.
  3. Correlated-failure avoidance. A struggling worker (flapping disk, noisy neighbour, DRAM ECC errors) can be drained before it fails, converting a would-be outage into a scheduled migration.

The stateful-VM drain problem

Stateless VMs drain trivially — cordon the worker, let existing VMs finish, schedule new VMs elsewhere. Stateful VMs break this: a Machine with a Fly Volume cannot relocate without either copying the volume's bytes or plugging into a shared-storage substrate.
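The contrast can be made concrete with a minimal sketch. Everything here is hypothetical (the `VM` and `Worker` types, the `volume_host` field, the `cordon_and_drain` helper) and is not Fly.io's scheduler. It also simplifies "let existing VMs finish" into an immediate relocation. The point it illustrates: once the worker is cordoned, a VM pinned to local storage has no eligible destination.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VM:
    name: str
    # Name of the worker holding this VM's attached volume; None means stateless.
    volume_host: Optional[str] = None


@dataclass
class Worker:
    name: str
    cordoned: bool = False
    vms: List[VM] = field(default_factory=list)


def eligible_workers(vm: VM, workers: List[Worker]) -> List[Worker]:
    """Workers that may accept vm: not cordoned, and holding its volume if it has one."""
    return [
        w for w in workers
        if not w.cordoned and (vm.volume_host is None or vm.volume_host == w.name)
    ]


def cordon_and_drain(target: Worker, workers: List[Worker]) -> List[str]:
    """Cordon target and relocate whatever can move. Returns names of stuck VMs."""
    target.cordoned = True  # no new placements land here from now on
    stuck = []
    for vm in list(target.vms):
        candidates = eligible_workers(vm, workers)
        if not candidates:
            # Attached storage anchors this VM to the cordoned worker.
            stuck.append(vm.name)
            continue
        target.vms.remove(vm)
        candidates[0].vms.append(vm)
    return stuck
```

With a stateless `web` VM and a `db` VM whose volume lives on the worker being drained, only `web` moves; `db` is reported as stuck until its bytes are copied or its storage is reachable from elsewhere.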

Fly.io's three-year journey, as described in the 2024 "Making Machines Move" post:

  • Pre-Volumes (up to 2021): drain was a button press.
  • Post-Volumes (2021-2024): drain regresses; Fly invents snapshot-then-reschedule, but snapshots are stale-since-snapshot. Not as good as a true drain.
  • Mid-2024 onwards: [[patterns/async-block-clone-for-stateful-migration|async block-clone migration]] ships. Workload relocates live, but "it's still not 'easy'."

Sprites restore drain-as-button-press

The object-store-rooted disk design makes drain a pointer-move:

  • Mark the worker draining.
  • For each Sprite on the worker: stop the VM; start a fresh VM on another worker pointing at the same storage URL.
  • Cache on the old worker is irrelevant — a cache, not a truth store. See concepts/read-through-nvme-cache.
  • No byte copy. No migration plan. No async clone bookkeeping.
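
The steps above can be sketched as follows. This is an illustrative model only, not the Sprites implementation: the `Sprite` and `Worker` types, the `storage_url` field, the round-robin choice of destination, and the `obj://` URL scheme used in examples are all assumptions. What it shows is that the only thing handed to the new worker is the storage URL, and that the old worker's cache is simply discarded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Sprite:
    name: str
    storage_url: str  # hypothetical: where the disk is rooted in object storage
    running: bool = True


@dataclass
class Worker:
    name: str
    draining: bool = False
    sprites: Dict[str, Sprite] = field(default_factory=dict)
    # Local read-through cache keyed by storage URL. A cache, not a truth store.
    cache: Dict[str, bytes] = field(default_factory=dict)


def drain(old: Worker, fleet: List[Worker]) -> List[Tuple[str, str]]:
    """Empty `old` by restarting each Sprite elsewhere against the same URL."""
    old.draining = True
    targets = [w for w in fleet if not w.draining]
    if not targets:
        raise RuntimeError("no worker available to receive Sprites")
    moved = []
    for i, sprite in enumerate(list(old.sprites.values())):
        sprite.running = False             # stop the VM on the old worker
        dest = targets[i % len(targets)]   # assumption: naive round-robin placement
        del old.sprites[sprite.name]
        # Fresh VM on the new worker; only the pointer (URL) carries over.
        dest.sprites[sprite.name] = Sprite(sprite.name, sprite.storage_url)
        moved.append((sprite.name, dest.name))
    old.cache.clear()  # nothing is copied; the new worker's cache starts cold
    return moved
```

Note what the sketch does not contain: no volume copy, no snapshot, and no clone-progress tracking. The destination worker's cache starts empty and would be expected to warm up on demand.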

Ptacek frames this as a regained capability: object-storage disks don't just enable migration, they restore the trivial drain that Volumes broke.

Operator-visible consequences

Drain as a cheap primitive changes ops posture:

  • Aggressive hardware rotation becomes possible — e.g., drain-and-reinstall every worker monthly.
  • Zone / rack / cabinet maintenance doesn't require per-customer coordination.
  • Correlated-failure response becomes an automated playbook: alarm fires → drain-and-reassign.
  • Worker fleet heterogeneity becomes easier to manage: mix old and new hardware and drain older hosts on demand.
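
As a rough illustration of the "alarm fires → drain-and-reassign" playbook, the selection step might look like the sketch below. The health fields, the threshold values, and the concurrency cap are all assumptions made for the example; none of them come from the Fly.io posts.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class HealthSample:
    worker: str
    ecc_errors: int   # corrected DRAM ECC errors in the sampling window
    disk_flaps: int   # disk offline/online transitions in the sampling window


# Hypothetical thresholds, chosen only to make the example concrete.
ECC_LIMIT = 100
FLAP_LIMIT = 3


def workers_to_drain(
    samples: Iterable[HealthSample],
    already_draining: Set[str],
    max_concurrent: int = 2,
) -> List[str]:
    """Pick unhealthy workers to drain, without exceeding the concurrency cap."""
    budget = max(0, max_concurrent - len(already_draining))
    unhealthy = [
        s.worker
        for s in samples
        if s.worker not in already_draining
        and (s.ecc_errors > ECC_LIMIT or s.disk_flaps > FLAP_LIMIT)
    ]
    return unhealthy[:budget]
```

The cap is a design choice rather than a requirement: when drain is cheap, the remaining constraint is spare capacity on the rest of the fleet, so an automated playbook would plausibly limit how many workers it empties at once.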

Drain is distinct from "migration"

The two often get used interchangeably but aren't the same:

| Operation | Definition | Orientation |
|---|---|---|
| Drain | Clear a specific worker of workloads | Worker-centric |
| Migration | Move a specific workload to a different worker | Workload-centric |

Drain requires migration as a sub-operation, but is a fleet-operator-level ask ("empty this host"), not a per-workload ask ("move this Machine"). Both are cheap in the Sprites model; both were expensive in the Fly-Volumes model.
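
The relationship between the two operations can be shown in a few lines. This is a toy model under stated assumptions: `placement` is a hypothetical mapping from workload name to worker name, and `migrate` stands in for whatever the real per-workload move costs.

```python
from typing import Dict, List


def migrate(placement: Dict[str, str], workload: str, dest: str) -> None:
    """Workload-centric: move one named workload to a chosen worker."""
    if workload not in placement:
        raise KeyError(workload)
    placement[workload] = dest


def drain(placement: Dict[str, str], worker: str, candidates: List[str]) -> List[str]:
    """Worker-centric: empty one worker by migrating everything it hosts."""
    targets = [w for w in candidates if w != worker]
    if not targets:
        raise ValueError("no destination workers")
    hosted = [wl for wl, host in placement.items() if host == worker]
    for i, wl in enumerate(hosted):
        # Drain is a loop of migrations; its cost is the sum of theirs.
        migrate(placement, wl, targets[i % len(targets)])
    return hosted
```

Because `drain` is only as cheap as the `migrate` calls inside it, making the per-workload move trivial is what makes the per-worker operation trivial.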

Seen in

  • [[sources/2026-01-14-flyio-the-design-implementation-of-sprites]]: canonical wiki statement on drain as regained capability.
  • sources/2024-07-30-flyio-making-machines-move: retrospective on losing and partially regaining drain for Fly Volumes.