

Fleet drain operation

Drain is the fleet-operations primitive of relocating every workload off a specified worker so the worker can be taken out of service: for maintenance, hardware replacement, kernel upgrades, safety rollouts, or simply to de-populate a problematic host. For stateless workloads, drain is trivial: start a replacement instance elsewhere, confirm it is healthy, then kill the original. For stateful workloads with locally-attached storage, drain is hard, and that case is the subject of the 2024-07-30 Fly.io post "Making Machines Move".
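The stateless case can be sketched in a few lines. This is a minimal illustration of the ordering only (start replacement, confirm health, kill original), not Fly.io's or Nomad's implementation; `Worker`, `drain_stateless`, and the `is_healthy` callback are hypothetical names introduced here, and placement on the least-loaded target is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for a host that runs stateless instances."""
    name: str
    instances: set = field(default_factory=set)
    cordoned: bool = False  # True means "place no new work here"


def drain_stateless(source, targets, is_healthy):
    """Move every instance off `source`, returning (instance, target name) pairs.

    `is_healthy(instance, target)` is an assumed health-check callback.
    """
    source.cordoned = True  # stop new placements before moving anything
    moves = []
    for instance in sorted(source.instances):  # sorted() copies, so the set can change
        candidates = [t for t in targets if not t.cordoned and t is not source]
        if not candidates:
            raise RuntimeError("no worker available to receive " + instance)
        # Assumption for this sketch: pick the least-loaded candidate.
        target = min(candidates, key=lambda t: len(t.instances))
        target.instances.add(instance)  # 1. start the replacement elsewhere
        if not is_healthy(instance, target):  # 2. confirm health
            target.instances.discard(instance)  # roll back; the original keeps serving
            raise RuntimeError(instance + " failed health check on " + target.name)
        source.instances.discard(instance)  # 3. only now kill the original
        moves.append((instance, target.name))
    return moves
```

Because the original is removed only after the replacement passes its health check, a failed check leaves the instance running on the source worker. None of this carries over directly to a workload whose data lives on the source's local disk, which is why the stateful case needs a different approach.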

Fly.io's runbook evolution

Pre-storage era (2020, stateless): "de-bird that edge server, tell Nomad to drain that worker, and go back to sleep." At Fly's 2020 scale, a fully-loaded stateless worker drained in "just a handful of minutes."

Storage era, pre-clone (2021-2023): drain was effectively disabled for stateful Machines. Customers with Volumes pinned to a worker could only be evacuated via backup-restore (data loss and downtime) or customer-side HA (not universal).

Storage era, post-clone (summer 2024): "We got to where our infra team can pull 'drain this host' out of their toolbelt without much ceremony." Drain works for stateful Machines via kill → clone → boot with background hydration, orchestrated by flyd's migration FSMs over a temporary-SAN fabric (patterns/temporary-san-for-fleet-drain).
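The kill, clone, boot sequence with background hydration can be illustrated with a toy model. This is a sketch under stated assumptions, not flyd's actual FSM: the `State` names, `CloneVolume`, and `migrate` are hypothetical, the source volume is modeled as an in-memory list of blocks, and the sketch assumes the clone serves reads for not-yet-copied blocks by fetching them from the source on demand. Writes during hydration are left out.

```python
from enum import Enum


class State(Enum):
    """Hypothetical migration states; the real FSM states are not given in the source."""
    RUNNING_ON_SOURCE = "running_on_source"
    KILLED = "killed"
    CLONED = "cloned"
    BOOTED_ON_TARGET = "booted_on_target"
    HYDRATED = "hydrated"


class CloneVolume:
    """Target-side volume that starts empty and pulls blocks from the source."""

    def __init__(self, source_blocks):
        self._source = source_blocks  # stands in for reads over the network
        self._local = {}              # blocks already copied to the target

    @property
    def hydrated(self):
        return len(self._local) == len(self._source)

    def read(self, index):
        # Assumed behavior: a read of a missing block fetches it on demand.
        if index not in self._local:
            self._local[index] = self._source[index]
        return self._local[index]

    def hydrate_one(self):
        """Copy one missing block in the background; return True when done."""
        for index in range(len(self._source)):
            if index not in self._local:
                self._local[index] = self._source[index]
                break
        return self.hydrated


def migrate(source_blocks):
    history = [State.RUNNING_ON_SOURCE]
    history.append(State.KILLED)            # stop the Machine on the source worker
    volume = CloneVolume(source_blocks)     # create the clone; no data copied yet
    history.append(State.CLONED)
    history.append(State.BOOTED_ON_TARGET)  # boot before hydration has finished
    # The booted Machine can read immediately, even from an uncopied block.
    early_read = volume.read(len(source_blocks) - 1) if source_blocks else None
    while not volume.hydrate_one():         # background hydration, run inline here
        pass
    history.append(State.HYDRATED)
    return history, volume, early_read
```

The point of the model is the ordering: the Machine boots on the target before the data has finished moving, so downtime is bounded by the kill and boot steps rather than by the size of the volume. The source worker can only be released once the volume reports itself fully hydrated.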

Future (not yet): "The dream is fully-automated luxury space migration, in which you might get migrated semiregularly, as our systems work not just to drain problematic hosts but to rebalance workloads regularly. No time soon."

Why drain is operationally load-bearing

Without a working drain, a worker with a hardware issue cannot be taken out of service cleanly — the operator must either tolerate degraded performance until customers move off organically, or accept customer-visible downtime. Drain is the primitive that lets hardware fail gracefully.

