CONCEPT Cited by 1 source

Fleet drain operation

Drain is the fleet-operations primitive that relocates every workload off a specified worker so the worker can be taken out of service, whether for maintenance, hardware replacement, kernel upgrades, safety rollouts, or simply to de-populate a problematic host. For stateless workloads, drain is trivial: start a replacement instance elsewhere, confirm health, kill the original. For stateful workloads with locally attached storage, drain is hard; that problem is the subject of Fly.io's 2024-07-30 post "Making Machines Move".
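The stateless recipe above (start a replacement, confirm health, kill the original) can be sketched in a few lines. This is a minimal illustration, not Fly.io's or Nomad's implementation: the `Worker` class, the `drain_stateless` function, the least-loaded placement rule, and the `is_healthy` callback are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Worker:
    """Hypothetical worker: a name plus the set of instance IDs it runs."""
    name: str
    instances: set[str] = field(default_factory=set)


def drain_stateless(
    source: Worker,
    targets: list[Worker],
    is_healthy: Callable[[Worker, str], bool],
) -> list[tuple[str, str]]:
    """Move every instance off `source`; return (instance, target name) pairs."""
    moves = []
    # sorted() copies the set, so mutating source.instances in the loop is safe.
    for instance in sorted(source.instances):
        # Assumption: place the replacement on the least-loaded target.
        target = min(targets, key=lambda w: len(w.instances))
        target.instances.add(instance)           # 1. start the replacement
        if not is_healthy(target, instance):     # 2. confirm health
            target.instances.discard(instance)   # roll back; original keeps serving
            continue
        source.instances.discard(instance)       # 3. kill the original
        moves.append((instance, target.name))
    return moves
```

The ordering is the point: the original is removed only after the replacement passes its health check, so a failed replacement leaves the workload running on the source worker instead of causing downtime.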

Fly.io's runbook evolution

Pre-storage era (2020, stateless) — "de-bird that edge server, tell Nomad to drain that worker, and go back to sleep." At Fly's 2020 scale, a fully-loaded stateless worker drained in "just a handful of minutes."

Storage era, pre-clone (2021-2023) — Drain was effectively disabled for stateful Machines. A Machine with a Volume was pinned to the worker holding that Volume, so it could be evacuated only via backup-restore (data loss plus downtime) or by relying on customer-side HA (which not every customer had).

Storage era, post-clone (summer 2024) — "We got to where our infra team can pull 'drain this host' out of their toolbelt without much ceremony." Drain works for stateful Machines via kill → clone → boot with background hydration, orchestrated by flyd's migration FSMs over a temporary-SAN fabric (patterns/temporary-san-for-fleet-drain).

Future (not yet) — "The dream is fully-automated luxury space migration, in which you might get migrated semiregularly, as our systems work not just to drain problematic hosts but to rebalance workloads regularly. No time soon."
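The post-clone sequence in the timeline above (kill → clone → boot with background hydration) can be pictured as a small state machine. The sketch below is illustrative only and is not flyd's code: the `Step` names, the `CloneVolume` class, and the read-through behavior for blocks that have not been copied yet are assumptions made for this example, and an in-memory list stands in for the source volume that would really be read over the network.

```python
from enum import Enum, auto


class Step(Enum):
    """Hypothetical migration steps, in the order the text describes."""
    KILL = auto()
    CLONE = auto()
    BOOT = auto()
    HYDRATE = auto()
    DONE = auto()


class CloneVolume:
    """Destination volume that reads through to the source for missing blocks."""

    def __init__(self, source_blocks: list[bytes]):
        self._source = source_blocks
        self._local: dict[int, bytes] = {}

    @property
    def hydrated(self) -> bool:
        return len(self._local) == len(self._source)

    def read(self, index: int) -> bytes:
        # Assumption: a block that has not been copied yet is fetched on demand.
        if index not in self._local:
            self._local[index] = self._source[index]
        return self._local[index]

    def hydrate_one(self) -> bool:
        """Copy one missing block; return True once every block is local."""
        for index in range(len(self._source)):
            if index not in self._local:
                self._local[index] = self._source[index]
                break
        return self.hydrated


def migrate(source_blocks: list[bytes]) -> tuple[list[Step], CloneVolume]:
    """Run the steps in order; return the visited steps and the new volume."""
    visited = []
    volume = CloneVolume(source_blocks)
    step = Step.KILL
    while True:
        visited.append(step)
        if step is Step.KILL:        # stop the Machine on the source worker
            step = Step.CLONE
        elif step is Step.CLONE:     # create the destination volume, still empty
            volume = CloneVolume(source_blocks)
            step = Step.BOOT
        elif step is Step.BOOT:      # boot on the destination before data arrives
            step = Step.HYDRATE
        elif step is Step.HYDRATE:   # copy remaining blocks in the background
            while not volume.hydrate_one():
                pass
            step = Step.DONE
        else:
            return visited, volume
```

In this sketch the Machine boots before any block has been copied, and a read of an uncopied block is served from the source, which is why downtime is bounded by the kill and boot steps and not by the size of the volume.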

Why drain is operationally load-bearing

Without a working drain, a worker with a hardware issue cannot be taken out of service cleanly — the operator must either tolerate degraded performance until customers move off organically, or accept customer-visible downtime. Drain is the primitive that lets hardware fail gracefully.

Seen in

sources/2024-07-30-flyio-making-machines-move — Canonical wiki phrasing of the drain runbook and its stateful-workload breakdown; the entire post is a retrospective on getting drain back.

Related

concepts/kill-copy-boot-migration-tradeoff — The ordering problem drain faces for stateful workloads.
concepts/block-level-async-clone — The primitive that solves drain.
patterns/async-block-clone-for-stateful-migration — The migration recipe.
patterns/temporary-san-for-fleet-drain — The fleet-level shape.
