CONCEPT Cited by 1 source

kill / copy / boot migration tradeoff¶

The kill / copy / boot migration tradeoff is the classical dilemma in migrating a stateful workload that is anchored to a specific host's local storage. In a naive model built only from a blocking copy, a boot of the new instance, and a kill of the original instance, no ordering of the three both preserves data and bounds interruption time. Hence the need for a different primitive: async clone.

Definition¶

Given a stateful workload (e.g. a VM with a large attached volume) that needs to be relocated from host A to host B, the operator must sequence three operations:

copy — transfer the volume data from A to B.
boot — start a new workload instance on B that mounts the transferred volume.
kill — stop the old workload instance on A.

Every ordering of these operations has a failure mode:

Order	Failure mode
`copy` → `boot` → `kill`	Data loss. The original workload on `A` keeps writing during and after the copy; any write that lands on `A` after the copy has already read that block is never transferred, and is lost when `A` is killed.
`kill` → `copy` → `boot`	Unbounded downtime. Interruption is kill latency + data transfer time + boot time, and transfer time grows with volume size. At multi-GB volumes and many concurrent migrations, this runs to minutes.
`boot` → `copy` → `kill` / `copy` → `kill` → `boot` / etc.	Both. Double-write or split-brain on shared state.
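
The first two rows can be made concrete with a toy model. The sketch below is illustrative only: the function names, the two-block volume, and the throughput figures are assumptions for the example, not anything taken from the source.

```python
def copy_boot_kill(late_writes):
    """copy -> boot -> kill on a toy two-block volume.

    late_writes: (block, value) pairs that A accepts after the copy
    has already read those blocks. Returns the data that is lost.
    """
    a = {0: "v0", 1: "v0"}      # volume on host A
    b = dict(a)                 # copy: B receives A's blocks as they were
    for block, value in late_writes:
        a[block] = value        # A is still live, so it keeps accepting writes
    # boot B, then kill A: anything on A that B never received is gone
    return {blk: val for blk, val in a.items() if b.get(blk) != val}


def kill_copy_boot_downtime(volume_gb, gb_per_s, kill_s, boot_s):
    """kill -> copy -> boot: no writes are lost, but the workload is
    unavailable for the whole kill + transfer + boot window (seconds)."""
    return kill_s + volume_gb / gb_per_s + boot_s


lost = copy_boot_kill([(1, "v1")])
downtime = kill_copy_boot_downtime(volume_gb=100, gb_per_s=0.5, kill_s=2, boot_s=3)
```

With these assumed numbers, a single late write is lost in the first ordering, while the second ordering loses nothing but keeps a 100 GB volume offline for a little over three minutes.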

The canonical worked phrasing (Fly.io, 2024-07-30)¶

From Making Machines Move:

Copy, boot, kill loses data. Kill, copy, boot takes too long.

Fly's workaround: replace copy with a new async clone primitive so the order becomes kill → clone → boot. clone returns immediately, and the data transfer happens afterwards: in the background, and on the read path when the booted workload touches a block that has not arrived yet. Interruption becomes "asymptotically as fast as stateless migration."
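
To show why kill → clone → boot avoids both failure modes, here is a minimal in-memory sketch of an async clone. It is a toy model under stated assumptions, not Fly.io's implementation: the `LazyClone` class, its method names, and the dict-backed "volume" are all hypothetical.

```python
class LazyClone:
    """Toy async clone: usable immediately, filled in lazily."""

    def __init__(self, source):
        self.source = source   # old volume on host A, frozen because A was killed
        self.local = {}        # blocks that already exist on host B

    def read(self, block):
        # Read path: a block that has not arrived yet is fetched on first access.
        if block not in self.local:
            self.local[block] = self.source[block]
        return self.local[block]

    def write(self, block, value):
        # A full-block write supersedes the source copy, so no fetch is needed.
        self.local[block] = value

    def hydrate_one(self):
        # Background path: copy one missing block; return False when none remain.
        for block in self.source:
            if block not in self.local:
                self.local[block] = self.source[block]
                return True
        return False


source = {0: "a", 1: "b", 2: "c"}
clone = LazyClone(source)        # "clone" returns at once; B can boot now
clone.write(1, "B")              # new write lands only on B
first_read = clone.read(2)       # fetched from the source on demand
while clone.hydrate_one():       # background transfer finishes the rest
    pass
```

Because the original is killed before the clone is created, the source never changes underneath it, so no write can be stranded on A; and because the clone is usable before any data has moved, interruption no longer scales with volume size.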

Workaround space¶

HA on the customer side — if the workload runs redundantly, interruption of a single instance is invisible. "Do this!" But Fly must live in the same world as customers, "many of whom don't run in high-availability configurations."
Backup + restore — insufficient: "a 'restore from backup migration' will lose data, and a 'backup and restore' migration incurs untenable downtime." Backup intervals leave a data-loss window; restore interruption is worse than kill → copy → boot.
Pre-copy + delta sync + cutover (the VMware-live-migration shape) — possible for VM memory; for large on-disk volumes, the delta keeps growing if the workload writes continuously.
Async clone — the Fly.io answer in the 2024 post. Decouple destination availability from data transfer.
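
The pre-copy + delta sync limitation can be illustrated with a deliberately simplified model. The assumptions are mine, not the source's: each round re-copies everything dirtied during the previous round, and the workload dirties data at a constant rate. The function and parameter names are hypothetical.

```python
def precopy_rounds(volume_gb, copy_gb_per_s, dirty_gb_per_s, cutover_gb, max_rounds=50):
    """Return the delta size after each sync round, stopping once the
    delta is small enough to pause the workload and cut over.
    Returns None if the delta never fits within max_rounds."""
    deltas = []
    delta = volume_gb                       # first round copies the whole volume
    for _ in range(max_rounds):
        round_seconds = delta / copy_gb_per_s
        # Data dirtied while this round was copying becomes the next delta.
        delta = min(volume_gb, dirty_gb_per_s * round_seconds)
        deltas.append(delta)
        if delta <= cutover_gb:
            return deltas
    return None


converging = precopy_rounds(100, copy_gb_per_s=1.0, dirty_gb_per_s=0.25, cutover_gb=0.5)
stuck = precopy_rounds(100, copy_gb_per_s=1.0, dirty_gb_per_s=1.0, cutover_gb=0.5)
```

In this model the delta shrinks by the ratio of dirty rate to copy rate each round, so it converges only while the workload writes more slowly than the sync can copy. A workload that dirties data as fast as it is copied never reaches a cutover point, which is the case async clone sidesteps by killing the source first.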

Seen in¶

sources/2024-07-30-flyio-making-machines-move — Canonical phrasing of the tradeoff; Fly.io's clone-based workaround.

Related¶

concepts/block-level-async-clone — The workaround primitive.
concepts/fleet-drain-operation — The operational use case.
patterns/async-block-clone-for-stateful-migration — The full migration recipe that resolves the tradeoff.
