kill / copy / boot migration tradeoff
The kill / copy / boot migration tradeoff is the classical
dilemma of stateful-workload migration: relocating a workload
that is anchored to a specific host's local storage. In a
naive model built only from copy and boot, no ordering of
copy, boot, and kill of the original instance both preserves
data and bounds interruption time. Hence the need for a third
primitive: async clone.
Definition
Given a stateful workload (e.g. a VM with a large attached
volume) that needs to be relocated from host A to host B, the
operator must sequence three operations:
- copy: transfer the volume data from A to B.
- boot: start a new workload instance on B that mounts the transferred volume.
- kill: stop the old workload instance on A.
Every ordering of these operations has a failure mode:
| Order | Failure mode |
|---|---|
| copy → boot → kill | Data loss. The original workload on A keeps writing during the copy; any write that lands on A after copy completes is lost when A is killed. |
| kill → copy → boot | Unbounded downtime. Interruption is (kill latency + data transfer time + boot time). At multi-GB volumes and many concurrent migrations, this is minutes. |
| boot → copy → kill / copy → kill → boot / etc. | Both. Double-write or split-brain on shared state. |
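The first two rows of the table can be checked with a toy timeline model. This is a sketch under stated assumptions, not a description of any real migration system: the `Costs` fields, the `evaluate` helper, and every number are hypothetical. It assumes the operations run strictly one after another, that the copy captures the volume as of the moment it starts, and that A accepts writes until kill begins.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    kill_s: float  # hypothetical time to stop the instance on A
    copy_s: float  # hypothetical time to transfer the volume from A to B
    boot_s: float  # hypothetical time to start the instance on B


def evaluate(order, costs, writes_per_s):
    """Return (downtime_s, lost_writes) for one sequential ordering."""
    duration = {"kill": costs.kill_s, "copy": costs.copy_s, "boot": costs.boot_s}
    start, end, t = {}, {}, 0.0
    for op in order:
        start[op] = t
        t += duration[op]
        end[op] = t
    # Assumption: writes that reach A after the copy starts never reach B.
    lost_writes = max(0.0, start["kill"] - start["copy"]) * writes_per_s
    # The service is down from the start of kill until boot has finished.
    downtime_s = max(0.0, end["boot"] - start["kill"])
    return downtime_s, lost_writes


costs = Costs(kill_s=1.0, copy_s=300.0, boot_s=2.0)
for order in (("copy", "boot", "kill"), ("kill", "copy", "boot")):
    downtime_s, lost_writes = evaluate(order, costs, writes_per_s=10.0)
    print(" -> ".join(order), "downtime:", downtime_s, "lost writes:", lost_writes)
```

Under these assumptions the first ordering shows no downtime but loses every write made while the copy was in flight, and the second loses nothing but is down for the full kill + copy + boot interval. The model does not capture the split-brain case in the last row, where both instances run at once.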
The canonical worked phrasing (Fly.io, 2024-07-30)
From Making Machines Move:
"Copy, boot, kill loses data. Kill, copy, boot takes too long."
Fly's workaround: replace copy with a new
async clone primitive so
the order becomes kill → clone → boot, where clone
returns immediately and data transfer happens in the background
and on the read path. Interruption becomes "asymptotically as fast
as stateless migration."
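The idea of a clone that "returns immediately" can be illustrated with a small in-memory model. This is only a sketch, not Fly.io's implementation: the `AsyncClone` class and its methods are hypothetical names, blocks are modelled as list entries, writes are assumed to replace whole blocks, and the source is assumed to be frozen because the instance on A has already been killed.

```python
class AsyncClone:
    """Toy block device on B that fills itself lazily from a frozen source on A."""

    def __init__(self, source_blocks):
        # Creating the clone copies nothing, so it is usable right away.
        self._source = source_blocks  # assumed read-only after kill
        self._local = {}              # block index -> data already present on B

    def read(self, index):
        # Read path: fetch a block from the source the first time it is needed.
        if index not in self._local:
            self._local[index] = self._source[index]
        return self._local[index]

    def write(self, index, data):
        # Whole-block write: the source copy of this block is no longer needed.
        self._local[index] = data

    def hydrate_step(self):
        """Background path: copy one missing block. Returns False when done."""
        for index in range(len(self._source)):
            if index not in self._local:
                self._local[index] = self._source[index]
                return True
        return False

    @property
    def hydrated(self):
        return len(self._local) == len(self._source)


clone = AsyncClone([b"a", b"b", b"c"])  # available before any data has moved
clone.write(0, b"z")                    # new write lands on B only
first_read = clone.read(1)              # pulled from the source on demand
while clone.hydrate_step():             # background transfer finishes the rest
    pass
print(first_read, clone.read(0), clone.hydrated)
```

The point of the design is that the new instance can boot against the clone as soon as it exists; the cost of the transfer moves from the interruption window to slower first reads of blocks that have not yet arrived.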
Workaround space
- HA on the customer side: if the workload runs redundantly, interruption of a single instance is invisible. "Do this!" But Fly must live in the same world as customers, "many of whom don't run in high-availability configurations."
- Backup + restore: insufficient. "a 'restore from backup migration' will lose data, and a 'backup and restore' migration incurs untenable downtime." Backup intervals leave a data-loss window; restore interruption is worse than kill → copy → boot.
- Pre-copy + delta sync + cutover (the VMware-live-migration shape): possible for VM memory; for large on-disk volumes, the delta keeps growing if the workload writes continuously.
- Async clone: the Fly.io answer in the 2024 post. Decouple destination availability from data transfer.
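The convergence problem with pre-copy + delta sync can be made concrete with a little arithmetic. The sketch below is a simplified model with hypothetical names and rates: each round transfers everything currently outstanding at a fixed link rate, and while that round runs the workload dirties data at a fixed rate. Cutover is possible only once the outstanding delta falls below a chosen threshold.

```python
def precopy_rounds(volume_gb, link_gb_per_s, dirty_gb_per_s, cutover_gb, max_rounds=50):
    """Return the duration of each sync round, or None if the delta never
    shrinks below cutover_gb within max_rounds (all parameters hypothetical)."""
    remaining_gb = volume_gb
    round_seconds = []
    for _ in range(max_rounds):
        if remaining_gb <= cutover_gb:
            return round_seconds  # delta is small enough to stop and cut over
        seconds = remaining_gb / link_gb_per_s
        round_seconds.append(seconds)
        # Data dirtied during this round becomes the next round's delta,
        # capped at the size of the whole volume.
        remaining_gb = min(volume_gb, dirty_gb_per_s * seconds)
    return None


# Writes at a tenth of the link rate: each round is ten times shorter.
print(precopy_rounds(100, link_gb_per_s=1.0, dirty_gb_per_s=0.1, cutover_gb=0.5))
# Writes at the link rate: the delta never shrinks, so there is no cutover point.
print(precopy_rounds(100, link_gb_per_s=1.0, dirty_gb_per_s=1.0, cutover_gb=0.5))
```

In this model the delta shrinks by the ratio of dirty rate to link rate each round, so the approach converges only when the workload writes more slowly than the link can drain it.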
Seen in
- sources/2024-07-30-flyio-making-machines-move — Canonical
phrasing of the tradeoff; Fly.io's
clone-based workaround.
Related
- concepts/block-level-async-clone — The workaround primitive.
- concepts/fleet-drain-operation — The operational use case.
- patterns/async-block-clone-for-stateful-migration — The full migration recipe that resolves the tradeoff.