CONCEPT Cited by 1 source

kill / copy / boot migration tradeoff

The kill / copy / boot migration tradeoff is the classical dilemma that stateful-workload migration faces when relocating a workload anchored to a specific host's local storage. In a naive two-step copy + boot model, there is no ordering of copy, boot, and kill of the original instance that both preserves data and bounds interruption time — hence the need for a third primitive: async clone.

Definition

Given a stateful workload (e.g. a VM with a large attached volume) that needs to be relocated from host A to host B, the operator must sequence three operations:

  • copy — transfer the volume data from A to B.
  • boot — start a new workload instance on B that mounts the transferred volume.
  • kill — stop the old workload instance on A.

Every ordering of these operations has a failure mode:

Order | Failure mode
copy → boot → kill | Data loss. The original workload on A keeps writing during the copy; any write that lands on A after copy completes is lost when A is killed.
kill → copy → boot | Unbounded downtime. Interruption is (kill latency + data transfer time + boot time). At multi-GB volumes and many concurrent migrations, this is minutes.
boot → copy → kill / copy → kill → boot / etc. | Both. Double-write or split-brain on shared state.
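
The two orderings quoted most often can be compared with a small back-of-the-envelope model. This is only a sketch under stated assumptions: the function names, the parameters (link speed, write rate, boot and kill latency), and the simplification that every write landing on A after the copy completes is lost are illustrative and do not come from the source post.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    lost_writes: int    # writes accepted on A that never reach B
    downtime_s: float   # seconds during which no instance is serving


def copy_seconds(volume_gb: float, link_gbps: float) -> float:
    """Time to move the volume over a link, ignoring protocol overhead (assumption)."""
    return volume_gb * 8 / link_gbps


def copy_boot_kill(write_rate_per_s: float, boot_s: float) -> Outcome:
    # A keeps serving until B is up, so there is no downtime in this toy model,
    # but every write that lands on A after the copy finished is lost at kill time.
    return Outcome(lost_writes=int(write_rate_per_s * boot_s), downtime_s=0.0)


def kill_copy_boot(volume_gb: float, link_gbps: float,
                   kill_s: float, boot_s: float) -> Outcome:
    # Nothing writes during the copy, so no data is lost,
    # but the workload is offline for the whole transfer.
    downtime = kill_s + copy_seconds(volume_gb, link_gbps) + boot_s
    return Outcome(lost_writes=0, downtime_s=downtime)


# Hypothetical numbers: 100 GB volume, 10 Gbit/s link, 50 writes/s, 2 s boot, 1 s kill.
print(copy_boot_kill(write_rate_per_s=50, boot_s=2))
print(kill_copy_boot(volume_gb=100, link_gbps=10, kill_s=1, boot_s=2))
```

With these made-up inputs the first ordering loses 100 writes with no downtime, and the second loses nothing but is offline for 83 seconds. The downtime term scales with volume size, which is why it is effectively unbounded.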

The canonical worked phrasing (Fly.io, 2024-07-30)

From Making Machines Move:

Copy, boot, kill loses data. Kill, copy, boot takes too long.

Fly's workaround: replace copy with a new async clone primitive so the order becomes kill → clone → boot, where clone returns immediately and data transfer happens in the background on the read path. Interruption becomes "asymptotically as fast as stateless migration."
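
The idea of a clone that "returns immediately" can be illustrated with a toy in-memory model. This is a minimal sketch, not Fly.io's implementation: the `AsyncClone` class, its method names, and the whole-block write assumption are hypothetical, and a Python list stands in for the remote volume on host A.

```python
class AsyncClone:
    """Toy block device on host B, usable before any data has been transferred."""

    def __init__(self, source_blocks):
        self._source = source_blocks                  # stands in for the volume on A
        self._local = [None] * len(source_blocks)     # blocks present on B
        self._hydrated = [False] * len(source_blocks)
        self.remote_fetches = 0

    def _fetch(self, i):
        # Pull one block from the source and mark it as locally available.
        self.remote_fetches += 1
        self._local[i] = self._source[i]
        self._hydrated[i] = True

    def read(self, i):
        # Read path: a block that has not arrived yet is fetched on demand.
        if not self._hydrated[i]:
            self._fetch(i)
        return self._local[i]

    def write(self, i, data):
        # Whole-block write (assumption): the source copy is now stale,
        # so the block counts as hydrated and is never fetched.
        self._local[i] = data
        self._hydrated[i] = True

    def hydrate_step(self):
        # Background path: copy the next missing block; False when none remain.
        for i, done in enumerate(self._hydrated):
            if not done:
                self._fetch(i)
                return True
        return False


clone = AsyncClone([b"a", b"b", b"c"])   # "clone" returns at once, nothing copied yet
clone.write(1, b"B")                     # the new instance can write immediately
first = clone.read(0)                    # served through an on-demand fetch
while clone.hydrate_step():              # background hydration finishes the rest
    pass
```

In this model the destination is available as soon as the object exists, and the cost of the transfer is paid per block, either on first read or by the background loop.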

Workaround space

  • HA on the customer side: if the workload runs redundantly, interruption of a single instance is invisible. "Do this!" But Fly must live in the same world as customers, "many of whom don't run in high-availability configurations."
  • Backup + restore: insufficient. "a 'restore from backup migration' will lose data, and a 'backup and restore' migration incurs untenable downtime." Backup intervals leave a data-loss window; restore interruption is worse than kill → copy → boot.
  • Pre-copy + delta sync + cutover (the VMware-live-migration shape): possible for VM memory; for large on-disk volumes, the delta keeps growing if the workload writes continuously.
  • Async clone: the Fly.io answer in the 2024 post. Decouple destination availability from data transfer.
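
The convergence problem with pre-copy + delta sync can be made concrete with a rough model. This is a sketch under simplifying assumptions that are not in the source: constant copy and dirty rates, no overlap between dirtied blocks, and a hypothetical `precopy_rounds` helper with a cutover threshold below which the final stop-and-copy is considered acceptable.

```python
def precopy_rounds(volume_gb, copy_gb_per_s, dirty_gb_per_s,
                   cutover_gb, max_rounds=20):
    """Return (delta size at the start of each round, whether cutover was reached)."""
    deltas = []
    delta = volume_gb                    # round 0 has to copy the whole volume
    for _ in range(max_rounds):
        deltas.append(delta)
        if delta <= cutover_gb:
            return deltas, True          # small enough to pause and finish
        round_s = delta / copy_gb_per_s  # time spent copying this round
        delta = dirty_gb_per_s * round_s # data dirtied on A in the meantime
    return deltas, False                 # never got small enough


# Writes at a quarter of the copy rate: each round shrinks the delta by 4x.
print(precopy_rounds(100, copy_gb_per_s=1.0, dirty_gb_per_s=0.25, cutover_gb=1.0))

# Writes as fast as the copy: the delta never shrinks.
print(precopy_rounds(100, copy_gb_per_s=1.0, dirty_gb_per_s=1.0, cutover_gb=1.0))
```

In this model each round's delta is the previous one multiplied by (dirty rate / copy rate), so the scheme only converges when the workload dirties data more slowly than the link can move it.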
