CONCEPT Cited by 1 source

Block-level async clone

Block-level async clone is a storage-migration primitive in which a source block device is cloned asynchronously into a destination device. The clone is usable immediately and can attach to a new consumer at once: reads of un-hydrated blocks fall through to the source over the network, writes go only to the clone, and a background thread hydrates the remaining blocks from source to destination independently of user I/O.

The decisive property: availability of the destination is decoupled from completion of the data transfer.

Definition

Given a source block device S and a freshly-allocated destination device D of identical size, an async clone presents D to its consumer immediately with these semantics:

  • read(D, block) where block is hydrated → served from D.
  • read(D, block) where block is not hydrated → fetched from S over the network, returned to the reader, and (optionally) written through to D.
  • write(D, block) → applied to D only; block is marked hydrated and S is never consulted for this block again.
  • A background hydration thread continuously copies the remaining un-hydrated blocks from S to D, independently of user I/O, so that the clone eventually becomes fully hydrated and S can be released.

State is tracked in a metadata bitmap of per-block "is this hydrated?" bits.
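
The semantics above can be sketched as a toy in-memory model. This is only an illustration, not how any real driver is written: the `AsyncClone` class, its method names, and the `remote_reads` counter are hypothetical, blocks are list entries rather than device sectors, and the background thread is reduced to a `hydrate_step` call that copies one block at a time.

```python
class AsyncClone:
    """Toy model of an async clone of source S into destination D."""

    def __init__(self, source, write_through=True):
        self.source = source                    # S: block payloads, never modified
        self.dest = [None] * len(source)        # D: same size, initially empty
        self.hydrated = [False] * len(source)   # metadata bitmap: one flag per block
        self.write_through = write_through      # the "(optionally)" in the read rule
        self.remote_reads = 0                   # how many blocks were fetched from S

    def _fetch(self, block):
        # Stands in for a network read from the source device.
        self.remote_reads += 1
        return self.source[block]

    def read(self, block):
        if self.hydrated[block]:
            return self.dest[block]             # hydrated: served from D
        data = self._fetch(block)               # not hydrated: fall through to S
        if self.write_through:
            self.dest[block] = data
            self.hydrated[block] = True
        return data

    def write(self, block, data):
        # Writes go only to D; S is never consulted for this block again.
        self.dest[block] = data
        self.hydrated[block] = True

    def hydrate_step(self):
        """Copy one un-hydrated block from S to D; return False when none remain."""
        for block, done in enumerate(self.hydrated):
            if not done:
                self.dest[block] = self._fetch(block)
                self.hydrated[block] = True
                return True
        return False

    def fully_hydrated(self):
        return all(self.hydrated)


clone = AsyncClone([b"a", b"b", b"c", b"d"])
clone.write(1, b"B")              # overwritten before hydration reaches it
first = clone.read(0)             # falls through to S, then written through to D
while clone.hydrate_step():       # background hydration, run to completion
    pass
```

In this run block 1 is never fetched from the source, because the write marked it hydrated first; once `fully_hydrated()` is true, the source list could be dropped.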

Canonical kernel-tier implementation: dm-clone

The Linux kernel ships a production-grade implementation of this concept as the dm-clone device-mapper target. See the Fly.io 2024-07-30 post for a walkthrough of the target's map function in the kernel source and a production deployment story.

"dm-clone gives us a new device, of identical size, where reads of uninitialized blocks will pull from the original. It sounds terribly complicated, but it's actually one of the simpler kernel lego bricks."
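
For orientation, here is a small sketch that assembles a device-mapper table line for a clone target, following the `clone <metadata dev> <destination dev> <source dev> <region size> [feature args]` layout described in the kernel's dm-clone documentation. The helper name `dm_clone_table` and the device paths are hypothetical placeholders, and the source does not show the table Fly.io actually uses.

```python
def dm_clone_table(num_sectors, metadata_dev, dest_dev, source_dev,
                   region_sectors=8, features=()):
    """Build a dm-clone table line (sizes are in 512-byte sectors).

    Hypothetical helper for illustration; it only formats a string.
    """
    # The kernel docs require a power-of-two region size between
    # 8 sectors (4 KiB) and 2097152 sectors (1 GiB).
    if not 8 <= region_sectors <= 2097152 or region_sectors & (region_sectors - 1):
        raise ValueError("region size must be a power of two in [8, 2097152] sectors")
    fields = [0, num_sectors, "clone", metadata_dev, dest_dev, source_dev,
              region_sectors]
    if features:
        # Optional feature args are prefixed by their count.
        fields += [len(features), *features]
    return " ".join(str(f) for f in fields)


# Placeholder device paths; a 1 GiB volume is 2097152 sectors.
table = dm_clone_table(2097152, "/dev/vg/meta", "/dev/vg/dest", "/dev/sdx",
                       features=("no_hydration",))
```

A line like this would typically be handed to `dmsetup create <name> --table "<line>"`. The `no_hydration` feature starts the device with background hydration disabled, which leaves only the read-fall-through path active until hydration is enabled later.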

Why it's load-bearing for stateful-workload migration

Without async clone, stateful-workload migration between physical hosts collapses to two bad options (concepts/kill-copy-boot-migration-tradeoff):

  1. copy → boot → kill: copy the volume, boot the new instance, kill the old one. Loses data because the source keeps writing while the copy runs.
  2. kill → copy → boot: kill the source, copy the volume, boot the new instance. Too slow: the workload is down for the whole copy, so interruption time scales with volume size, which hurts at multi-GB volume sizes and especially with many concurrent migrations.

Async clone enables a third option, kill → clone → boot, where the clone step returns immediately and the boot step can happen in parallel with hydration. Interruption time becomes "as fast as stateless migration" (Fly.io); durability is preserved because the source is killed before any new writes happen on the destination.
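
The difference in interruption time can be made concrete with a back-of-the-envelope model. The function names and the figures below (a 100 GiB volume, 1 GiB/s of copy throughput, a 2-second boot) are assumptions chosen for illustration, not measurements from the source, and the model treats clone setup time as negligible.

```python
GIB = 2 ** 30


def kill_copy_boot_downtime(volume_bytes, copy_bytes_per_s, boot_s):
    """Workload is down for the whole copy, then for the boot."""
    return volume_bytes / copy_bytes_per_s + boot_s


def kill_clone_boot_downtime(boot_s):
    """Clone returns immediately (assumed ~0 s); only the boot is on the critical path."""
    return boot_s


slow = kill_copy_boot_downtime(100 * GIB, 1 * GIB, boot_s=2)   # 102.0 seconds
fast = kill_clone_boot_downtime(boot_s=2)                      # 2 seconds
```

Under these assumptions the copy-based path grows linearly with volume size while the clone-based path stays constant. What the model leaves out is that reads of not-yet-hydrated blocks are slower after boot, since they cross the network until hydration finishes.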

Shape at other storage layers

The same architectural move appears at other storage layers:

Key enablers

  • Small metadata bitmap relative to the data volume. A per-block bitmap is orders of magnitude smaller than the data it describes; a metadata device of tens of MB can track hundreds of GB of block state.
  • Network block protocol with adequate fault-tolerance. In Fly.io's case, iSCSI survives network disruption; NBD initially didn't.
  • Filesystem-aware DISCARD support. Sparse volumes — the common case — can short-circuit hydration almost entirely if the filesystem can surface unused blocks. See concepts/trim-discard-integration.
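
Two of these enablers lend themselves to quick arithmetic. The sketch below assumes one hydration bit per 4 KiB block and counts only the raw bitmap, so a real on-disk metadata format would need somewhat more space; the function names are hypothetical. The second function shows why DISCARD helps: blocks the filesystem reports as unused never need to be copied.

```python
GIB = 2 ** 30


def bitmap_bytes(volume_bytes, block_bytes=4096):
    """Raw size of a one-bit-per-block hydration bitmap, rounded up."""
    blocks = -(-volume_bytes // block_bytes)   # ceiling division
    return -(-blocks // 8)                     # eight block flags per byte


def blocks_left_to_copy(hydrated, discarded):
    """Count blocks still needing hydration.

    hydrated: list of per-block booleans; discarded: set of block indexes
    the filesystem has reported as unused.
    """
    return sum(1 for i, done in enumerate(hydrated)
               if not done and i not in discarded)


size = bitmap_bytes(500 * GIB)                 # 16384000 bytes, about 15.6 MiB
remaining = blocks_left_to_copy([False] * 8, discarded={2, 3, 4, 5, 6, 7})
```

With these assumptions a 500 GiB volume needs under 16 MiB of raw bitmap, consistent with the "tens of MB for hundreds of GB" figure above, and a mostly empty volume has only a handful of blocks left to hydrate once discards are taken into account.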

Seen in
