
CONCEPT Cited by 3 sources

Async clone + background hydration

Async clone + background hydration is the repository-materialisation shape where a clone-equivalent operation returns as soon as the file tree and refs are present, then downloads file contents concurrently in the background. Reads of not-yet-hydrated files block until their blob has arrived. Introduced to the wiki by Cloudflare's 2026-04-16 ArtifactFS: "git clone but async".

Problem it solves

Vanilla git clone is synchronous: the clone command blocks until every reachable object (all history + all blobs) is on disk. For small repos that's fine (sub-second), but:

  • Multi-GB repos with long history take minutes. Cloudflare cite a 2.4 GB web-framework repo at "close to 2 minutes" clone time.
  • Agent / sandbox / CI startup latency is directly gated on clone latency.
  • --depth=1 shallow clones help but discard history agents sometimes want, and still bring down every current-commit blob up front.

Any agent harness that clones on each session pays this cost per session; multiplied across millions of sessions, it becomes a material fleet-level cost.
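
The fleet-level cost is simple arithmetic, sketched below. The inputs are illustrative placeholders (roughly 100 s per clone and a hypothetical 10,000 sessions per month), not measurements, and the function name is made up for this example.

```python
def fleet_clone_hours(clone_seconds: float, sessions: int) -> float:
    """Total hours spent blocked on clone across all sessions."""
    return clone_seconds * sessions / 3600.0


# Illustrative inputs only: ~100 s per clone, 10,000 sessions in a month.
hours = fleet_clone_hours(100, 10_000)
print(f"{hours:.1f} sandbox hours per month")  # 277.8 sandbox hours per month
```

The cost scales linearly in both inputs, so cutting per-session clone latency and cutting session count are equally effective levers.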

Mechanism

Two pieces collaborate:

  1. Blobless clone: built on Git's partial-clone machinery (--filter=blob:none). It fetches the file tree (tree objects) and refs and omits blob objects (file contents), so file names and paths are present but file contents are not. Clone time is dominated by protocol overhead and tree size, not blob volume.
  2. Background-hydration daemon: a lightweight process that, after the blobless clone returns, starts fetching individual blobs in priority order. Reads of not-yet-hydrated files are intercepted by the filesystem and block until that file's blob arrives (background fetch is the fast path, on-demand fetch the fallback).
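
A minimal sketch of the first piece using stock Git from Python. How ArtifactFS actually invokes Git is not stated in the source, so treat this as an assumption; the helper names are hypothetical. Note that `--no-checkout` matters here: without it, Git populates the working tree right after cloning and fetches the blobs for HEAD immediately.

```python
import subprocess
from typing import List


def blobless_clone_cmd(url: str, dest: str) -> List[str]:
    """Build a git command that fetches trees and refs but no blobs."""
    return [
        "git", "clone",
        "--filter=blob:none",  # partial clone: omit all blob objects
        "--no-checkout",       # do not populate the working tree (that would fetch blobs)
        url, dest,
    ]


def blobless_clone(url: str, dest: str) -> None:
    """Run the clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(blobless_clone_cmd(url, dest), check=True)
```

After such a clone, `git ls-tree -r --name-only HEAD` can list every path from the tree objects alone, without any file contents being local.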

Together, the agent harness sees a complete-looking directory tree almost immediately: it can enumerate files, grep paths, and read configs that happen to be hydrated already. The "clone is done" boundary shifts from "all blobs local" to "all tree objects and refs local".
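
The interplay between background fetch and blocking reads can be modelled in a few lines. This is a toy in-process sketch, not ArtifactFS's implementation: the class name `HydratingStore`, the `fetch` callback, and the fake blob table are all hypothetical, and a real driver would intercept reads at the filesystem layer rather than through a method call.

```python
import threading
import time
from typing import Callable, Dict, Iterable, List


class HydratingStore:
    """Toy model: a background thread hydrates files; reads block until ready."""

    def __init__(self, paths: Iterable[str], fetch: Callable[[str], bytes]):
        self._fetch = fetch
        self._order: List[str] = list(paths)  # hydration order, already prioritised
        self._data: Dict[str, bytes] = {}
        self._ready = {p: threading.Event() for p in self._order}
        self._locks = {p: threading.Lock() for p in self._order}
        self._worker = threading.Thread(target=self._hydrate_all, daemon=True)

    def start(self) -> None:
        self._worker.start()

    def _hydrate(self, path: str) -> None:
        # Per-file lock: background and on-demand paths never fetch a blob twice.
        with self._locks[path]:
            if not self._ready[path].is_set():
                self._data[path] = self._fetch(path)
                self._ready[path].set()

    def _hydrate_all(self) -> None:
        for path in self._order:  # fast path: walk files in priority order
            self._hydrate(path)

    def listdir(self) -> List[str]:
        return list(self._order)  # the tree is known before any blob arrives

    def read(self, path: str) -> bytes:
        if not self._ready[path].is_set():
            self._hydrate(path)  # fallback: fetch on demand, blocking the reader
        return self._data[path]


# Hypothetical blob source standing in for the remote.
blobs = {"package.json": b"{}", "src/main.py": b"print('hi')", "logo.png": b"\x89PNG"}

def fake_fetch(path: str) -> bytes:
    time.sleep(0.01)  # simulate network latency
    return blobs[path]

store = HydratingStore(blobs, fake_fetch)
store.start()
print(store.listdir())         # paths are visible immediately
print(store.read("logo.png"))  # blocks until this blob is local
```

The per-file lock is the design choice worth noting: a read of a cold file does not wait behind the whole background queue, and whichever side reaches the file first does the fetch while the other simply waits for it.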

Priority ordering

Hydration order is not arbitrary — it's tuned for the typical opening actions of an agent workload. ArtifactFS's order:

  1. Package manifests (package.json, go.mod, pyproject.toml, Cargo.toml, ...).
  2. Configuration files (.yaml, .toml, .json).
  3. Source code.
  4. Binaries, images, executables.

The ordering itself is a specialisation — the generalisation is "any FS driver with background-hydration should let the calling workload hint its access pattern so hot files aren't blocked waiting for cold blobs."
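
One way to express this ordering, including a workload hint, is a tier function plus a stable sort. The sketch below is an assumption about how such a policy could be written, not ArtifactFS's code: the manifest and config sets come from the list above, while the binary extension set, the `hinted` parameter, and the function names are made up for illustration.

```python
import posixpath
from typing import Collection, Iterable, List

MANIFESTS = {"package.json", "go.mod", "pyproject.toml", "Cargo.toml"}
CONFIG_EXTS = {".yaml", ".toml", ".json"}
# Assumed sample of binary/image extensions; the source does not enumerate them.
BINARY_EXTS = {".png", ".jpg", ".gif", ".exe", ".so", ".bin"}


def tier(path: str) -> int:
    """Lower tier hydrates earlier: manifests, configs, source, then binaries."""
    name = posixpath.basename(path)  # Git paths always use forward slashes
    ext = posixpath.splitext(name)[1].lower()
    if name in MANIFESTS:
        return 0
    if ext in CONFIG_EXTS:
        return 1
    if ext in BINARY_EXTS:
        return 3
    return 2  # everything else is treated as source code


def hydration_order(paths: Iterable[str], hinted: Collection[str] = ()) -> List[str]:
    """Sort paths by tier; paths the workload hints as hot go first."""
    # sorted() is stable, so tree order is preserved within each tier.
    return sorted(paths, key=lambda p: -1 if p in hinted else tier(p))


paths = ["assets/logo.png", "src/main.go", "config/app.yaml", "go.mod", "package.json"]
print(hydration_order(paths))
# ['go.mod', 'package.json', 'config/app.yaml', 'src/main.go', 'assets/logo.png']
```

Passing `hinted={"assets/logo.png"}` would move that file to the front, which is the kind of access-pattern hint the generalisation calls for.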

Trade-offs

| Axis | Synchronous clone | Async clone + hydration |
|---|---|---|
| Startup latency | O(repo size) | O(tree size) + protocol RTT |
| Peak bandwidth | Bursty at start | Spread across hydration window |
| Read latency (cold file) | Always local | First read may block on fetch |
| Read latency (hot file) | Always local | Likely local (priority-fetched) |
| Offline work | Full repo available | Only hydrated files readable |
| Sync-back to remote | git push | git push (same; no FS-level sync) |

Named trade-off from the Cloudflare post: "the filesystem does not attempt to 'sync' files back to the remote repository" — edits are pushed via ordinary Git, not via the FS driver. This is a deliberate simplification.

Not new — but freshly mainstreamed

The underlying Git partial-clone machinery is several years old (Git 2.19+, 2018). What ArtifactFS adds is packaging it as an FS driver with agent-aware priority and sandbox startup as the named use case — raising blobless-clone from a power-user flag to a first-class workload primitive. Similar shape appears in git-lfs --smudge=delayed, Facebook's EdenFS, Microsoft's GVFS / VFS-for-Git (discontinued), and various build-farm fetch accelerators; ArtifactFS is the 2026-era agent-sandbox restatement.

Seen in

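  • sources/2026-04-16-cloudflare-artifacts-versioned-storage-that-speaks-git: canonical wiki instance via ArtifactFS. Savings claim: ~90–100 s per 2.4 GB repo × 10 k sandbox jobs/month = ~2,778 sandbox hours/month (illustrative, not measured). As written the figures do not reconcile: 100 s × 10,000 jobs is about 278 hours, so 2,778 hours would correspond to roughly 100 k jobs.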
  • sources/2024-07-30-flyio-making-machines-move: block-level sibling instance at the Linux device-mapper tier (dm-clone). Fly.io's fleet-drain migration for stateful Fly Machines uses the same async-clone-with-background-hydration shape, just on raw block devices rather than Git trees: reads of un-hydrated blocks fall through to the source over iSCSI, writes go to the clone, and kcopyd rehydrates in the background. Cross-tier confirmation that the pattern isn't Git-specific.
  • sources/2026-02-04-flyio-litestream-writable-vfs: SQLite-database-level instance via Litestream VFS hydration mode. Ben Johnson explicitly credits dm-clone as the ancestor: "we shoplifted a trick from systems like dm-clone: background hydration." The VFS serves reads from object storage while a background loop pulls the whole database to a local temp file (via LTX compaction, writing only the latest version of each page); the read path switches over when hydration completes; the file is discarded on VFS exit. Canonical wiki instance of hydration applied at SQLite-database granularity (previous instances were block-level via dm-clone and Git-tree-level via ArtifactFS). Production consumer: the Fly Sprites "block map" (JuiceFS metadata tier on SQLite + Litestream VFS).