
PATTERN Cited by 1 source

Blobless clone + lazy hydrate

Blobless clone + lazy hydrate is the concrete pattern for materialising a Git repository onto an agent-sandbox / CI-worker / container filesystem without blocking on a full-repo download. Perform a blobless clone (--filter=blob:none: fetches commits, trees, and refs; omits file contents) synchronously, then hydrate file contents in the background via a lightweight daemon. Reads on not-yet-hydrated files block transparently until the required blob arrives.

Introduced to the wiki by Cloudflare's 2026-04-16 [[systems/artifact-fs|ArtifactFS]] launch (published alongside [[systems/cloudflare-artifacts|Artifacts]]); see concepts/async-clone-hydration for the concept treatment.
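The synchronous first step can be sketched with stock Git. This is a minimal illustration, not ArtifactFS's implementation: the helper names are hypothetical, and adding --no-checkout is an assumption of the sketch (a plain checkout would immediately fetch every blob the working tree needs, which defeats the point).

```python
import subprocess
from pathlib import Path


def blobless_clone_cmd(remote: str, dest: Path) -> list[str]:
    """Build the argv for a blobless partial clone (hypothetical helper).

    --filter=blob:none asks the server to omit file contents (blobs).
    --no-checkout is an assumption of this sketch: it keeps Git from
    populating the working tree, which would pull blobs right away.
    """
    return [
        "git", "clone",
        "--filter=blob:none",
        "--no-checkout",
        remote,
        str(dest),
    ]


def blobless_clone(remote: str, dest: Path) -> None:
    # Runs synchronously; raises CalledProcessError if git exits non-zero.
    subprocess.run(blobless_clone_cmd(remote, dest), check=True)
```

Hydration of individual blobs would then be driven separately, by whatever daemon or filesystem driver sits on top of the clone.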

Shape

sandbox starts
 blobless clone ── returns when tree + refs are local  ── agent ready
    └── background daemon starts hydrating blobs
          │    priority: manifests → configs → code → binaries
          ┌─────────────────────────────────────────────────┐
          │  filesystem is readable throughout              │
          │  - hot files likely hydrated before first read  │
          │  - reads on cold files block until hydrated     │
          └─────────────────────────────────────────────────┘
          no sync-back: writes go via ordinary git commit + push

Apply when

  • Repo size is multi-GB / history is long — ordinary synchronous clone is tens-of-seconds to minutes.
  • Startup latency directly gates user-visible work — agent sandbox spin-up, CI worker cold-start, container launch.
  • Read-access pattern is skewed — a small set of files (configs, manifests, a few source files) is read first; most blobs are cold.
  • Writes are handled at Git granularity (git commit + git push) rather than through filesystem-level sync-back.

Do not apply when

  • Repo fits in-memory / is small (sub-second clone anyway — overhead not worth it).
  • All files will be read immediately (e.g. a full-repo grep at startup) — blocks on every blob anyway.
  • Writes need sync-back through the filesystem (e.g. collaborative-editing FS) — this pattern deliberately declines sync-back.
  • Client can't / won't run a background daemon.

Design choices

Priority ordering

Not arbitrary. ArtifactFS's ordering reflects what agent workloads open first:

  1. Package manifests — package.json, go.mod, pyproject.toml, Cargo.toml, Cargo.lock, ...
  2. Configuration files — .yaml, .toml, .json, .env, CI configs.
  3. Source code — text files in known extensions.
  4. Binary / non-text — images, executables, large blobs deprioritised.

If the deployment is specialised for a different workload (e.g. ML training), tune the priority; the shape transfers, the specific ordering does not.
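The four tiers can be sketched as a simple classifier over repository paths. This is an illustrative approximation, not ArtifactFS's actual rules: the function names are hypothetical, the set of "known" source extensions is an assumption (the source does not list them), and so is treating .yml like .yaml.

```python
from pathlib import PurePosixPath

# Tier 1: package manifests named in the list above.
MANIFEST_NAMES = {"package.json", "go.mod", "pyproject.toml", "Cargo.toml", "Cargo.lock"}
# Tier 2: configuration suffixes; ".yml" is an assumption of this sketch.
CONFIG_SUFFIXES = {".yaml", ".yml", ".toml", ".json"}
# Tier 3: hypothetical set of source extensions, chosen for illustration only.
SOURCE_SUFFIXES = {".py", ".go", ".rs", ".ts", ".js", ".c", ".h", ".java"}


def hydration_tier(path: str) -> int:
    """Return 1 (hydrate first) through 4 (hydrate last) for a repo path."""
    p = PurePosixPath(path)
    if p.name in MANIFEST_NAMES:
        return 1
    # A file named ".env" has an empty suffix, so it is matched by name.
    if p.suffix in CONFIG_SUFFIXES or p.name == ".env":
        return 2
    if p.suffix in SOURCE_SUFFIXES:
        return 3
    return 4  # binaries and anything unrecognised


def hydration_order(paths):
    # sorted() is stable, so paths within a tier keep their input order.
    return sorted(paths, key=hydration_tier)
```

Retuning for a different workload then amounts to swapping the sets or the tier function while keeping the rest of the shape unchanged.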

Block vs fail on cold-file read

Two possible behaviours when a read hits a not-yet-hydrated file:

  • Block (ArtifactFS's choice): the read waits until the background fetch finishes that file. Simple semantics — looks like a slow disk. Requires a deterministic on-demand-fetch path so the reader can't wait forever.
  • Fail-fast with retry hint: return an EAGAIN or similar so the caller can decide to wait or try something else. More plumbing; less transparent.
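The two behaviours can be contrasted in a small in-process sketch. It models one lazily hydrated file with a threading.Event; the class and method names are hypothetical, and the timeout on the blocking read is this sketch's own guard against waiting forever, not something the source specifies.

```python
import errno
import threading
from typing import Optional


class LazyFile:
    """One file whose contents arrive later from a background hydrator."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self._data = b""

    def hydrate(self, data: bytes) -> None:
        # Called by the background fetcher once the blob has arrived.
        self._data = data
        self._ready.set()

    def read_blocking(self, timeout: Optional[float] = None) -> bytes:
        # "Slow disk" semantics: wait until the blob is present.
        if not self._ready.wait(timeout):
            raise TimeoutError("blob not hydrated in time")
        return self._data

    def read_fail_fast(self) -> bytes:
        # Alternative: surface EAGAIN and let the caller decide what to do.
        if not self._ready.is_set():
            raise BlockingIOError(errno.EAGAIN, "blob not hydrated yet")
        return self._data
```

With the blocking variant the caller needs no special handling, which is what makes it look like a slow disk; the fail-fast variant pushes a retry decision onto every caller.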

No sync-back is a feature

ArtifactFS deliberately does not sync edits back to the remote: "with thousands or millions of objects, that's typically very slow, and since we're speaking git, we don't need to." The agent just runs git push. This pattern is specifically paired with patterns/git-protocol-as-api as the write path.
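Because the write path is plain Git, it can be sketched as three ordinary commands. The helper names, the default remote name, and the choice to stage everything with git add --all are assumptions for illustration; an agent would normally stage selectively.

```python
import subprocess


def commit_and_push_cmds(message: str, remote: str = "origin", ref: str = "HEAD") -> list[list[str]]:
    """Return the git invocations that make up the write path (hypothetical helper)."""
    return [
        ["git", "add", "--all"],            # stage the agent's edits
        ["git", "commit", "-m", message],   # record them locally
        ["git", "push", remote, ref],       # publish via the Git protocol
    ]


def commit_and_push(repo_dir: str, message: str) -> None:
    # Runs each command in the repo; stops at the first non-zero exit.
    for cmd in commit_and_push_cmds(message):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```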

Works with any Git remote

ArtifactFS mounts any Git remote (GitHub, GitLab, Gitea, self-hosted), not just Artifacts. This makes the pattern portable: you don't have to adopt Cloudflare's server to adopt this startup-latency optimisation.

Claimed outcome (Cloudflare 2026-04-16)

  • Baseline: "popular web framework (at 2.4GB and with a long history!) takes close to 2 minutes to clone" via git clone.
  • Goal: "get large repos down to ~10-15 seconds so that our agent can get to work."
  • Scaling claim: "If you can shave ~90-100 seconds off your sandbox startup time for every large repo, and you're running 10,000 of those sandbox jobs per month: that's 2,778 sandbox hours saved."

Caveat: these figures are illustrative targets, not measured production results.
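The scaling claim can be sanity-checked with one line of arithmetic. The function name below is illustrative. Under the quoted inputs (90 to 100 seconds saved per job, 10,000 jobs per month) the saving works out to roughly 250 to 278 hours per month; the quoted figure of 2,778 hours corresponds to 100,000 jobs at 100 seconds each.

```python
def sandbox_hours_saved(seconds_saved_per_job: float, jobs_per_month: int) -> float:
    """Convert a per-job saving in seconds into total hours per month."""
    return seconds_saved_per_job * jobs_per_month / 3600


low = sandbox_hours_saved(90, 10_000)            # 250.0 hours
high = sandbox_hours_saved(100, 10_000)          # about 277.8 hours
ten_times = sandbox_hours_saved(100, 100_000)    # about 2,777.8 hours
```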

Adjacent / ancestor shapes

Partial clone + lazy blob fetch is not new to Git; git 2.19+ (2018) ships the underlying --filter=blob:none machinery, and related shapes include:

  • git-lfs --smudge=delayed (deferred LFS hydration).
  • EdenFS (Meta's scalable source-control filesystem).
  • GVFS / VFS for Git (discontinued Microsoft project).
  • Various build-farm fetch accelerators.

What this pattern newly packages is: (a) an FS-driver boundary that any workload can mount without git-config changes, (b) agent-aware priority ordering so what an agent reads first is what hydrates first, and (c) sandbox startup as the named use case that justifies the engineering.

Adjacent shape at the block layer

The same async-clone-with-background-hydration shape appears at the kernel block-device tier in Fly.io's fleet-drain migration for stateful Fly Machines: concepts/block-level-async-clone + patterns/async-block-clone-for-stateful-migration. Reads of un-hydrated blocks fall through to the source over iSCSI, writes go to the clone, and kcopyd rehydrates in the background. Differences: block-device rather than Git tree; network block protocol rather than Git fetch; no filesystem-layer priority ordering (though TRIM / DISCARD short-circuits hydration of unused blocks). Cross-tier confirmation that the pattern generalises.

Seen in
