
CONCEPT Cited by 3 sources

Git delta compression

Git delta compression is how Git makes its pack files compact: rather than storing a full compressed copy of every version of every file, Git represents most objects as deltas (diffs) against nearby base objects.

The pairing heuristic

For delta compression to work, Git has to choose candidate pairs to compare. Naïvely comparing every object against every other would be O(N²). So Git uses a heuristic: sort candidate objects and then compare each object only against a window of nearby objects in that sort order.

The default sort heuristic keys on the last 16 characters of the file path (via a rolling hash in which each new character gradually shifts earlier ones out of the 32-bit value). Files whose trailing 16 characters match are assumed likely to contain related content, so they end up near each other in the sort order and get paired for delta compression.

This works well in most codebases: sibling files like src/auth/login.ts and src/auth/logout.ts differ only within the trailing window and share its suffix, so they sort near each other; and successive versions of the same file share the entire path, so they always pair.
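The heuristic can be sketched in Python. This is modelled on `pack_name_hash` from Git's pack-objects machinery (the shift constants follow the upstream C function, but treat this as an illustration, not the canonical implementation):

```python
def name_hash(path: str) -> int:
    """32-bit rolling hash in which each new character pushes earlier
    contributions two bits toward the low end.  A character is fully
    shifted out after 16 later characters, so only the trailing ~16
    characters of the path influence the result -- and the very last
    characters dominate the high bits, which drive the sort order."""
    h = 0
    for c in path:
        if c.isspace():                 # Git skips whitespace here
            continue
        h = ((h >> 2) + ((ord(c) & 0xFF) << 24)) & 0xFFFFFFFF
    return h

# Sibling files sharing a suffix get nearby (not equal) hash values:
login = name_hash("src/auth/login.ts")
logout = name_hash("src/auth/logout.ts")
assert login != logout                # they differ inside the window...
assert login >> 24 == logout >> 24    # ...but share the dominant high byte
```

Identical paths hash identically, which is why successive revisions of one file always land in the same neighborhood of the sort; shared suffixes only make adjacency likely, not guaranteed.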

The failure mode

The heuristic fails when the distinguishing portion of the path falls before the 16-character window. Files that match in their last 16 characters — but actually belong to different logical groups — get paired, and the best delta Git can find inside the window is poor. It still gets stored (it beats a full compressed copy), but it is far larger than a delta against the right base would be, inflating the pack file.

Canonical instance — Dropbox's i18n layout:

i18n/metaserver/[language]/LC_MESSAGES/[filename].po

The language code (en, es, ja, ...) sits before the LC_MESSAGES/[filename].po suffix. Different languages' .po files therefore share their trailing 16 characters, so Git happily pairs them, producing cross-language deltas instead of in-language deltas. A routine translation update for one language then generated a disproportionately large pack contribution, and the repo grew at 20–60 MB/day (spikes above 150 MB/day), on track to hit GHEC's 100 GB repo limit in months. The root cause was not the committed payload but a structural mismatch between the directory layout and the heuristic. See sources/2026-03-25-dropbox-reducing-monorepo-size-developer-velocity.
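The collision is checkable with a sketch of the name hash (modelled on Git's `pack_name_hash`; "app.po" stands in for [filename] and is an illustrative name, not from the source):

```python
def name_hash(path: str) -> int:
    # Rolling hash: each later character shifts earlier ones out, so
    # only roughly the trailing 16 characters decide the final value.
    h = 0
    for c in path:
        if not c.isspace():
            h = ((h >> 2) + ((ord(c) & 0xFF) << 24)) & 0xFFFFFFFF
    return h

# Hypothetical concrete instances of the i18n layout above:
en = name_hash("i18n/metaserver/en/LC_MESSAGES/app.po")
ja = name_hash("i18n/metaserver/ja/LC_MESSAGES/app.po")

# "/LC_MESSAGES/app.po" alone is 19 characters, so the language code
# sits entirely outside the window: the hashes collide, and Git windows
# English and Japanese catalogs together as delta candidates.
assert en == ja
```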

Fixes

Three shapes of fix are available:

  • --path-walk (experimental). Walk the full directory structure to select delta candidates instead of using the 16-char heuristic. Effective — Dropbox measured 84 GB → 20 GB locally — but incompatible with GitHub's server-side optimizations (bitmaps, delta islands) that keep clone/fetch fast. Not a production-viable option on GitHub.
  • Tuned server-side repack with aggressive --window=250 --depth=250. Keeps the default pairing heuristic — so it stays compatible with all server-side infrastructure — but widens the candidate-object search and allows deeper delta chains, which tends to find good deltas even when the initial sort is suboptimal. Slower (~9h for the Dropbox monorepo) but platform-compatible. This was the production fix.
  • Directory-layout reshape. The long-term structural fix: move distinguishing path components (e.g. language code) inside the 16-char trailing window so the default heuristic does the right thing on its own. Dropbox reshaped its i18n workflow after the repack as the root-cause remediation.
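Concretely, the two repack-side options look roughly like this (illustrative invocations: the flag names come from git-repack/git-pack-objects, but the exact Dropbox commands aren't given in the source, and --path-walk availability depends on Git version):

```shell
# Experimental: choose delta bases by walking the directory structure
# instead of the name-hash sort.  Incompatible with bitmap/delta-island
# serving, so not viable on GitHub's servers.
git repack -a -d -f --path-walk

# Production-shaped fix: keep the default pairing heuristic but widen
# the candidate window and allow deeper delta chains
# (pack-objects defaults are --window=10 --depth=50).
git repack -a -d -f --window=250 --depth=250
```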

Generalization

Heuristics embed assumptions about how you'll use a tool. When your usage pattern diverges from those assumptions, performance can degrade quietly. The Dropbox retrospective frames this as the broader lesson: "Tools embed assumptions. When your usage patterns diverge from those assumptions, performance can degrade quietly over time … Understanding those internal mechanics was what allowed us to diagnose the issue correctly." The same principle surfaces across the wiki whenever a default turns out to be load-bearing — see e.g. concepts/postgres-mvcc-hot-updates for a Postgres analogue (default HOT-update eligibility interacting badly with ON CONFLICT DO UPDATE's lock-before-predicate semantics).

Seen in

  • sources/2026-03-25-dropbox-reducing-monorepo-size-developer-velocity — Dropbox traces 87 GB monorepo growth to this heuristic; fixes via tuned server-side repack + i18n layout reshape.
  • sources/2026-04-16-cloudflare-artifacts-versioned-storage-that-speaks-git — Cloudflare reimplements Git delta encoding/decoding from scratch in pure Zig compiled to Wasm as part of the ~100 KB Wasm Git server powering Artifacts. Artifacts takes a different stance on the storage/recomputation boundary than vanilla Git: "We avoid calculating our own git deltas, instead, the raw deltas and base hashes are persisted alongside the resolved object." — trading storage footprint for lower runtime CPU/memory in the ~128 MB Durable Object envelope.
  • sources/2026-04-17-cloudflare-shared-dictionaries-compression-that-keeps-up-with-the-agent — sibling-at-HTTP-layer delta-compression mechanism: HTTP dcb/dcz + shared-dictionary compression. Same "compress-against-similar-content" idea as Git's pack layer, but: (a) pairing policy simplified — previous-version-of-this-URL is always the candidate dictionary; no sort-and-window-by-path-trailing-16-chars heuristic, no quiet-degradation failure mode (Git-style pairing mismatch doesn't exist because there's no pairing at all, just one candidate); (b) applied at HTTP layer — Cloudflare's edge serves Content-Encoding: dcz responses with the diff against a client-cached dictionary, vs Git's on-disk pack-file deltas; (c) different storage/recomputation trade-off — HTTP shared dictionaries keep the dictionary bytes at the edge + in the client cache, not encoded into a pack-file format. Both mechanisms share the broader lesson that delta-encoding against the right reference object is 10-100× better than stateless compression.