Dropbox — Improving storage efficiency in Magic Pocket, our immutable blob store¶
Summary¶
Dropbox's Magic Pocket storage team hit an overhead spike after a new erasure-coded write path ("the Live Coder service") rolled out across regions — in the worst cases <5% of a volume's allocated capacity contained live data. Because volumes in this immutable blob store are fixed-size and never reopened once closed, each under-filled volume burned a full disk allocation. The existing baseline compaction strategy (L1 — a host-plus-donor packing scheme that tops off already-dense volumes) continued to make progress but was architecturally unable to clear a long tail of severely under-filled volumes at exabyte scale. The fix was three compaction strategies running concurrently over disjoint fill-level ranges — L1 for steady state (mostly-full donors top off a host), L2 for the middle (dynamic-programming-packed combinations of moderately under-filled volumes into one near-full destination; 30–50% lower compaction overhead vs. L1-only over a week, 2–3× faster overhead reduction than L1), and L3 for the sparsest tail (continuously streams live blobs from near-empty volumes into the Live Coder as a re-encoding pipeline; immediate reclaim once drained) — plus a dynamic control loop on the host eligibility threshold driven by fleet overhead signals, replacing static manual tuning. Rollout included rate limits per strategy, per-cell traffic locality (no cross-DC compaction), per-strategy candidate ordering, and explicit metadata-pressure throttling, because L3's new-volume writes rewrite every blob (full location-entry updates on Panda, the metadata stack) versus L1/L2's donor-only rewrites. Overhead was driven below the pre-incident baseline. Canonical statement of distribution-shift-aware compaction: steady-state assumptions break when the fill distribution moves, and the fix is multiple strategies over disjoint segments of the distribution rather than a single heuristic.
Key takeaways¶
- A compaction strategy embeds an assumption about the fill-level distribution. L1 is a host-plus-donor packing scheme that assumes most volumes are already close to full — reads tens of GiB across one host + several donors but on average "fewer than one full volume is reclaimed per run, since only donors are fully drained." When the Live Coder rollout shifted the distribution so that a large population of volumes held <50% live data, L1's progress rate became the binding constraint on overhead recovery. (Source: sources/2026-04-02-dropbox-magic-pocket-storage-efficiency-compaction)
- Under-filled volume = full disk allocation. Because Magic Pocket volumes are fixed size and never reopened once closed (concepts/immutable-object-storage), an under-filled volume is indistinguishable from a full one in raw-capacity terms. "If a volume is half full of live data, we are effectively using twice the storage needed for that data. If only ten percent is live, we are using about ten times the space required." This is the direct translation from fragmentation to fleet cost at exabyte scale.
- Garbage collection ≠ reclamation. GC marks deleted blobs but doesn't free disk; reclamation is compaction's job — re-gathering live blobs into new volumes and retiring the old ones. Dropbox processes "millions of deletes each day" and cannot shrink footprint unless compaction keeps up. concepts/garbage-collection and compaction are separate stages of the same reclamation pipeline.
- L2 is a bounded packing problem solved by dynamic programming. For each run: scale live-byte counts by a granularity factor (shrinks the DP table), DP over (volume-index, count, capacity) tracking max packed bytes, keep a parallel choice table for reconstruction, backtrack from the best (k, capacity) cell. Two production tuning knobs: max source volumes per run (caps combinatorial search) and granularity (coarsens live-byte counts for the DP). Result: near-full destination volumes from the under-filled middle of the distribution, compaction overhead dropped 2–3× faster than with L1 alone, and 30–50% lower compaction overhead over a week in L2-enabled cells.
- L3 repurposes the Live Coder as a streaming re-encoding pipeline. Compaction in an erasure-coded store is structurally a re-encoding operation: read live blobs, write new encoded volumes. L3 feeds the sparsest volumes' remaining live blobs into the same Live Coder that caused the incident, lets it accumulate and encode over time, and reclaims each source volume immediately after it is drained. Per-volume rewrite cost is low (sparse volumes have little live data), but every blob gets a new volume identity → new location entry in the Panda metadata stack, so L3's metadata pressure is structurally higher than L1/L2's.
- Each strategy targets a disjoint range of the fill-level distribution. L1 = high-fill donors (steady state), L2 = mid-range under-filled (DP packs into near-full destinations), L3 = sparse tail (streams through re-encoder). Explicit eligibility boundaries between strategies + per-strategy candidate ordering prevent cross-strategy interference while running concurrently. "Splitting the problem across multiple strategies gave us coverage across the full range of volume fill levels" — canonical multi-strategy compaction statement.
- Manual tuning of the host eligibility threshold doesn't scale. The threshold determines "when a volume qualifies for compaction" — too high, too few volumes eligible, overhead rises; too low, compute + I/O spent reclaiming little space. Replaced with a dynamic control loop driven by fleet overhead signals: rising overhead ⇒ raise threshold (prioritize high-yield runs); stabilizing overhead ⇒ lower threshold (stay responsive to deletes without over-compacting). Same reasoning as Robinhood's PID over static load-balancing weights — the named-scarce-resource here is compaction throughput against moving overhead.
- Metadata capacity is the binding downstream constraint at exabyte scale. In L1/L2, "many blobs can stay under the same volume identity, so only donor blobs need location rewrites." In L3, "blobs are written into brand-new volumes, so most blobs need new location entries." The consequence is the compaction strategy mix must be tuned with the metadata system's write budget in mind — not just with storage/compute. Routing the sparsest volumes through L3, rate-limiting each strategy, and keeping compaction traffic local to each cell (no cross-cluster bandwidth) are the three controls Dropbox used.
- Distribution-shift monitoring is the new feedback loop. The incident went unnoticed for weeks; the response includes metrics for how much data Live Coder is producing, how full volumes are across the fleet, and how storage overhead changes week over week — "the goal is to catch shifts in how data is distributed before overhead rises too far so that we can respond proactively instead of scrambling to recover later." The failure mode that forced the redesign is itself now a first-class observability signal.
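The fill-level-to-overhead translation in the takeaways above reduces to a single ratio. A minimal illustration (the volume size here is arbitrary, chosen for the example):

```python
def storage_overhead(raw_bytes, live_bytes):
    """Raw capacity consumed per byte of live data; 1.0 is ideal."""
    return raw_bytes / live_bytes

# A fixed-size, never-reopened volume consumes its full allocation
# regardless of how much live data it still holds.
VOLUME_BYTES = float(1 << 30)  # illustrative 1 GiB volume

half_full = storage_overhead(VOLUME_BYTES, VOLUME_BYTES / 2)   # 2.0: twice the storage needed
ten_pct = storage_overhead(VOLUME_BYTES, VOLUME_BYTES / 10)    # ~10.0: ten times the space
```

This is the arithmetic behind the quoted "half full ⇒ twice the storage / ten percent live ⇒ about ten times the space" claim.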
Architecture¶
The incident¶
- Live Coder service rolled out gradually over several months to new regions, writing data directly into erasure-coded volumes and bypassing the initial replicated write path. Background-write write-amplification dropped (the originally-intended win).
- Side effect: volumes created through the Live Coder path were severely under-filled — worst cases <5% live data in a full-sized allocation.
- Blast radius: fragmentation spike → storage overhead rose (concepts/storage-overhead-fragmentation) → effective replication factor rose as an early signal (more raw bytes consumed per live byte than expected).
- Why L1 couldn't recover it: L1 is a packing-problem solver that picks one host (already high-fill) and donor volumes fitting into its free space. One run consumes tens of GiB of I/O and on average reclaims <1 full volume per run — only donors are fully drained, the host is just topped off. Against a long tail of very sparse volumes this is arithmetically too slow.
The three-strategy stack¶
| Strategy | Target fill range | Mechanism | Destination | Metadata cost per reclaimed volume |
|---|---|---|---|---|
| L1 | High-fill donors | Topping-off packing (one host + donor set) | Reused host volume | Low (only donor blobs rewritten) |
| L2 | Middle (moderately under-filled) | Dynamic-programming multi-volume packing under capacity cap | New near-full destination | Low (donor-class — same volume-identity shape as L1) |
| L3 | Sparse tail (e.g. <10% live) | Streaming feed into Live Coder; continuous re-encoding into new volumes | New volumes produced by Live Coder | High — every blob gets new volume identity + new location entry |
Concurrency: all three strategies run simultaneously over disjoint eligibility ranges + rate-limited per-path + traffic kept local per cell to bound cross-cluster bandwidth.
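The disjoint-range routing in the table can be sketched as a simple dispatcher. The cutoffs below are assumptions for illustration — the post only gives the <10% example for the sparse tail and does not publish the L1/L2 boundary:

```python
# Hypothetical eligibility boundaries (the post does not disclose them).
L3_MAX_FILL = 0.10   # assumed cutoff for the sparse tail ("e.g. <10% live")
L2_MAX_FILL = 0.50   # assumed upper bound for "moderately under-filled"

def route_volume(live_bytes, capacity_bytes):
    """Route a closed volume to exactly one compaction strategy by fill
    level; because the ranges are disjoint, the concurrently running
    strategies never compete for the same candidate volumes."""
    fill = live_bytes / capacity_bytes
    if fill < L3_MAX_FILL:
        return "L3"   # stream live blobs through the Live Coder re-encoder
    if fill < L2_MAX_FILL:
        return "L2"   # DP-pack with other mid-range volumes
    return "L1"       # host-plus-donor topping-off in steady state
```

Each strategy then applies its own candidate ordering and rate limit within its range, which is what keeps the three paths from interfering.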
L2's DP packing (pseudocode from the post)¶
Inputs:
volumes[] with LiveBytes
maxVolBytes (destination volume capacity)
maxVolumesToUse (count cap)
granularity (scaling factor)
1) Scale live bytes and capacity by granularity to shrink the DP table.
2) DP over (i = volume index, k = count, c = capacity), keeping max packed bytes.
3) Track choices in a parallel "choice" table for reconstruction.
4) Backtrack from the best (k, capacity) to recover the selected volumes.
- Granularity: coarsens the capacity axis so the DP table fits in practical memory at production scale.
- Max volumes per run: caps combinatorial blow-up in the (i, k) axis and keeps per-run I/O bounded.
- Planner concurrency: tuned alongside the other two to balance packing quality vs. compute + memory.
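The four pseudocode steps above amount to a bounded 0/1-knapsack DP. A minimal sketch — the function shape, names, and table layout here are assumptions, not Dropbox's production code:

```python
def plan_l2_run(live_bytes, max_vol_bytes, max_volumes, granularity):
    """Select up to max_volumes source volumes whose live bytes pack one
    destination volume as full as possible (hypothetical sketch)."""
    # 1) Scale live bytes and capacity by granularity to shrink the table.
    sizes = [b // granularity for b in live_bytes]
    cap = max_vol_bytes // granularity
    n, NEG = len(sizes), float("-inf")

    # best[i][k][c]: max scaled bytes using k of the first i volumes
    # within capacity c.  choice[i][k][c]: volume i-1 taken at that state.
    best = [[[NEG] * (cap + 1) for _ in range(max_volumes + 1)]
            for _ in range(n + 1)]
    choice = [[[False] * (cap + 1) for _ in range(max_volumes + 1)]
              for _ in range(n + 1)]
    for c in range(cap + 1):
        best[0][0][c] = 0

    # 2) DP over (volume index, count, capacity), tracking max packed bytes.
    for i in range(1, n + 1):
        s = sizes[i - 1]
        for k in range(max_volumes + 1):
            for c in range(cap + 1):
                best[i][k][c] = best[i - 1][k][c]           # skip volume i-1
                if k and c >= s and best[i - 1][k - 1][c - s] != NEG:
                    take = best[i - 1][k - 1][c - s] + s    # take volume i-1
                    if take > best[i][k][c]:
                        best[i][k][c] = take
                        # 3) Parallel choice table for reconstruction.
                        choice[i][k][c] = True

    # 4) Backtrack from the best (k, capacity) cell to the selected set.
    k_best = max(range(max_volumes + 1), key=lambda k: best[n][k][cap])
    selected, k, c = [], k_best, cap
    for i in range(n, 0, -1):
        if choice[i][k][c]:
            selected.append(i - 1)
            c -= sizes[i - 1]
            k -= 1
    return sorted(selected), best[n][k_best][cap] * granularity
```

Granularity trades packing precision for table size: the DP is O(n × max_volumes × cap), so coarsening the capacity axis is what keeps the run affordable at production scale.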
L3's streaming shape¶
- Not a bounded-batch DP like L2 — an open stream feeding the Live Coder continuously with live blobs drained from severely under-filled volumes.
- Live Coder accumulates and encodes into new volumes over time.
- Each source volume reclaimed immediately once drained — the dominant reclaim unit, because sparse volumes by definition contain relatively little data to rewrite per volume recovered.
- The post explicitly notes: L3 is, structurally, constrained re-encoding. "Take live data from one set of volumes and produce a new, durable volume."
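The open-stream shape above can be sketched as a drain loop. The volume record shape and the `live_coder_write` / `reclaim` callbacks are hypothetical stand-ins for the real service interfaces:

```python
def drain_sparse_volumes(volumes, live_coder_write, reclaim):
    """Continuously feed live blobs from the sparsest volumes into the
    re-encoding pipeline; reclaim each source volume the moment it is
    drained.  Callbacks are hypothetical, not Dropbox's actual API."""
    reclaimed = []
    # Sparsest-first ordering minimizes rewrite bytes per reclaimed volume.
    for vol in sorted(volumes, key=lambda v: len(v["live_blobs"])):
        for blob in vol["live_blobs"]:
            # Every blob lands in a brand-new Live-Coder-produced volume,
            # so each write implies a new location entry in the metadata
            # stack -- the source of L3's higher metadata pressure.
            live_coder_write(blob)
        reclaim(vol["id"])          # immediate reclaim once drained
        reclaimed.append(vol["id"])
    return reclaimed
```

Unlike L2 there is no per-run packing decision: the Live Coder accumulates the stream and emits near-full volumes on its own schedule.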
Safeguards¶
- Host eligibility threshold → dynamic control loop instead of static knob (patterns/dynamic-control-loop-tuning).
- Candidate ordering per strategy: L1 conservative (keep placement risk + metadata load low), L2 aggressive (denser packings → more reclaim per run), L3 sparsest-first (minimize rewrite per reclaimed volume).
- Rate limits per strategy path (storage + compute + metadata + network).
- Cell-local traffic (no cross-DC compaction bandwidth).
- Metadata-aware throttling — L3 routed for the sparsest tail specifically because its metadata cost is lower per reclaimed volume (few blobs per reclaimed volume), offsetting the higher cost per blob.
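The post describes only the inputs (fleet overhead) and the directionality of the host-eligibility adjustment, not the algorithm. One tick of such a loop might look like this — the step size, bounds, and simple trend rule are assumptions:

```python
def adjust_threshold(threshold, overhead, prev_overhead,
                     step=0.02, lo=0.05, hi=0.95):
    """One tick of a feedback loop on the host eligibility threshold.
    step/lo/hi and the update rule are assumed, not from the post."""
    if overhead > prev_overhead:
        # Overhead rising: raise the threshold so compaction spends its
        # budget on the highest-yield runs first.
        return min(hi, threshold + step)
    # Overhead stabilizing or falling: lower the threshold so the fleet
    # stays responsive to deletes without over-compacting.
    return max(lo, threshold - step)
```

Whatever the production rule is (PID, smoothed gradient, or a plain threshold), the point is the same as Robinhood's PID over static weights: the knob tracks a fleet-level signal instead of being hand-tuned.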
Operational numbers¶
- L2 in production (vs L1-only cells): 30–50% lower compaction overhead over a week.
- L2 overhead-reduction speed: 2–3× faster than L1.
- Worst-case volume fill observed in the incident: <5% live data per volume allocation.
- Final overhead: driven below the pre-incident baseline (exact number not disclosed).
Concepts, patterns, systems introduced / extended¶
Introduced¶
- concepts/storage-overhead-fragmentation — the fill-level-to- overhead translation: storage overhead = raw capacity / live bytes; fragmentation degrades that ratio; immutable-store deletes accumulate waste that only compaction can reclaim.
- concepts/write-amplification — named in the post as the original Live Coder target ("reduced write amplification for background writes, so each write triggered fewer backend storage operations").
- patterns/multi-strategy-compaction — the load-bearing architectural move: L1 + L2 + L3 run concurrently over disjoint eligibility ranges; each embeds a different distributional assumption; explicit eligibility boundaries prevent interference.
- patterns/dynamic-control-loop-tuning — replace a static per-strategy tuning knob (host eligibility threshold) with a feedback loop on fleet-level signals (overhead). Same class of primitive as Robinhood's PID over endpoint weights.
- patterns/streaming-re-encoding-reclamation — reuse an erasure-coder as a streaming reclamation pipeline: pull live blobs from near-empty volumes, feed continuously, new volumes emitted durably; trades low per-volume rewrite for high metadata cost per blob.
- systems/live-coder — Dropbox's on-the-fly erasure-coding service; originally for background writes, now doubles as the L3 reclamation engine.
- systems/panda-metadata — Dropbox's petabyte-scale transactional key-value metadata stack (stub; linked from the post).
Extended¶
- systems/magic-pocket — compaction architecture, the distribution-shift incident, multi-strategy compaction as steady- state operational discipline at exabyte scale.
- concepts/erasure-coding — EC as not just durability but as a re-encoding substrate usable in the reclamation path (compaction is, structurally, a constrained re-encode).
- concepts/immutable-object-storage — the reclaim-by-rewrite consequence at the block-store layer: "volumes cannot be modified once closed" → compaction (gather live blobs, write new volumes, retire old) is the only freeing mechanism. Block-store sibling of S3-Files' stage-and-commit — same root property (objects/volumes are immutable once written), different mutation-above-it strategy.
- concepts/lsm-compaction — comparison point: LSM compaction merges sorted runs by key range (size-tiered / leveled / hybrid); Magic Pocket compaction packs opaque blob volumes by free-capacity fit (host-plus-donor / DP packing / streaming re-encode). Same "reclaim waste from an append-only substrate" category, different data-model shape → different algorithms.
- concepts/garbage-collection — GC marks deleted blobs; compaction physically frees space; explicit separation emphasized in the post.
- companies/dropbox — adds Magic Pocket's storage-efficiency engineering discipline as a first-class sibling of Nucleus / Robinhood / 7th-gen hardware / Dash.
Caveats¶
- No percentage disclosed for the overhead delta relative to the pre-incident baseline — just "below the previous baseline" and the week-level 30–50% / 2–3× numbers relative to L1-only cells.
- No raw volume size, cell size, or absolute TB/PB recovered disclosed.
- L3's metadata throughput ceiling named but not quantified ("limits are in place to prevent overwhelming those systems").
- DP-packing granularity, max-volumes-per-run, and planner concurrency are named as production tuning knobs but concrete values not published.
- Dynamic-control-loop algorithm (PID? simple threshold? smoothed gradient?) is not specified; only the inputs (overhead) and the directionality of the adjustment are described.
- No discussion of failure modes inside the new compaction paths themselves — partial writes, Live-Coder-stall recovery, L3 backpressure to writers.
- Monitoring additions named (Live Coder production rate, fleet fill distribution, week-over-week overhead) but no concrete alarm thresholds or SLO numbers.
Raw / primary source¶
- Raw file:
raw/dropbox/2026-04-02-improving-storage-efficiency-in-magic-pocket-our-immutable-b-4aa5378c.md
- Original URL: https://dropbox.tech/infrastructure/improving-storage-efficiency-in-magic-pocket-our-immutable-blob-store
- Acknowledgments in the post: Tommy Dean (L2 strategy) and Lisa Kosiachenko (automation contributions).