CONCEPT Cited by 4 sources

LSM compaction (size-tiered, leveled, hybrid)

Log-Structured Merge (LSM) is the standard storage organisation for write-heavy systems that need to serve reasonably fast point / range / scan queries over the same data. Writes go to small sorted runs on the ingest edge; a compaction process asynchronously merges runs into larger, fewer, more-organised runs over time.

Two classical compaction strategies — and the hybrid Husky uses:

Size-tiered compaction

  • Runs are grouped into size classes (exponentially growing: S, 2S, 4S, …).
  • When enough runs accumulate in one class (typically a handful), they're k-way merged into a single run in the next class up.
  • Optimises write amplification (every byte is rewritten only ~log_T(N) times, where T is the merge fan-out per size class).
  • Leaves the read side noisier: at any moment, each size class has a few overlapping runs, so point/range reads may have to check multiple runs.
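
A minimal sketch of the mechanics above, under assumed knob values (fan-out of 4, tiny base class size) — runs as key-sorted lists of `(key, value)` pairs, with a cascading merge when a size class fills:

```python
import heapq

# Sketch of size-tiered compaction (illustrative, not any system's real code):
# runs are key-sorted lists of (key, value) pairs; a size class merges once it
# holds FANOUT runs, and the merged run is promoted to a larger class.
FANOUT = 4         # runs per size class before a merge fires (assumed)
BASE_RUN_SIZE = 4  # class 0 holds runs of up to this many entries (assumed)

def size_class(run_len):
    """Class c holds runs of up to BASE_RUN_SIZE * 2**c entries."""
    c = 0
    while run_len > BASE_RUN_SIZE << c:
        c += 1
    return c

def kway_merge(runs):
    """k-way merge sorted runs; pass runs oldest-first so newest values win."""
    merged = {}
    for key, value in heapq.merge(*runs, key=lambda kv: kv[0]):
        merged[key] = value  # heapq.merge is stable, so later (newer) runs overwrite
    return sorted(merged.items())

def ingest(classes, run):
    """Add a run to its size class, cascading merges up through the classes."""
    c = size_class(len(run))
    classes.setdefault(c, []).append(run)
    while len(classes.get(c, ())) >= FANOUT:
        merged = kway_merge(classes.pop(c))
        c = size_class(len(merged))
        classes.setdefault(c, []).append(merged)
    return classes
```

The "noisier read side" is visible here: between merges, a class holds up to FANOUT overlapping runs, each of which a point read must consult.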

Leveled compaction (Husky's "locality compaction")

  • Runs live in discrete levels L0..Ln.
  • Within a level (except L0), runs are maintained disjoint w.r.t. key range — at most one run covers any given key.
  • Each level has a byte/row limit; when exceeded, some runs are merged down into the next level, k-way-merging with the overlapping runs there to preserve disjointness.
  • Level sizes grow exponentially, so higher levels hold narrower per-run key ranges — a boon for range-pruning on reads.
  • Cost is higher write amplification than size-tiered.
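
The merge-down step can be sketched as follows — runs reduced to sorted key lists, a per-level run limit standing in for the byte/row limit, and a deliberately naive victim choice:

```python
# Sketch of leveled merge-down (illustrative): each run is a sorted list of
# keys, runs within a level stay disjoint by key range, and an over-full level
# pushes one run into the next level, k-way merging with whatever it overlaps.
LEVEL_LIMIT = 4  # max runs per level before spilling down (assumed knob)

def push_down(levels, li):
    """Restore the per-level run limit by merging runs into level li + 1."""
    while len(levels.get(li, ())) > LEVEL_LIMIT:
        victim = levels[li].pop(0)  # simplistic victim choice: oldest run
        nxt = levels.setdefault(li + 1, [])
        lo, hi = victim[0], victim[-1]
        touching = [r for r in nxt if r[0] <= hi and lo <= r[-1]]
        for r in touching:
            nxt.remove(r)
        nxt.append(sorted(set(victim).union(*touching)))
        nxt.sort()            # every overlap was merged, so the level is disjoint again
        push_down(levels, li + 1)  # the merge may overflow the next level
    return levels
```

Because each level absorbs the victim by rewriting all overlapping runs below it, the same byte is copied once per level it descends — the higher write amplification the bullet above notes.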

Hybrid (Husky's shape)

Husky applies size-tiered compaction up to a fixed fragment size (~1M rows), then switches to leveled/locality compaction — a choice made for query-execution performance, not for write-side efficiency. "A hybrid approach: size-tiering the fragments until they are of a certain size, and then locality-compacting them in a 'best-effort' LSM manner to optimize for query execution." (Source: sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale)

Husky explicitly tunes its locality layer to use no more CPU than size-tiering alone — and still reports a 30% reduction in the query-worker fleet from the pruning improvement, because leveled fragments' narrower key ranges combine with concepts/fragment-pruning to skip far more fragments per query.
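
The routing decision itself is simple; a sketch under the ~1M-row threshold from the source (function names are illustrative, not Husky's):

```python
# Sketch of the hybrid routing decision (illustrative): fragments below the
# target size stay in the size-tiering layer; once they reach ~1M rows they
# graduate to the locality (leveled) layer.
TARGET_ROWS = 1_000_000  # the fixed fragment size the source cites (~1M rows)

def compaction_layer(fragment_rows):
    return "size-tiered" if fragment_rows < TARGET_ROWS else "locality"

def route(fragment_row_counts):
    """Split a batch of fragment row counts between the two layers."""
    tiering = [f for f in fragment_row_counts if compaction_layer(f) == "size-tiered"]
    locality = [f for f in fragment_row_counts if compaction_layer(f) == "locality"]
    return tiering, locality
```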

Lazy compaction ("efficient enough" gating)

A subtle but important design choice in Husky's variant: compaction is deferred rather than eager. "Compactions of recent files are done in a lazy way: they are delayed until the resulting compactions are deemed 'efficient enough' … wait until a few dozen fragments are available to merge together." The quoted payoff is an order-of-magnitude reduction in CPU + object-store API cost vs. naïve "compact as soon as a new fragment arrives".
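
A sketch of such a gate. The batch threshold follows the source's "few dozen fragments"; the staleness fallback is an assumed safety valve the source does not specify:

```python
# Sketch of the lazy-compaction gate (illustrative): a merge fires only once
# enough fragments are pending that per-fragment CPU and object-store API
# overhead is well amortised.
MIN_BATCH = 24          # "a few dozen fragments" (assumed exact value)
MAX_WAIT_SECONDS = 900  # assumed staleness bound, not from the source

def efficient_enough(pending, oldest_age_seconds):
    if len(pending) >= MIN_BATCH:
        return True  # big enough batch: merge cost is well amortised
    # assumed fallback: don't let a trickle of fragments linger forever
    return len(pending) > 1 and oldest_age_seconds > MAX_WAIT_SECONDS
```

Compare the naive policy — compact on every arriving fragment — which pays the fixed merge overhead N times instead of N/MIN_BATCH times; that ratio is where the order-of-magnitude saving comes from.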

Partitioning before compaction

LSM compaction is almost always scoped to a bucket. In Husky's case, buckets are (table, time-window) — compactors never merge across buckets. This keeps row counts per compaction bounded and makes the sort order relevant only within a bucket's row-key space.
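
Bucket scoping amounts to a group-by before any merge is considered; a sketch with illustrative field names:

```python
from collections import defaultdict

# Sketch of bucket-scoped compaction (field names are illustrative): each
# fragment carries a (table, time_window) bucket key, and each bucket is an
# independent compaction scope — compactors never merge across buckets.
def group_by_bucket(fragments):
    buckets = defaultdict(list)
    for frag in fragments:
        buckets[(frag["table"], frag["window"])].append(frag)
    return buckets
```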

Design knobs

  • Target fragment size. Too small → object-store latency, poor vectorised-scan throughput. Too large → kills query parallelism (fewer fragments = fewer workers scanning in parallel per query). There is an optimum for a given workload.
  • Level / tier count. Sets write amplification vs. run count.
  • Row-group size inside a run (for concepts/columnar-storage-format payloads). Bounds compaction memory; adaptive in Husky's case.
  • Sort schema. The column(s) + timestamp by which each run is sorted; enables both k-way merge and pruning. Choice of schema comes from query-pattern analysis.
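
The knobs above can be gathered into one config object; field names and defaults below are illustrative, not Husky's actual parameters:

```python
from dataclasses import dataclass

# Sketch of a compaction config capturing the knobs above; all names and
# default values are hypothetical.
@dataclass
class CompactionConfig:
    target_fragment_rows: int = 1_000_000          # ~1M rows, per the source
    tier_fanout: int = 4                           # runs per size class before merging
    max_levels: int = 5                            # depth of the leveled layer
    row_group_rows: int = 65_536                   # bounds per-compaction memory
    sort_schema: tuple = ("service", "timestamp")  # hypothetical; from query analysis
```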

Seen in

  • systems/husky — hybrid size-tiered + leveled/locality compaction over columnar fragments in object storage; 30% query-fleet savings from locality layer.
  • sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale
  • sources/2025-05-20-flyio-litestream-revamped — LSM-compaction-over-SQLite-pages instance. Fly.io's redesigned Litestream applies the LSM compaction idea to LTX files — sorted, transaction-aware SQLite-page-range changesets — rather than KV records. Direct quote: "we can easily merge multiple LTX files together and create a new LTX file with only the latest version of each page. … This is similar to how an LSM tree works." See patterns/ltx-compaction for the specialised pattern.
  • sources/2025-10-02-flyio-litestream-v050-is-here — concrete-ladder instance. Litestream v0.5.0 ships a three-level time-window compaction ladder over LTX: "at Level 1, we compact all the changes in a 30-second time window; at Level 2, all the Level 1 files in a 5-minute window; at Level 3, all the Level 2's over an hour." Restore cost bounded to "a dozen or so files on average" regardless of retention depth. Specialisation of leveled LSM compaction with time-window (not size-threshold) level boundaries — unusual among LSM engines (most use byte / row / file-count triggers). Compaction runs inside Litestream, not SQLite: "Performance is limited only by I/O throughput."
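
Time-window level boundaries reduce to truncating a file's timestamp to the level's window; a sketch of the quoted 30s / 5min / 1h ladder (function name is illustrative):

```python
# Sketch of time-window level assignment (illustrative): files whose
# timestamps truncate to the same window at a given level get merged together.
WINDOWS = {1: 30, 2: 300, 3: 3600}  # seconds, per the quoted 30s/5min/1h ladder

def window_for(level, unix_seconds):
    size = WINDOWS[level]
    return (unix_seconds // size) * size
```

Unlike a byte-threshold trigger, this makes a file's final resting place a pure function of its timestamp, which is what bounds restore to a fixed number of files per level.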

Sibling compaction families (non-LSM)

LSM compaction is one instance of the broader "reclaim waste from an append-only substrate" category; the data model under it differs per instance, and that determines the algorithm:

  • LSM DBs (Husky, RocksDB) — units are sorted runs / SSTables keyed by key range; compaction is a k-way merge along the key axis (size-tiered vs leveled vs hybrid — see above).
  • Magic Pocket — units are opaque blob volumes with no key ordering; compaction is a capacity-fit packing problem (host + donors in L1, DP-packing in L2) and a streaming re-encode in L3. Multi-strategy compaction over disjoint volume fill-level ranges (patterns/multi-strategy-compaction).
  • Copy-on-write merge over Parquet (Amazon BDT on Ray) — partitioned into buckets by (table, time-window); compaction is CoW merge within bucket, file-by-reference copies when untouched data outnumbers mutated.

All share the two-stage framing Magic Pocket articulates most cleanly: GC marks, compaction frees — the reclamation algorithm is the data-model-specific rewrite step.

(Source: sources/2026-04-02-dropbox-magic-pocket-storage-efficiency-compaction)
