PATTERN
Streaming k-way merge (compaction)¶
A compaction pattern for concepts/lsm-compaction over concepts/columnar-storage-format fragments whose inputs live in high-latency, high-bandwidth object storage (S3, GCS, ABS). It makes three simultaneous guarantees that a naïve "download → merge → upload" compactor can't:
- 1 GET per input fragment. Each input is pulled exactly once, streamed, never re-read.
- CPU-saturated merge. The merge loop is never I/O-blocked; the object-store fetches run in parallel with the merge.
- Bounded memory. Fixed memory per compaction regardless of input fragment sizes, typically row-group-size × input-count, small enough to fit "less than the memory available for one CPU".
Together, these are what let Husky run at "thousands of fragments and dozens of GB of data every second" on a compactor-worker fleet. (Source: sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale)
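The memory bound above is simple arithmetic; a minimal sketch with assumed numbers (not Husky's actual configuration):

```python
# Illustrative numbers only: the point is that peak memory depends on
# row-group size and input count, never on total input bytes.
row_group_bytes = 8 * 1024 * 1024   # assumed 8 MiB row groups
input_count = 32                    # fragments merged in one compaction
peak_bytes = row_group_bytes * input_count
print(peak_bytes // (1024 * 1024))  # 256 (MiB), fixed for any input size
```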
Structural requirements on the file format¶
The merge only works cheaply if the fragment format gives you:
- Discover-as-you-stream column layout. Column headers appear inline in the stream (Husky) or in a prefix block, so the reader doesn't have to seek to a footer before it can start decoding. Parquet's footer-at-end design doesn't satisfy this directly — Husky calls out that its format is Parquet-like but specially designed to support streaming compaction.
- Row groups across all columns of uniform size. This is what bounds memory during the merge.
- A sorting schema — zero or more columns + timestamp — that every writer (and previous compactor) has honoured. The merge relies on all inputs already being sorted.
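A minimal sketch of the discover-as-you-stream requirement, using an invented length-prefixed framing (not Husky's actual on-disk format): each column's header precedes its data in the byte stream, so a reader can decode as bytes arrive without seeking to a trailing footer.

```python
import io
import struct

def write_fragment(buf, columns):
    """Write columns as [name_len][name][data_len][data] records, in order.
    Header-before-data means a reader never needs to seek to a footer."""
    for name, data in columns:
        nb = name.encode()
        buf.write(struct.pack("<I", len(nb)))
        buf.write(nb)
        buf.write(struct.pack("<I", len(data)))
        buf.write(data)

def stream_columns(buf):
    """Yield (name, data) as soon as each column's header arrives."""
    while True:
        raw = buf.read(4)
        if not raw:
            return
        name = buf.read(struct.unpack("<I", raw)[0]).decode()
        size = struct.unpack("<I", buf.read(4))[0]
        yield name, buf.read(size)

buf = io.BytesIO()
write_fragment(buf, [("ts", b"\x01\x02"), ("msg", b"hello")])
buf.seek(0)
print([name for name, _ in stream_columns(buf)])  # ['ts', 'msg']
```

A footer-at-end format (Parquet) forces either a ranged GET for the footer first or buffering the whole object; header-first framing is what lets the single streamed GET suffice.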
The merge shape¶
for each column C (in fragment schema order):
    open C's column stream in every input fragment
    for each row group in the global sort order:
        k-way-merge one row group's worth of C values
        (driven by the sort-schema key already materialised)
        write C's row group to the output fragment
        release the per-input row-group buffers
Per the Husky post: "compaction can proceed one column at a time, one row group at a time, with a bounded amount of memory being used".
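A runnable miniature of that loop, with in-memory lists standing in for object-store column streams (fragment contents, column names, and sizes all invented):

```python
import heapq

# Toy inputs: each fragment is {column: [values]}, already sorted by "ts".
frags = [
    {"ts": [1, 4, 7], "msg": ["a", "d", "g"]},
    {"ts": [2, 5],    "msg": ["b", "e"]},
    {"ts": [3, 6],    "msg": ["c", "f"]},
]

# Materialise the driving key once: k-way-merge only the sort-schema
# column, remembering which (fragment, row) each output row came from.
order = [(i, r) for _, i, r in heapq.merge(
    *([(ts, i, r) for r, ts in enumerate(f["ts"])] for i, f in enumerate(frags))
)]

# Then one column at a time, one row group at a time: gather values in
# the global order; each group's buffers are dropped before the next.
ROWS_PER_GROUP = 2
out = {}
for col in ("ts", "msg"):
    out[col] = [
        [frags[i][col][r] for i, r in order[g:g + ROWS_PER_GROUP]]
        for g in range(0, len(order), ROWS_PER_GROUP)
    ]
print(out["ts"])  # [[1, 2], [3, 4], [5, 6], [7]]
```

The memory high-water mark is one row group per input per column, which is the bound the pattern promises.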
The driving key is computed once¶
The sort key is materialised up-front by reading only the sort-schema columns from all inputs. The global row order is derived from that; subsequent columns are written in that order without re-consulting the key columns. This is what separates a streaming k-way merge from a repeated "scan every file to find the next row" approach.
Adaptive row-group sizing¶
Row-group size in the output is chosen against the heaviest input column, so memory stays bounded even when a fragment contains very large column values (Husky: log messages up to 75 KiB × millions of events). If the compactor detects mid-write that the row group has overshot the target, it restarts with a smaller row-group size; a restart is cheaper than an OOM mid-merge.
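A sketch of both halves of that policy, with invented function names and numbers: size against the heaviest column up front, and halve-and-restart on a mid-write overshoot.

```python
def pick_rows_per_group(avg_value_bytes, budget_bytes, input_count):
    """Size row groups against the heaviest column so that one row group
    per input fits the memory budget. Numbers here are illustrative."""
    heaviest = max(avg_value_bytes.values())
    return max(1, budget_bytes // (heaviest * input_count))

def write_groups(values, rows_per_group, cap_bytes):
    """Emit row groups; on a mid-write overshoot, restart with half the
    row count rather than risk running out of memory mid-merge."""
    while True:
        groups = []
        for g in range(0, len(values), rows_per_group):
            group = values[g:g + rows_per_group]
            if len(group) > 1 and sum(len(v) for v in group) > cap_bytes:
                rows_per_group = max(1, rows_per_group // 2)  # restart smaller
                break
            groups.append(group)
        else:
            return rows_per_group, groups

print(pick_rows_per_group({"msg": 1024, "ts": 8}, 256 * 2**20, 32))  # 8192
rows, groups = write_groups([b"x" * 100] * 3 + [b"x" * 5000], 4, 1024)
print(rows, len(groups))  # 1 4
```

The single 5000-byte value forces the restart path twice (4 → 2 → 1 rows per group); a lone oversized value is emitted as its own group since it can't be split further.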
Output = one or more disjoint fragments¶
A compaction is free to emit multiple output fragments that are disjoint in the sort order — the invariant kept is that the output set, together, is sorted. This is how leveled/locality compaction (see concepts/lsm-compaction) turns an overlapping run set into a non-overlapping one at the next level.
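A sketch of the invariant, using an illustrative cut rule (fixed row count per output fragment): because each output is a contiguous slice of the sorted stream, fragment key ranges never overlap.

```python
def split_disjoint(sorted_keys, max_rows):
    """Cut one globally sorted stream into several output fragments;
    contiguous slices guarantee pairwise-disjoint key ranges."""
    return [sorted_keys[i:i + max_rows]
            for i in range(0, len(sorted_keys), max_rows)]

parts = split_disjoint([1, 2, 3, 5, 8, 9, 12], 3)
ranges = [(p[0], p[-1]) for p in parts]
print(ranges)  # [(1, 3), (5, 9), (12, 12)]
# Non-overlap invariant: each fragment's min exceeds the previous max.
assert all(ranges[i][1] < ranges[i + 1][0] for i in range(len(ranges) - 1))
```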
Seen in¶
- systems/husky — Datadog's observability event store; compaction is the cost lever, and this merge shape is what makes it cheap enough to use aggressively.
- sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale
Where it generalises¶
Any LSM-over-object-storage system merging concepts/columnar-storage-format runs faces the same constraints; the pattern generalises to analytics-over-lake systems that own their own file format. Engines that merge Parquet files directly (many analytics DBs) pay a less-bounded memory cost because Parquet's footer-then-row-groups layout doesn't stream as cleanly.