DATADOG 2025-01-29 Tier 3

Datadog — Husky: Efficient compaction at Datadog scale

Summary

Third post in Datadog's Husky series (after introducing-husky and the deep-dive on ingestion). Husky is an observability event store layered over object storage (S3 / GCS / Azure Blob). This post is a detailed walk through Husky's compaction subsystem: why it exists, how the custom Parquet-like fragment format lets compaction stream entire files through fixed memory, how size-tiered + locality compaction combine into a hybrid LSM, and — the most novel piece — how Husky uses a trimmed finite-state automaton serialised as a regex in fragment metadata to prune fragments at query time far more precisely than min/max row-key ranges would allow. Reported outcome: the locality-compaction layer, conservatively configured, enabled a 30% reduction in the query-worker pool when first rolled out — directly saving 30% of the cost of Husky's most expensive component.

Key takeaways

  1. Husky's cost model ties compaction to query cost. Query cost is proportional to (fragments fetched from object storage) × (events scanned per fragment). Compaction is Husky's lever on both terms: fewer, bigger, more-selective fragments → cheaper queries. systems/husky
  2. Write path buffers events per tenant, caps buffer age, flushes one fragment per buffer to object storage. Fragment metadata is stored in a FoundationDB-backed repository; compactions swap "old → new fragment set" atomically in one FDB transaction, so scans always see a single consistent view before/during/after a compaction. systems/foundationdb
  3. "Right-sized" fragments are a multi-way trade-off. Too small: every query pays object-store latency per fragment and under-utilises vectorised scanners. Too large: query parallelism suffers. Compaction is lazy — it waits until "a few dozen fragments" can be merged at once — which Datadog quotes as an order-of-magnitude CPU + object-store API cost reduction vs. naïvely compacting on arrival.
  4. Custom Parquet-like columnar format, optimised for streaming compaction. Column headers are stored inline with the column data (rather than in a trailing footer), so readers can discover columns as they stream; a fixed, uniform row-group size across all columns within a fragment bounds memory. Fragments can carry hundreds of thousands to millions of columns post-compaction — Husky doesn't control incoming schemas. Compaction proceeds one column × one row group at a time, so memory is bounded to (row-group size × input-fragment count). concepts/columnar-storage-format
  5. Compaction is a streaming k-way merge, not a concatenation. A sorting schema — zero or more columns + timestamp — defines a lexicographic global order that the ingest AND compaction paths maintain as an invariant. Each compaction reads the sort-schema columns from all inputs, builds the global sort order, then streams each column from all inputs simultaneously, writing one column at a time in the output. Output = one or more fragments, disjoint w.r.t. sort order. patterns/streaming-k-way-merge
  6. Row-group sizing adapts to column heaviness. Log-message values can be up to 75 KiB each, and fragments target millions of events, so a single "heavy" row group can need >1 GiB. The compactor bounds memory by sizing row groups against the heaviest column in the inputs; if an in-progress row group overshoots the target mid-write, it restarts with a smaller row-group size.
  7. Hybrid LSM = size-tiered ∘ locality (leveled) compaction. All data lives in (table, time-window) buckets — only same-bucket fragments compact together. Size-tiering merges exponentially-sized classes up to ~1M-row fragments. Then locality compaction treats fragments as nodes in L0..Ln levels; overlapping ranges on the same level are k-way merged into disjoint outputs; levels grow exponentially in row-count, so higher levels hold narrower row-key ranges. concepts/lsm-compaction
  8. Fragment pruning via trimmed FSA → regex in metadata. Instead of min/max row-key filters (which become imprecise for non-leading sort columns), Husky builds, per sort column per fragment, a finite-state automaton matching all values in that column, "trimmed" so it may over-match but never under-match — a Bloom-filter-like no-false-negatives property. The FSA is mechanically converted to a regex and stored in the fragment metadata (FDB). At query time, metadata is fetched for all candidate fragments and each regex is evaluated against the query predicates — fragments whose regex rejects the predicate value are skipped entirely. patterns/trimmed-automaton-predicate-filter, concepts/fragment-pruning
  9. Locality-compaction beats range-only pruning by example. A fragment whose service column contains compactor,reader,writer yields regex ^(compactor|reader|writer)$. A query for service:metadata correctly skips this fragment — even though metadata lexicographically sorts between compactor and writer, which a min/max-key filter would NOT skip.
  10. Reported outcome. Locality compaction, conservatively configured to use no more CPU than size-tiering alone, produced a 30% reduction in the query-worker fleet at rollout, against the single largest cost centre in Husky. Scaling multiplier: "savings get larger every year as the usage of our system grows".
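The atomic "old → new fragment set" swap in takeaway 2 can be sketched in plain Python. This is a toy in-process stand-in — a lock models the FoundationDB transaction, and every name here is invented for illustration, not Husky's API:

```python
import threading

class FragmentRepository:
    """Toy stand-in for Husky's FoundationDB-backed metadata repository.

    The property being modeled: a compaction replaces its input fragments
    with its output fragments in one atomic step, so a concurrent scan
    always sees either all of the old fragments or all of the new ones,
    never a mix of the two.
    """

    def __init__(self, fragments):
        self._lock = threading.Lock()  # stands in for one FDB transaction
        self._live = set(fragments)

    def swap(self, old_fragments, new_fragments):
        """Atomically retire old_fragments and publish new_fragments."""
        with self._lock:
            assert set(old_fragments) <= self._live, "inputs must be live"
            self._live -= set(old_fragments)
            self._live |= set(new_fragments)

    def snapshot(self):
        """Consistent view of live fragments, as a scan would read it."""
        with self._lock:
            return frozenset(self._live)

repo = FragmentRepository({"f1", "f2", "f3"})
repo.swap({"f1", "f2"}, {"f4"})  # one compaction commits atomically
# repo.snapshot() now contains f3 and f4 only.
```

The real system gets the same all-or-nothing guarantee from FDB's serialisable transactions rather than a mutex, but the invariant a scan relies on is the one shown here.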
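Takeaway 5's streaming k-way merge maps directly onto `heapq.merge`, which holds only one pending row per input — the property that keeps compaction in fixed memory. A minimal sketch, assuming rows are dicts keyed by the typical log sort schema (service, status, env, timestamp); all identifiers are hypothetical:

```python
import heapq

def sort_key(row):
    # Lexicographic sort-schema key: columns first, timestamp last.
    return (row["service"], row["status"], row["env"], row["ts"])

def kway_merge(fragments):
    """Stream rows from sorted input fragments in global sort order.

    Inputs must each already be sorted by sort_key -- the invariant the
    ingest and compaction paths maintain.
    """
    yield from heapq.merge(*fragments, key=sort_key)

frag_a = iter([
    {"service": "compactor", "status": "info", "env": "prod", "ts": 1},
    {"service": "writer", "status": "warn", "env": "prod", "ts": 5},
])
frag_b = iter([
    {"service": "reader", "status": "info", "env": "prod", "ts": 3},
])

merged = list(kway_merge([frag_a, frag_b]))
# Rows come out ordered compactor < reader < writer.
```

The real compactor merges per column rather than per row (it first materialises the global order from the sort-schema columns, then streams each remaining column through it), but the merge discipline is the same.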
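The pruning example in takeaway 9 is small enough to run verbatim. This sketch builds the exact (untrimmed) alternation, so it demonstrates the no-false-negatives behaviour but not the FSA trimming that lets real metadata over-match to stay compact; function names are invented:

```python
import re

def build_pruning_regex(values):
    """Regex matching every value a sort column holds in this fragment.

    A trimmed FSA may over-match (no false negatives, possible false
    positives, like a Bloom filter); this exact alternation is the
    untrimmed special case.
    """
    return "^(" + "|".join(re.escape(v) for v in sorted(values)) + ")$"

def can_skip(fragment_regex, predicate_value):
    """Skip the fragment iff its regex rejects the query predicate."""
    return re.match(fragment_regex, predicate_value) is None

pattern = build_pruning_regex({"compactor", "reader", "writer"})
# pattern == "^(compactor|reader|writer)$"

# service:metadata is pruned even though "metadata" sorts between
# "compactor" and "writer" -- a min/max range filter would NOT skip it.
can_skip(pattern, "metadata")  # True: fragment skipped
can_skip(pattern, "reader")    # False: fragment must be scanned
```

This is exactly where the regex beats a min/max filter: the range check asks "could this value fall inside the fragment's key range?", while the regex asks "is this value actually in the fragment's column?".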

Systems / concepts / patterns mentioned

  • systems/husky, systems/foundationdb
  • concepts/columnar-storage-format, concepts/lsm-compaction, concepts/fragment-pruning
  • patterns/streaming-k-way-merge, patterns/trimmed-automaton-predicate-filter

Scale / architecture numbers cited

  • Event ingest: "trillions of events per day" from a huge range of data sources.
  • Compaction throughput: "compacting thousands of fragments and processing dozens of GB of data every second".
  • Fragment size progression: writer output ~thousand rows → compacted ~1M rows; "few dozen fragments" is the minimum batch for the first-tier compaction to fire lazily.
  • Columns per fragment: "hundreds of thousands (if not millions)" of columns after compaction on large tenants.
  • Row-group heaviness: log message column up to 75 KiB/event; 15k such events already exceed 1 GiB/column; target is millions of events/fragment.
  • Typical log sort schema: service, status, env, timestamp.
  • Outcome: 30% query-worker-fleet reduction from locality compaction rollout (Husky's largest cost centre).
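The row-group-heaviness arithmetic above checks out; a sketch of the sizing rule, with the per-column memory budget name and value invented for illustration (the post says memory is bounded, not what the budget is):

```python
# Hypothetical per-column memory budget, assumed at 1 GiB to match the
# ">1 GiB/column" figure in the notes.
MEMORY_BUDGET_PER_COLUMN = 1 << 30  # 1 GiB

def rows_per_group(heaviest_value_bytes):
    """Size row groups against the heaviest column seen in the inputs."""
    return MEMORY_BUDGET_PER_COLUMN // heaviest_value_bytes

heavy = 75 * 1024  # 75 KiB log messages, the figure quoted in the post

# 15k such events already blow a 1 GiB per-column budget...
assert 15_000 * heavy > MEMORY_BUDGET_PER_COLUMN
# ...so only ~14k rows fit per group, far below the
# millions-of-events-per-fragment target.
assert rows_per_group(heavy) == 13_981
```

This is why the sizing must be adaptive: a fragment full of short tag columns can afford millions of rows per group, while one heavy log-message column forces the whole fragment down to thousands.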

Caveats / what's deferred

  • Datadog explicitly frames this post as "compaction only"; the query execution system — "methods for hiding object storage latency and avoiding executing the same query twice" — is deferred to a future post.
  • Sort-schema selection is described only at the product-analytics level ("some analysis of query patterns … suffices"); the mechanics of that analysis are not covered.
  • Row-group "restart" logic (detecting an overshoot mid-write and restarting with a smaller row-group size) is described but the heuristic for how much smaller is not given.
  • Locality-compaction promotion thresholds ("level row-limit") are described qualitatively; no numbers on level sizes or counts.

Tier / scope call

Tier 3 (Datadog is not in AGENTS.md tier lists — treated as Tier-3-equivalent). Clearly on-topic per the scope filter: distributed-systems internals, concrete scaling trade-offs, storage-architecture design decisions with quoted production impact. Full ingest.
