DATADOG 2025-01-29 Tier 3

Datadog — Husky: Efficient compaction at Datadog scale

Summary

Third post in Datadog's Husky series (after introducing-husky and the deep-dive on ingestion). Husky is an observability event store layered over object storage (S3 / GCS / Azure Blob). This post is a detailed walk through Husky's compaction subsystem: why it exists, how the custom Parquet-like fragment format lets compaction stream entire files through fixed memory, how size-tiered + locality compaction combine into a hybrid LSM, and — the most novel piece — how Husky uses a trimmed finite-state automaton serialised as a regex in fragment metadata to prune fragments at query time far more precisely than min/max row-key ranges would allow. Reported outcome: the locality-compaction layer, conservatively configured, enabled a 30% reduction in the query-worker pool when first rolled out — directly saving 30% of the cost of Husky's most expensive component.

Key takeaways

  1. Husky's cost model ties compaction to query cost. Query cost is proportional to (fragments fetched from object storage) × (events scanned per fragment). Compaction is Husky's lever on both terms: fewer, bigger, more-selective fragments → cheaper queries. systems/husky
  2. Write path buffers events per tenant, caps buffer age, flushes one fragment per buffer to object storage. Fragment metadata is stored in a FoundationDB-backed repository; compactions swap "old → new fragment set" atomically in one FDB transaction, so scans always see a single consistent view before/during/after a compaction. systems/foundationdb
  3. "Right-sized" fragments are a multi-way trade-off. Too small: every query pays object-store latency per fragment and under-utilises vectorised scanners. Too large: query parallelism suffers. Compaction is lazy — it waits until "a few dozen fragments" can be merged at once — which Datadog quotes as an order-of-magnitude CPU + object-store API cost reduction vs. naïvely compacting on arrival.
  4. Custom Parquet-like columnar format, optimised for streaming compaction. Column headers are stored inline with the column data (rather than in a trailing footer), so readers can discover columns as they stream; a fixed, uniform row-group size across all columns within a fragment bounds memory. Fragments can carry hundreds of thousands to millions of columns post-compaction — Husky doesn't control incoming schemas. Compaction proceeds one column × one row group at a time, so memory is bounded to (row-group size × input-fragment count). concepts/columnar-storage-format
  5. Compaction is a streaming k-way merge, not a concatenation. A sorting schema — zero or more columns + timestamp — defines a lexicographic global order that the ingest AND compaction paths maintain as an invariant. Each compaction reads the sort-schema columns from all inputs, builds the global sort order, then streams each column from all inputs simultaneously, writing one column at a time in the output. Output = one or more fragments, disjoint w.r.t. sort order. patterns/streaming-k-way-merge
  6. Row-group sizing adapts to column heaviness. Log-message values can be up to 75 KiB each, and fragments target millions of events, so a single "heavy" row group can need >1 GiB. The compactor bounds memory by sizing row groups against the heaviest column in the inputs; if an in-progress row group overshoots the target mid-write, it restarts with a smaller row-group size.
  7. Hybrid LSM = size-tiered ∘ locality (leveled) compaction. All data lives in (table, time-window) buckets — only same-bucket fragments compact together. Size-tiering merges exponentially-sized classes up to ~1M-row fragments. Then locality compaction treats fragments as nodes in L0..Ln levels; overlapping ranges on the same level are k-way merged into disjoint outputs; levels grow exponentially in row-count, so higher levels hold narrower row-key ranges. concepts/lsm-compaction
  8. Fragment pruning via trimmed FSA → regex in metadata. Instead of min/max row-key filters (which become imprecise for non-leading sort columns), Husky builds, per sort column per fragment, a finite-state automaton matching all values in that column, "trimmed" so it may over-match but never under-match — a Bloom-filter-like no-false-negatives property. The FSA is mechanically converted to a regex and stored in the fragment metadata (FDB). At query time, metadata is fetched for all candidate fragments and each regex is evaluated against the query predicates — fragments whose regex rejects the predicate value are skipped entirely. patterns/trimmed-automaton-predicate-filter, concepts/fragment-pruning
  9. Locality-compaction beats range-only pruning by example. A fragment whose service column contains compactor,reader,writer yields regex ^(compactor|reader|writer)$. A query for service:metadata correctly skips this fragment — even though metadata lexicographically sorts between compactor and writer, which a min/max-key filter would NOT skip.
  10. Reported outcome. Locality compaction, conservatively configured to use no more CPU than size-tiering alone, produced a 30% reduction in the query-worker fleet at rollout, against the single largest cost centre in Husky. Scaling multiplier: "savings get larger every year as the usage of our system grows".
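The atomic "old → new fragment set" swap in takeaway 2 can be sketched in plain Python. This is a toy in-process stand-in — a lock models the FoundationDB transaction, and every name here is invented for illustration, not Husky's API:

```python
import threading

class FragmentRepository:
    """Toy stand-in for Husky's FoundationDB-backed metadata repository.

    The property being modeled: a compaction replaces its input fragments
    with its output fragments in one atomic step, so a concurrent scan
    always sees either all of the old fragments or all of the new ones,
    never a mix of the two.
    """

    def __init__(self, fragments):
        self._lock = threading.Lock()  # stands in for one FDB transaction
        self._live = set(fragments)

    def swap(self, old_fragments, new_fragments):
        """Atomically retire old_fragments and publish new_fragments."""
        with self._lock:
            assert set(old_fragments) <= self._live, "inputs must be live"
            self._live -= set(old_fragments)
            self._live |= set(new_fragments)

    def snapshot(self):
        """Consistent view of live fragments, as a scan would read it."""
        with self._lock:
            return frozenset(self._live)

repo = FragmentRepository({"f1", "f2", "f3"})
repo.swap({"f1", "f2"}, {"f4"})  # one compaction commits atomically
# repo.snapshot() now contains f3 and f4 only.
```

The real system gets the same all-or-nothing guarantee from FDB's serialisable transactions rather than a mutex, but the invariant a scan relies on is the one shown here.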
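Takeaway 5's streaming k-way merge maps directly onto `heapq.merge`, which holds only one pending row per input — the property that keeps compaction in fixed memory. A minimal sketch, assuming rows are dicts keyed by the typical log sort schema (service, status, env, timestamp); all identifiers are hypothetical:

```python
import heapq

def sort_key(row):
    # Lexicographic sort-schema key: columns first, timestamp last.
    return (row["service"], row["status"], row["env"], row["ts"])

def kway_merge(fragments):
    """Stream rows from sorted input fragments in global sort order.

    Inputs must each already be sorted by sort_key -- the invariant the
    ingest and compaction paths maintain.
    """
    yield from heapq.merge(*fragments, key=sort_key)

frag_a = iter([
    {"service": "compactor", "status": "info", "env": "prod", "ts": 1},
    {"service": "writer", "status": "warn", "env": "prod", "ts": 5},
])
frag_b = iter([
    {"service": "reader", "status": "info", "env": "prod", "ts": 3},
])

merged = list(kway_merge([frag_a, frag_b]))
# Rows come out ordered compactor < reader < writer.
```

The real compactor merges per column rather than per row (it first materialises the global order from the sort-schema columns, then streams each remaining column through it), but the merge discipline is the same.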
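The pruning example in takeaway 9 is small enough to run verbatim. This sketch builds the exact (untrimmed) alternation, so it demonstrates the no-false-negatives behaviour but not the FSA trimming that lets real metadata over-match to stay compact; function names are invented:

```python
import re

def build_pruning_regex(values):
    """Regex matching every value a sort column holds in this fragment.

    A trimmed FSA may over-match (no false negatives, possible false
    positives, like a Bloom filter); this exact alternation is the
    untrimmed special case.
    """
    return "^(" + "|".join(re.escape(v) for v in sorted(values)) + ")$"

def can_skip(fragment_regex, predicate_value):
    """Skip the fragment iff its regex rejects the query predicate."""
    return re.match(fragment_regex, predicate_value) is None

pattern = build_pruning_regex({"compactor", "reader", "writer"})
# pattern == "^(compactor|reader|writer)$"

# service:metadata is pruned even though "metadata" sorts between
# "compactor" and "writer" -- a min/max range filter would NOT skip it.
can_skip(pattern, "metadata")  # True: fragment skipped
can_skip(pattern, "reader")    # False: fragment must be scanned
```

This is exactly where the regex beats a min/max filter: the range check asks "could this value fall inside the fragment's key range?", while the regex asks "is this value actually in the fragment's column?".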

Systems / concepts / patterns mentioned

  • systems/husky, systems/foundationdb
  • concepts/columnar-storage-format, concepts/lsm-compaction, concepts/fragment-pruning
  • patterns/streaming-k-way-merge, patterns/trimmed-automaton-predicate-filter

Scale / architecture numbers cited

  • Event ingest: "trillions of events per day" from a huge range of data sources.
  • Compaction throughput: "compacting thousands of fragments and processing dozens of GB of data every second".
  • Fragment size progression: writer output ~thousand rows → compacted ~1M rows; "few dozen fragments" is the minimum batch for the first-tier compaction to fire lazily.
  • Columns per fragment: "hundreds of thousands (if not millions)" of columns after compaction on large tenants.
  • Row-group heaviness: log message column up to 75 KiB/event; 15k such events already exceed 1 GiB/column; target is millions of events/fragment.
  • Typical log sort schema: service, status, env, timestamp.
  • Outcome: 30% query-worker-fleet reduction from locality compaction rollout (Husky's largest cost centre).
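The row-group-heaviness arithmetic above checks out; a sketch of the sizing rule, with the per-column memory budget name and value invented for illustration (the post says memory is bounded, not what the budget is):

```python
# Hypothetical per-column memory budget, assumed at 1 GiB to match the
# ">1 GiB/column" figure in the notes.
MEMORY_BUDGET_PER_COLUMN = 1 << 30  # 1 GiB

def rows_per_group(heaviest_value_bytes):
    """Size row groups against the heaviest column seen in the inputs."""
    return MEMORY_BUDGET_PER_COLUMN // heaviest_value_bytes

heavy = 75 * 1024  # 75 KiB log messages, the figure quoted in the post

# 15k such events already blow a 1 GiB per-column budget...
assert 15_000 * heavy > MEMORY_BUDGET_PER_COLUMN
# ...so only ~14k rows fit per group, far below the
# millions-of-events-per-fragment target.
assert rows_per_group(heavy) == 13_981
```

This is why the sizing must be adaptive: a fragment full of short tag columns can afford millions of rows per group, while one heavy log-message column forces the whole fragment down to thousands.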

Caveats / what's deferred

  • Datadog explicitly frames this post as "compaction only"; the query execution system — "methods for hiding object storage latency and avoiding executing the same query twice" — is deferred to a future post.
  • Sort-schema selection is described only at the product-analytics level ("some analysis of query patterns … suffices"); the mechanics of that analysis are not covered.
  • Row-group "restart" logic (detecting an overshoot mid-write and restarting with a smaller row-group size) is described but the heuristic for how much smaller is not given.
  • Locality-compaction promotion thresholds ("level row-limit") are described qualitatively; no numbers on level sizes or counts.

Tier / scope call

Tier 3 (Datadog is not in AGENTS.md tier lists — treated as Tier-3-equivalent). Clearly on-topic per the scope filter: distributed-systems internals, concrete scaling trade-offs, storage-architecture design decisions with quoted production impact. Full ingest.
