
Husky

Husky is Datadog's distributed observability event store, layered over cloud object storage (S3 / GCS / Azure Blob) with the query system acting as a cache over that storage. It ingests trillions of events per day and targets the observability read pattern: write-heavy (few-to-no updates of existing rows), queries biased toward young data but with occasional analytical scans over billions of events.

Datadog has published a three-part series: introducing Husky, ingestion deep-dive, and the compaction post ingested here (Source: sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale).

Data model

  • Event → row in a table. Tables are partitioned by time window buckets.
  • Rows within a (table, time-window) bucket live in fragments — immutable objects in object storage.
  • Each fragment uses a custom columnar format loosely modeled on systems/apache-parquet but purpose-built for observability data: one row group per column header, with column headers stored inline so readers can stream-discover columns; a fixed, uniform row-group size across all columns within a fragment; and an embedded column-position skip list. Fragments can carry hundreds of thousands (to millions) of columns after compaction. See concepts/columnar-storage-format.
  • Each fragment is sorted by a sorting schema = zero or more user-chosen columns + timestamp, maintained as an invariant by both ingest and compaction paths. Typical log schema: service, status, env, timestamp.
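The data model above can be sketched in a few lines. All names, the window width, and the helper functions here are illustrative assumptions, not Husky's actual API:

```python
# Minimal sketch of the data model; WINDOW_S and the function names are
# invented for illustration.
WINDOW_S = 3600  # hypothetical time-window width; the post gives no value

def bucket_of(table, ts):
    """Events land in (table, time-window) buckets."""
    return (table, ts - ts % WINDOW_S)

def row_key(event):
    """Sorting schema = user-chosen columns + timestamp.
    Typical log schema from the text: service, status, env, timestamp."""
    return (event["service"], event["status"], event["env"], event["timestamp"])

event = {"service": "reader", "status": "ok", "env": "prod", "timestamp": 7275}
bucket = bucket_of("logs", event["timestamp"])  # ("logs", 7200)
```

Both ingest and compaction maintain the invariant that every fragment's rows are sorted by `row_key`.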

Metadata repository

Fragment metadata (including the per-column trimmed-regex pruning predicate and the fragment's min/max row keys) lives in a systems/foundationdb-backed metadata repository. FDB transactions give Husky the atomic "swap old fragment set for new fragment" guarantee that compaction needs — queries see the table state either entirely before or entirely after a compaction, never a half-applied one.
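The atomicity guarantee can be sketched as below. This is a hedged stand-in, not Datadog's implementation: an in-memory lock plays the role of an FDB transaction, and all names are invented.

```python
# Sketch of the atomic "swap old fragment set for new fragment" that
# compaction needs. A threading.Lock stands in for an FDB transaction.
import threading

class MetadataRepo:
    def __init__(self):
        self._lock = threading.Lock()
        self.live_fragments = set()

    def swap(self, old, new):
        """Atomically replace compaction inputs with outputs. Readers observe
        the table either entirely before or entirely after the swap."""
        with self._lock:
            if not old <= self.live_fragments:
                raise RuntimeError("inputs already compacted away; retry")
            self.live_fragments -= old
            self.live_fragments |= new

repo = MetadataRepo()
repo.live_fragments = {"f1", "f2", "f3"}
repo.swap({"f1", "f2"}, {"f12"})  # live set is now {"f12", "f3"}
```

A concurrent compactor whose inputs were already swapped away fails the precondition check and must retry, mirroring optimistic-transaction behavior.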

Write path

  • Writers buffer events per tenant, cap buffer age, flush one fragment per buffer to object storage.
  • Writers sort the events by the sorting-schema row key, write the fragment, and record min/max row keys + per-sort-column pruning FSA in the fragment header (and metadata).
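A minimal sketch of this write path, assuming an invented age threshold (the post publishes no numeric values) and illustrative names throughout:

```python
# Hypothetical per-tenant write buffer: cap buffer age, sort by row key on
# flush, emit one fragment recording its min/max row keys.
import time

MAX_BUFFER_AGE_S = 5.0  # invented threshold for "cap buffer age"

def _row_key(e):
    return (e["service"], e["status"], e["env"], e["timestamp"])

class TenantBuffer:
    def __init__(self):
        self.events, self.opened_at = [], None

    def append(self, event):
        if not self.events:
            self.opened_at = time.monotonic()
        self.events.append(event)

    def should_flush(self):
        return bool(self.events) and time.monotonic() - self.opened_at >= MAX_BUFFER_AGE_S

    def flush(self):
        """Sort by the sorting-schema row key and emit one fragment."""
        rows = sorted(self.events, key=_row_key)
        self.events = []
        return {"rows": rows,
                "min_key": _row_key(rows[0]),
                "max_key": _row_key(rows[-1])}

buf = TenantBuffer()
buf.append({"service": "writer", "status": "ok", "env": "prod", "timestamp": 2})
buf.append({"service": "reader", "status": "ok", "env": "prod", "timestamp": 1})
fragment = buf.flush()
```

The per-sort-column pruning FSA the writers also record is omitted here; the fragment-pruning section below covers what it does.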

Compaction (the 2025 post's subject)

Compaction is Husky's central cost lever: query cost is proportional to (fragments fetched from object storage) × (events scanned per fragment), so compaction attacks both terms.
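Illustrative arithmetic only (the numbers are invented, not Datadog's): the two terms of the cost model, and how compaction moves them.

```python
# Query cost is proportional to fragments fetched x events scanned per
# fragment. Size-tiering shrinks the first term by merging many small
# fragments; locality compaction shrinks it further by making row-key
# ranges disjoint and therefore prunable.
def query_cost(fragments_fetched, events_scanned_per_fragment):
    return fragments_fetched * events_scanned_per_fragment

before = query_cost(fragments_fetched=1_000, events_scanned_per_fragment=100_000)
after = query_cost(fragments_fetched=8, events_scanned_per_fragment=1_000_000)
```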

Two cooperating compaction strategies (hybrid LSM — see concepts/lsm-compaction):

  1. Size-tiered compaction. New small fragments enter at size class 0. Within a class, a batch is k-way merged into the next class — eventually producing ~1M-row fragments. Compaction is lazy: Husky waits until "a few dozen fragments" are ready to merge, which Datadog reports is an order-of-magnitude cheaper in CPU + object-store API calls vs. naïve eager compaction.
  2. Locality compaction. Sits above size-tiering. Treats fragments as nodes in exponentially-growing levels L0..Ln; for each level, overlapping row-key ranges are k-way merged into disjoint outputs and promoted once the level's row-limit is exceeded. Higher levels therefore hold narrower row-key ranges per fragment → more prunable.
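The lazy size-tiered policy can be sketched as follows. The batch threshold is invented (the post says only "a few dozen fragments"), and the function is a hypothetical illustration:

```python
# Lazy size-tiered selection: merge a batch within one size class into the
# next class only once enough fragments have accumulated.
BATCH = 32  # invented; stands in for "a few dozen fragments"

def pick_merge(size_classes):
    """Return (destination class, batch to k-way merge), or None if no class
    has accumulated enough fragments -- waiting is what the post credits with
    an order-of-magnitude CPU and API-call saving over eager merging."""
    for cls in sorted(size_classes):
        frags = size_classes[cls]
        if len(frags) >= BATCH:
            size_classes[cls] = frags[BATCH:]
            return cls + 1, frags[:BATCH]
    return None  # nothing ready yet

tiers = {0: [f"frag{i}" for i in range(40)], 1: []}
dest, batch = pick_merge(tiers)  # 32 class-0 fragments promoted toward class 1
```

Locality compaction would sit on top of this, grouping fragments by overlapping row-key range within a level rather than by size class alone.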

Both strategies use the same streaming k-way merge primitive — one GET per input fragment, streamed in parallel, one column × one row group at a time, bounded memory (row-group-size × input count), saturated CPU. See patterns/streaming-k-way-merge. Row-group size is chosen adaptively against the heaviest input column (logs can carry 75-KiB message values; millions of events/fragment → a single column's row group can exceed 1 GiB without bounding).
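The shared primitive can be sketched with `heapq.merge`. The real system streams one column and one row group at a time over parallel GETs; this simplified sketch merges whole rows from in-memory iterators, but shows the same bounded-memory property (one buffered row per input):

```python
# Streaming k-way merge over already-sorted fragment streams.
import heapq

def k_way_merge(fragment_streams):
    """Merge k sorted streams into one globally sorted stream, buffering only
    one row per input at a time."""
    return heapq.merge(*fragment_streams)

a = iter([("compactor", 1), ("writer", 5)])
b = iter([("reader", 2), ("reader", 4)])
merged = list(k_way_merge([a, b]))  # sorted by row key across both inputs
```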

Query path: fragment pruning

For every candidate fragment in the (table, time-window) buckets a query intersects, Husky evaluates the fragment's stored per-sort-column regex (a mechanical projection of a "trimmed" finite-state automaton built at fragment-write time over all values in the column) against the query's predicates. If the regex cannot match the predicate's value, the fragment is skipped without issuing a GET. This is far more precise than a min/max row-key filter, particularly for predicates on non-leading sort columns.

Worked example from the post: a fragment whose service column contains compactor,reader,writer stores regex ^(compactor|reader|writer)$. A query for service:metadata is skipped — even though metadata lexicographically sorts inside [compactor,writer], so a range-only filter would wrongly scan it. See patterns/trimmed-automaton-predicate-filter, concepts/fragment-pruning.
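The worked example, reproduced with a compiled regex standing in for the trimmed FSA (equivalent behavior for this small value set):

```python
# Exact-value regex pruning vs. a min/max range filter, per the post's example.
import re

fragment_service_values = re.compile(r"^(compactor|reader|writer)$")

def can_skip(predicate_value):
    """Skip the fragment (no GET) iff no stored value matches the predicate."""
    return fragment_service_values.fullmatch(predicate_value) is None

skipped = can_skip("metadata")  # regex filter prunes the fragment
would_range_scan = "compactor" <= "metadata" <= "writer"  # range filter fails to prune
```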

Reported scale / outcomes

  • Ingest: trillions of events per day.
  • Compaction throughput: "thousands of fragments and dozens of GB of data every second".
  • Locality-compaction rollout impact: 30% reduction in the query-worker fleet — Husky's largest cost centre — configured conservatively to use no more CPU than size-tiering alone.
  • "These savings get larger every year as the usage of our system grows."

Not covered here (deferred)

  • Query-execution details: latency-hiding over object storage, avoiding re-execution of the same query.
  • Ingestion internals (covered in the earlier deep-dive).
  • Numeric thresholds for size-tier boundaries, level row-limits, row-group adaptive sizing heuristic.
