

Trimmed-automaton predicate filter

A per-file (or per-row-group) approximate set-membership filter built by:

  1. Constructing a finite-state automaton (FSA) accepting every value that actually appears in one column of the file.
  2. "Trimming" that FSA — reducing its state count to a fixed budget — in a direction that guarantees the trim only over-approximates the accepting set. I.e., the trimmed FSA may accept some values that aren't in the column (false positives), but never rejects a value that is in the column (no false negatives).
  3. Mechanically projecting the FSA to a regular expression.
  4. Storing the regex in the file's metadata (not alongside the data payload).

At query time the engine evaluates the stored regex against the query's predicate value; if the regex rejects the value, the file cannot contain a matching row and is skipped — no payload GET.
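A minimal sketch of the build and query path, assuming the degenerate case where the FSA fits its state budget untrimmed, so the projected regex is an exact alternation of the distinct values (helper names are illustrative, not Husky's):

```python
import re

def build_filter_regex(column_values):
    """Project an exact, untrimmed FSA over the column's values to a regex.

    With no trimming, the automaton is equivalent to a plain alternation
    of the distinct values; re.escape keeps any metacharacters literal.
    """
    distinct = sorted(set(column_values))
    return "^(" + "|".join(re.escape(v) for v in distinct) + ")$"

def fragment_may_match(stored_regex, predicate_value):
    """Query-time check: False means the fragment provably contains no
    matching row, so its payload GET can be skipped."""
    return re.match(stored_regex, predicate_value) is not None

regex = build_filter_regex(["compactor", "reader", "writer", "reader"])
# regex == "^(compactor|reader|writer)$"
fragment_may_match(regex, "metadata")  # False: skip the fragment
fragment_may_match(regex, "reader")    # True: fetch and scan
```

The regex is stored once per fragment per column; the query-time cost is a single anchored match against the predicate value.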

Datadog's Husky event store uses this per-column per-fragment, stored in its FoundationDB-backed metadata repository. (Source: sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale)

What it's competing against

  • Min/max range metadata (Parquet's standard). Imprecise because the lexicographic [min, max] range admits every value that sorts between the extremes, whether or not it actually appears (especially on columns that aren't the leading sort key). Husky's example: a fragment whose service column holds compactor, reader, writer. A range filter wrongly scans it for service:metadata, which sorts between compactor and writer; the FSA regex ^(compactor|reader|writer)$ correctly rejects it.
  • Bloom filters. Same no-false-negatives contract. Husky describes its filter as "similar to a Bloom filter" but with two differences: (a) the regex form is often smaller for low-cardinality structured tag values, and (b) it's faster to evaluate because it short-circuits on the first mismatched prefix.
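Husky's example can be checked directly: the lexicographic min/max test admits service:metadata while the regex rejects it. A small sketch (variable names are illustrative):

```python
import re

present = ["compactor", "reader", "writer"]
lo, hi = min(present), max(present)           # Parquet-style min/max metadata
filter_regex = "^(compactor|reader|writer)$"  # FSA projected to a regex

query = "metadata"
range_says_scan = lo <= query <= hi  # True: "metadata" sorts inside the range
regex_says_scan = re.match(filter_regex, query) is not None  # False: skip
```

Both filters agree on values that are present; they differ only on absent values that happen to fall inside the lexicographic range.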

When it shines

  • Low-cardinality structured columns. The canonical Husky examples are observability tags: service (tens/hundreds of values), status (info/error/…), env (prod/staging/dev). Even higher-cardinality tag columns often have regular structure (machine-generated hostnames, dotted service names) that compresses well into an automaton.
  • Leveled LSM-compaction layouts, where higher levels produce narrower per-column value sets, so the filter gets more selective as compaction progresses.

When it falls over

  • High-cardinality unstructured columns (user IDs, URLs, free-text messages). The FSA either grows past its state budget — forcing aggressive trimming that over-approximates to "accept-all" — or the filter adds metadata cost without pruning.
  • Predicates the filter isn't built for. The technique described is equality/set-membership; range predicates on trimmed FSAs would need separate handling.
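The source doesn't specify Husky's trimming algorithm. A toy stand-in for one budget policy, truncating each value to a fixed prefix depth and accepting any continuation, shows how trimming preserves the no-false-negatives contract while degrading toward accept-all (function name and policy are assumptions for illustration):

```python
import re

def trimmed_regex(values, max_states):
    """Toy stand-in for FSA trimming: keep at most `max_states` characters
    of each value's prefix and accept any suffix. Every real value still
    matches (no false negatives); smaller budgets admit more false
    positives, collapsing to accept-all at budget 0.
    """
    if max_states == 0:
        return "^.*$"  # accept-all: prunes nothing but never lies
    prefixes = sorted({v[:max_states] for v in values})
    return "^(" + "|".join(re.escape(p) + ".*" for p in prefixes) + ")$"

ids = ["user-1234", "user-5678", "admin-9"]
rx = trimmed_regex(ids, 5)
# All present values still match; "admin-impostor" is now a false
# positive, while "guest-1" is still correctly rejected.
```

The direction of the trim is the whole point: any merge or truncation of states may only enlarge the accepted language, never shrink it.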

Build-time vs. query-time cost

  • Build time (at write / compaction): non-trivial. The writer must enumerate every distinct value, build the FSA, trim it, and project it to a regex. Amortised across many queries, the cost is worthwhile.
  • Query time: roughly one regex match per fragment metadata row. Cheap enough that Husky evaluates it for every fragment whose time bucket overlaps the query, before deciding which fragments to fetch from object storage.
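The query-time pruning loop can be sketched as follows (the fragment schema — `t_min`, `t_max`, `filter_regex` — is an illustrative assumption, not Husky's actual layout):

```python
import re

def fragments_to_fetch(fragments, query_start, query_end, predicate_value):
    """Evaluate the stored regex for every fragment whose time bucket
    overlaps the query window; only survivors incur a payload GET."""
    keep = []
    for frag in fragments:
        if frag["t_max"] < query_start or frag["t_min"] > query_end:
            continue  # time bucket does not overlap the query window
        if re.match(frag["filter_regex"], predicate_value):
            keep.append(frag)  # regex accepts: the fragment may match
    return keep

frags = [
    {"id": "a", "t_min": 0,  "t_max": 10, "filter_regex": "^(compactor|reader|writer)$"},
    {"id": "b", "t_min": 5,  "t_max": 15, "filter_regex": "^(metadata|reader)$"},
    {"id": "c", "t_min": 20, "t_max": 30, "filter_regex": "^(metadata)$"},
]
# Query window [0, 12], predicate service = "metadata":
# fragment a is rejected by its regex, c by its time bucket; only b is fetched.
```

Because this runs against metadata only, the expensive object-storage GETs happen just for the fragments that survive both checks.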
