PATTERN
Trimmed-automaton predicate filter¶
A per-file (or per-row-group) approximate set-membership filter built by:
- Constructing a finite-state automaton (FSA) accepting every value that actually appears in one column of the file.
- "Trimming" that FSA — reducing its state count to a fixed budget — in a direction that guarantees the trim only over-approximates the accepting set. I.e., the trimmed FSA may accept some values that aren't in the column (false positives), but never rejects a value that is in the column (no false negatives).
- Mechanically projecting the FSA to a regular expression.
- Storing the regex in the file's metadata (not alongside the data payload).
At query time the engine evaluates the stored regex against the query's predicate value; if the regex rejects the value, the file cannot contain a matching row and is skipped — no payload GET.
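The build and query steps above can be sketched end to end. This is a minimal illustration, not Husky's actual algorithm: the trie-as-FSA representation, the depth-based trimming heuristic, and the `max_states` knob are all assumptions made for the sketch. The one property it preserves faithfully is the contract: trimming only ever widens the accepted set.

```python
import re

def build_filter_regex(values, max_states=64):
    """Illustrative sketch (not Husky's code): build a character-trie
    FSA over the column's values, trim it to a state budget by cutting
    below a depth and replacing each cut subtree with '.*' (so the trim
    only over-approximates), then project the trimmed trie to a regex."""
    trie = {}
    for v in values:
        node = trie
        for ch in v:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-value marker state

    def states(node, depth, cut):
        # Number of states the trimmed automaton would keep.
        if depth >= cut:
            return 1
        return 1 + sum(states(c, depth + 1, cut) for c in node.values())

    def project(node, depth, cut):
        if depth >= cut and node:
            return ".*"  # trimmed: accept any suffix (false positives OK)
        parts = []
        for ch, child in sorted(node.items()):
            if ch == "":
                parts.append("")  # a stored value ends at this state
            else:
                parts.append(re.escape(ch) + project(child, depth + 1, cut))
        return parts[0] if len(parts) == 1 else "(" + "|".join(parts) + ")"

    cut = max((len(v) for v in values), default=0) + 1
    while states(trie, 0, cut) > max_states and cut > 1:
        cut -= 1  # trim deeper levels first; never drops an accepted value
    return project(trie, 0, cut)

# Query time: if the stored regex rejects the predicate value, skip the file.
regex = build_filter_regex(["compactor", "reader", "writer"])
print(regex)                                  # (compactor|reader|writer)
print(bool(re.fullmatch(regex, "metadata")))  # False -> fragment skipped
```

Note the trimming direction: shrinking `cut` replaces whole suffix subtrees with `.*`, which can only add accepted strings, so the no-false-negatives guarantee survives any budget.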
Datadog's Husky event store applies this per column, per fragment, storing the filters in its FoundationDB-backed metadata repository. (Source: sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale)
What it's competing against¶
- Min/max range metadata (Parquet's standard). Imprecise because lexicographic ranges on non-leading columns include values not actually present. Husky's example: a fragment whose `service` column holds `compactor`, `reader`, `writer`; a range filter wrongly scans it for `service:metadata` (`metadata` sorts lexicographically between `compactor` and `writer`), while the FSA regex `^(compactor|reader|writer)$` correctly rejects it.
- Bloom filters. Same no-false-negatives contract. Husky describes its filter as "similar to a Bloom filter" but with two differences: (a) the regex form is often smaller for low-cardinality structured tag values, and (b) it's faster to evaluate because it short-circuits on the first mismatched prefix.
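Husky's min/max-vs-regex example can be checked directly. The metadata field names below are illustrative, not Husky's actual schema:

```python
import re

# Fragment whose `service` column holds only compactor/reader/writer.
frag_min, frag_max = "compactor", "writer"                # standard range metadata
frag_filter = re.compile(r"^(compactor|reader|writer)$")  # trimmed-FSA regex

query_value = "metadata"  # predicate service:metadata

range_keeps = frag_min <= query_value <= frag_max   # lexicographic range check
regex_keeps = bool(frag_filter.match(query_value))  # set-membership check
print(range_keeps, regex_keeps)  # True False
```

The range check keeps the fragment (wasted scan); the regex check prunes it before any payload GET.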
When it shines¶
- Low-cardinality structured columns. The canonical Husky examples are observability tags: `service` (tens/hundreds of values), `status` (info/error/…), `env` (prod/staging/dev). Even higher-cardinality tag columns often have regular structure (machine-generated hostnames, dotted service names) that compresses well into an automaton.
- Leveled / concepts/lsm-compaction layouts where higher levels produce narrower per-column value sets, so the filter gets more selective as compaction progresses.
When it falls over¶
- High-cardinality unstructured columns (user IDs, URLs, free-text messages). The FSA either grows past its state budget — forcing aggressive trimming that over-approximates to "accept-all" — or the filter adds metadata cost without pruning.
- Predicates the filter isn't built for. The technique described is equality/set-membership; range predicates on trimmed FSAs would need separate handling.
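A toy illustration of the "accept-all" degradation mode. The collapse strategy here (keep only the values' common prefix, then `.*`) is an assumption for the sketch; it is simply one valid over-approximating trim taken to its limit:

```python
import os.path
import re

# High-cardinality IDs: far too many distinct values for a small FSA budget.
values = [f"user-{n:05d}" for n in range(100_000)]

# One safe (over-approximating) collapse: common prefix + accept-any-suffix.
prefix = os.path.commonprefix(values)
collapsed = re.escape(prefix) + ".*"

# The no-false-negatives contract still holds...
assert all(re.fullmatch(collapsed, v) for v in values)
# ...but probes for absent values also pass: metadata cost, zero pruning.
assert re.fullmatch(collapsed, "user-does-not-exist")
```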
Build-time vs. query-time cost¶
- Build time (at write / compaction): non-trivial — must enumerate every value, build the FSA, trim it, project to regex. Amortised across many queries, worthwhile.
- Query time: ~regex-match-per-fragment-metadata-row. Cheap enough that Husky evaluates it for every fragment whose time bucket overlaps the query, before deciding which fragments to fetch from object storage.
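That query-time loop can be sketched as below. The `FragmentMeta` fields (`t_min`/`t_max` time-bucket bounds plus the stored regex) are hypothetical stand-ins for Husky's metadata rows, not its actual schema:

```python
import re
from dataclasses import dataclass

@dataclass
class FragmentMeta:
    frag_id: str
    t_min: int          # fragment time-bucket bounds (illustrative fields)
    t_max: int
    value_regex: str    # trimmed-FSA regex for the filtered column

def fragments_to_fetch(metas, q_start, q_end, predicate_value):
    """Evaluate the stored regex for every fragment whose time bucket
    overlaps the query; only survivors get a payload GET."""
    keep = []
    for m in metas:
        if m.t_max < q_start or m.t_min > q_end:
            continue  # time bucket does not overlap the query
        if not re.fullmatch(m.value_regex, predicate_value):
            continue  # filter proves no matching rows; skip the GET
        keep.append(m.frag_id)
    return keep

metas = [
    FragmentMeta("f1", 0, 10, "(compactor|reader|writer)"),
    FragmentMeta("f2", 5, 15, "(metadata|reader)"),
    FragmentMeta("f3", 20, 30, ".*"),
]
print(fragments_to_fetch(metas, 0, 12, "metadata"))  # ['f2']
```

`f1` overlaps the time range but its filter rejects `metadata`; `f3` is outside the range entirely; only `f2` is fetched from object storage.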
Seen in¶
- systems/husky — per-sort-column per-fragment filters stored alongside min/max row keys in FoundationDB metadata; used for pre-GET fragment pruning.
- sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale