CONCEPT Cited by 2 sources
Probabilistic data structure¶
Definition¶
A probabilistic data structure trades exact correctness for sub-linear space or time by returning approximate answers with known error bounds. Unlike an exact structure (hash set, B-tree, sorted array), a probabilistic structure's responses may be wrong — but the error is quantifiable, tunable, and typically one-sided (false positives but no false negatives, or vice versa; see concepts/false-positive-vs-false-negative-asymmetry).
The space-for-accuracy trade-off is tuned by one or two parameters (error rate, bit-width, hash count) that relate the structure's size to its accuracy guarantees.
Canonical families¶
| Family | Answers | Error shape | Typical use |
|---|---|---|---|
| Bloom filter | set membership | false positives only | 404 filters, negative caches, LSM pruning |
| Counting Bloom | set membership + deletion | false positives only | eviction sets, streaming membership |
| Cuckoo filter | set membership + deletion | false positives only | similar to Bloom, supports delete |
| Xor filter | set membership (immutable) | false positives only | per-build artefacts, immutable sets |
| HyperLogLog | approximate cardinality | relative error ~1/√m | unique-user counts, analytics |
| Count-min sketch | approximate frequency | one-sided over-estimate | stream top-k, rate limiting |
| DDSketch | quantiles / percentiles | relative-rank error | latency p50/p95/p99 at scale |
| KLL | quantiles / percentiles | bounded rank error | DataSketches-native percentiles |
| t-digest | quantiles / percentiles | rank error bounded at tails | same |
| Theta | distinct count + set algebra | one-sided, relative | audience overlap, incrementality |
| MinHash / SimHash | set similarity | variance shrinks with hash count | near-duplicate detection |
| Skip list | ordered membership | depth is randomised | logarithmic ordered map |
| Reservoir sampling | uniform sample of stream | exact distribution | streaming analytics |
| Approx top-K | heavy hitters | rare items may drop | trending / leaderboards |
| Tuple | distinct-count + per-key metric | Theta + metric aggregation | cardinality + revenue |
Why use one¶
- Scale: exact structures for billions of elements blow the memory budget of a single node. Probabilistic structures can fit a billion-element approximate answer in megabytes.
- Streaming: one-pass, constant-memory updates over unbounded input. Exact structures require unbounded memory for unbounded input.
- Hot-path: a probabilistic "almost certainly not" answer in sub-microseconds is cheaper than a "definitely not" answer in milliseconds. Downstream fallback handles the rare "might be" answers.
- Aggregation / merging: many probabilistic structures (HLL, DDSketch, sketches generally) can be merged across shards to answer global questions from local summaries.
Design axes¶
Every probabilistic structure exposes a trade-off between space, accuracy (error rate or bound), and sometimes insert / query latency. The canonical triangle:
- Fixed space: smaller error costs slower queries (more hash functions, deeper probes).
- Fixed accuracy: less space costs slower / more-complex updates.
- Fixed speed: less space means worse accuracy (shallower sketches).
The asymmetry that makes them useful¶
A probabilistic structure with one-sided error composes well with a downstream authoritative check: the structure returns "no" authoritatively (a true negative that cuts the chain) or "maybe" (consult the authority). This is the shape that powers:
- Negative caching (Bloom filter → storage).
- Column-store pruning (zone map + Bloom → block read).
- Routing decisions (Bloom filter → 404 or storage).
- Cache-aside (counting filter → cache check).
A structure with symmetric error (both over- and under- counting, like HLL in its naive form) is less composable and usually reports relative-error bounds ("95 % confidence interval is ±1 %") rather than pass-through semantics.
Seen in¶
-
sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters — Canonical wiki introduction at the edge-routing altitude. Vercel replaces exact JSON-tree path lookup with a Bloom filter, trading a never-pays-off exact-answer guarantee (the structure was exact but the parse was slow) for a bounded-false-positive answer that's 200× faster at p99. Quote: "Bloom filters can return false positives, but never false negatives. For path lookups, this property is valuable. If the Bloom filter says a path does not exist, we can safely return a 404; if it says a path might exist, we fall back to checking the build outputs."
-
concepts/bloom-filter — Most prominent family member. Vercel's canonical case.
-
— Sketching as lossy compression for ML features. Zalando reframes probabilistic data structures as the JPEG-equivalent of feature storage: "They are essentially a lossy compression algorithm for your features. Just like JPEG compression for your images, it can compress input data at varying compression levels." A Bloom-filter-backed in-process feature store achieves ~30× memory reduction at essentially identical click-prediction AUC. The post explicitly names HyperLogLog, Count-Min Sketch, and Quotient Filters as theoretically applicable but unbenchmarked alternatives — Count-Min in particular would give count estimates per composite key instead of presence. Canonical articulation of the freshness axis (concepts/feature-store-freshness) as the fundamental blocker for in-process sketches: individual deletion is structurally impossible, and every serving node must absorb 100 % of write traffic unless you rebuild a distributed database.
-
concepts/ddsketch-error-bounded-percentile — PlanetScale Insights: relative-rank-error sketch for per-query-pattern latency percentiles. Canonicalised as the quantile-altitude companion to Bloom filter's membership-altitude family.
-
sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Sketches as first-class, mergeable, storable SQL aggregates. Databricks exposes four Apache DataSketches families as Databricks SQL / DataFrame / Structured Streaming aggregates: KLL (quantiles), Theta (distinct + set algebra), approx top-K (heavy hitters), Tuple (distinct + metric aggregation). Canonicalises decision-support vs audit as the architectural test for when sketches apply, and mergeability as the property that turns them from "a faster percentile" into a storage primitive via patterns/precomputed-sketch-column-in-delta-table. Framing: "If knowing '~4.7M unique users ±1%' leads to the same decision as '4,712,389 unique users,' the approximate answer at a fraction of the cost is strictly better."
-
concepts/diff-sketch — Sketches used for compact cross- system state summaries (migration parity, anti-entropy, replication lag).