CONCEPT Cited by 1 source
Probabilistic data structure¶
Definition¶
A probabilistic data structure trades exact correctness for sub-linear space or time by returning approximate answers with known error bounds. Unlike an exact structure (hash set, B-tree, sorted array), a probabilistic structure's responses may be wrong — but the error is quantifiable, tunable, and typically one-sided (false positives but no false negatives, or vice versa; see concepts/false-positive-vs-false-negative-asymmetry).
The space-for-accuracy trade-off is tuned by one or two parameters (error rate, bit-width, hash count) that relate the structure's size to its accuracy guarantees.
Canonical families¶
| Family | Answers | Error shape | Typical use |
|---|---|---|---|
| Bloom filter | set membership | false positives only | 404 filters, negative caches, LSM pruning |
| Counting Bloom | set membership + deletion | false positives only | eviction sets, streaming membership |
| Cuckoo filter | set membership + deletion | false positives only | similar to Bloom, supports delete |
| Xor filter | set membership (immutable) | false positives only | per-build artefacts, immutable sets |
| HyperLogLog | approximate cardinality | relative error ~1/√m | unique-user counts, analytics |
| Count-min sketch | approximate frequency | one-sided over-estimate | stream top-k, rate limiting |
| DDSketch | quantiles / percentiles | relative-rank error | latency p50/p95/p99 at scale |
| t-digest | quantiles / percentiles | rank error bounded at tails | same |
| MinHash / SimHash | set similarity | variance shrinks with hash count | near-duplicate detection |
| Skip list | ordered membership | depth is randomised | logarithmic ordered map |
| Reservoir sampling | uniform sample of stream | exact distribution | streaming analytics |
| Top-K sketch | heavy hitters | bounded rank error | traffic analysis |
Why use one¶
- Scale: exact structures for billions of elements blow the memory budget of a single node. Probabilistic structures can fit a billion-element approximate answer in megabytes.
- Streaming: one-pass, constant-memory updates over unbounded input. Exact structures require unbounded memory for unbounded input.
- Hot-path: a probabilistic "almost certainly not" answer in sub-microseconds is cheaper than a "definitely not" answer in milliseconds. Downstream fallback handles the rare "might be" answers.
- Aggregation / merging: many probabilistic structures (HLL, DDSketch, sketches generally) can be merged across shards to answer global questions from local summaries.
Design axes¶
Every probabilistic structure exposes a trade-off between space, accuracy (error rate or bound), and sometimes insert / query latency. The canonical triangle:
- Fixed space: smaller error costs slower queries (more hash functions, deeper probes).
- Fixed accuracy: less space costs slower / more-complex updates.
- Fixed speed: less space means worse accuracy (shallower sketches).
The asymmetry that makes them useful¶
A probabilistic structure with one-sided error composes well with a downstream authoritative check: the structure returns "no" authoritatively (a true negative that cuts the chain) or "maybe" (consult the authority). This is the shape that powers:
- Negative caching (Bloom filter → storage).
- Column-store pruning (zone map + Bloom → block read).
- Routing decisions (Bloom filter → 404 or storage).
- Cache-aside (counting filter → cache check).
A structure with symmetric error (both over- and under- counting, like HLL in its naive form) is less composable and usually reports relative-error bounds ("95 % confidence interval is ±1 %") rather than pass-through semantics.
Seen in¶
-
sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters — Canonical wiki introduction at the edge-routing altitude. Vercel replaces exact JSON-tree path lookup with a Bloom filter, trading a never-pays-off exact-answer guarantee (the structure was exact but the parse was slow) for a bounded-false-positive answer that's 200× faster at p99. Quote: "Bloom filters can return false positives, but never false negatives. For path lookups, this property is valuable. If the Bloom filter says a path does not exist, we can safely return a 404; if it says a path might exist, we fall back to checking the build outputs."
-
concepts/bloom-filter — Most prominent family member. Vercel's canonical case.
-
concepts/ddsketch-error-bounded-percentile — PlanetScale Insights: relative-rank-error sketch for per-query-pattern latency percentiles. Canonicalised as the quantile-altitude companion to Bloom filter's membership-altitude family.
-
concepts/diff-sketch — Sketches used for compact cross- system state summaries (migration parity, anti-entropy, replication lag).