CONCEPT Cited by 2 sources

Probabilistic data structure¶

Definition¶

A probabilistic data structure trades exact correctness for sub-linear space or time by returning approximate answers with known error bounds. Unlike an exact structure (hash set, B-tree, sorted array), a probabilistic structure's responses may be wrong — but the error is quantifiable, tunable, and typically one-sided (false positives but no false negatives, or vice versa; see concepts/false-positive-vs-false-negative-asymmetry).

The space-for-accuracy trade-off is tuned by one or two parameters (error rate, bit-width, hash count) that relate the structure's size to its accuracy guarantees.

Canonical families¶

Family	Answers	Error shape	Typical use
Bloom filter	set membership	false positives only	404 filters, negative caches, LSM pruning
Counting Bloom	set membership + deletion	false positives only	eviction sets, streaming membership
Cuckoo filter	set membership + deletion	false positives only	similar to Bloom, supports delete
Xor filter	set membership (immutable)	false positives only	per-build artefacts, immutable sets
HyperLogLog	approximate cardinality	relative error ~1/√m	unique-user counts, analytics
Count-min sketch	approximate frequency	one-sided over-estimate	stream top-k, rate limiting
DDSketch	quantiles / percentiles	relative-rank error	latency p50/p95/p99 at scale
KLL	quantiles / percentiles	bounded rank error	DataSketches-native percentiles
t-digest	quantiles / percentiles	rank error bounded at tails	same
Theta	distinct count + set algebra	one-sided, relative	audience overlap, incrementality
MinHash / SimHash	set similarity	variance shrinks with hash count	near-duplicate detection
Skip list	ordered membership	depth is randomised	logarithmic ordered map
Reservoir sampling	uniform sample of stream	exact distribution	streaming analytics
Approx top-K	heavy hitters	rare items may drop	trending / leaderboards
Tuple	distinct-count + per-key metric	Theta + metric aggregation	cardinality + revenue

Why use one¶

Scale: exact structures for billions of elements blow the memory budget of a single node. Probabilistic structures can fit a billion-element approximate answer in megabytes.
Streaming: one-pass, constant-memory updates over unbounded input. Exact structures require unbounded memory for unbounded input.
Hot-path: a probabilistic "almost certainly not" answer in sub-microseconds is cheaper than a "definitely not" answer in milliseconds. Downstream fallback handles the rare "might be" answers.
Aggregation / merging: many probabilistic structures (HLL, DDSketch, sketches generally) can be merged across shards to answer global questions from local summaries.

Design axes¶

Every probabilistic structure exposes a trade-off between space, accuracy (error rate or bound), and sometimes insert / query latency. The canonical triangle:

        Space
         /\
        /  \
       /    \
Accuracy ─── Speed

Fixed space: smaller error costs slower queries (more hash functions, deeper probes).
Fixed accuracy: less space costs slower / more-complex updates.
Fixed speed: less space means worse accuracy (shallower sketches).

The asymmetry that makes them useful¶

A probabilistic structure with one-sided error composes well with a downstream authoritative check: the structure returns "no" authoritatively (a true negative that cuts the chain) or "maybe" (consult the authority). This is the shape that powers:

Negative caching (Bloom filter → storage).
Column-store pruning (zone map + Bloom → block read).
Routing decisions (Bloom filter → 404 or storage).
Cache-aside (counting filter → cache check).

A structure with symmetric error (both over- and under- counting, like HLL in its naive form) is less composable and usually reports relative-error bounds ("95 % confidence interval is ±1 %") rather than pass-through semantics.

Seen in¶

sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters — Canonical wiki introduction at the edge-routing altitude. Vercel replaces exact JSON-tree path lookup with a Bloom filter, trading a never-pays-off exact-answer guarantee (the structure was exact but the parse was slow) for a bounded-false-positive answer that's 200× faster at p99. Quote: "Bloom filters can return false positives, but never false negatives. For path lookups, this property is valuable. If the Bloom filter says a path does not exist, we can safely return a 404; if it says a path might exist, we fall back to checking the build outputs."
concepts/bloom-filter — Most prominent family member. Vercel's canonical case.
— Sketching as lossy compression for ML features. Zalando reframes probabilistic data structures as the JPEG-equivalent of feature storage: "They are essentially a lossy compression algorithm for your features. Just like JPEG compression for your images, it can compress input data at varying compression levels." A Bloom-filter-backed in-process feature store achieves ~30× memory reduction at essentially identical click-prediction AUC. The post explicitly names HyperLogLog, Count-Min Sketch, and Quotient Filters as theoretically applicable but unbenchmarked alternatives — Count-Min in particular would give count estimates per composite key instead of presence. Canonical articulation of the freshness axis (concepts/feature-store-freshness) as the fundamental blocker for in-process sketches: individual deletion is structurally impossible, and every serving node must absorb 100 % of write traffic unless you rebuild a distributed database.
concepts/ddsketch-error-bounded-percentile — PlanetScale Insights: relative-rank-error sketch for per-query-pattern latency percentiles. Canonicalised as the quantile-altitude companion to Bloom filter's membership-altitude family.
sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Sketches as first-class, mergeable, storable SQL aggregates. Databricks exposes four Apache DataSketches families as Databricks SQL / DataFrame / Structured Streaming aggregates: KLL (quantiles), Theta (distinct + set algebra), approx top-K (heavy hitters), Tuple (distinct + metric aggregation). Canonicalises decision-support vs audit as the architectural test for when sketches apply, and mergeability as the property that turns them from "a faster percentile" into a storage primitive via patterns/precomputed-sketch-column-in-delta-table. Framing: "If knowing '~4.7M unique users ±1%' leads to the same decision as '4,712,389 unique users,' the approximate answer at a fraction of the cost is strictly better."
concepts/diff-sketch — Sketches used for compact cross- system state summaries (migration parity, anti-entropy, replication lag).