Skip to content

CONCEPT Cited by 1 source

Probabilistic data structure

Definition

A probabilistic data structure trades exact correctness for sub-linear space or time by returning approximate answers with known error bounds. Unlike an exact structure (hash set, B-tree, sorted array), a probabilistic structure's responses may be wrong — but the error is quantifiable, tunable, and typically one-sided (false positives but no false negatives, or vice versa; see concepts/false-positive-vs-false-negative-asymmetry).

The space-for-accuracy trade-off is tuned by one or two parameters (error rate, bit-width, hash count) that relate the structure's size to its accuracy guarantees.

Canonical families

Family Answers Error shape Typical use
Bloom filter set membership false positives only 404 filters, negative caches, LSM pruning
Counting Bloom set membership + deletion false positives only eviction sets, streaming membership
Cuckoo filter set membership + deletion false positives only similar to Bloom, supports delete
Xor filter set membership (immutable) false positives only per-build artefacts, immutable sets
HyperLogLog approximate cardinality relative error ~1/√m unique-user counts, analytics
Count-min sketch approximate frequency one-sided over-estimate stream top-k, rate limiting
DDSketch quantiles / percentiles relative-rank error latency p50/p95/p99 at scale
t-digest quantiles / percentiles rank error bounded at tails same
MinHash / SimHash set similarity variance shrinks with hash count near-duplicate detection
Skip list ordered membership depth is randomised logarithmic ordered map
Reservoir sampling uniform sample of stream exact distribution streaming analytics
Top-K sketch heavy hitters bounded rank error traffic analysis

Why use one

  • Scale: exact structures for billions of elements blow the memory budget of a single node. Probabilistic structures can fit a billion-element approximate answer in megabytes.
  • Streaming: one-pass, constant-memory updates over unbounded input. Exact structures require unbounded memory for unbounded input.
  • Hot-path: a probabilistic "almost certainly not" answer in sub-microseconds is cheaper than a "definitely not" answer in milliseconds. Downstream fallback handles the rare "might be" answers.
  • Aggregation / merging: many probabilistic structures (HLL, DDSketch, sketches generally) can be merged across shards to answer global questions from local summaries.

Design axes

Every probabilistic structure exposes a trade-off between space, accuracy (error rate or bound), and sometimes insert / query latency. The canonical triangle:

        Space
         /\
        /  \
       /    \
Accuracy ─── Speed
  • Fixed space: smaller error costs slower queries (more hash functions, deeper probes).
  • Fixed accuracy: less space costs slower / more-complex updates.
  • Fixed speed: less space means worse accuracy (shallower sketches).

The asymmetry that makes them useful

A probabilistic structure with one-sided error composes well with a downstream authoritative check: the structure returns "no" authoritatively (a true negative that cuts the chain) or "maybe" (consult the authority). This is the shape that powers:

  • Negative caching (Bloom filter → storage).
  • Column-store pruning (zone map + Bloom → block read).
  • Routing decisions (Bloom filter → 404 or storage).
  • Cache-aside (counting filter → cache check).

A structure with symmetric error (both over- and under- counting, like HLL in its naive form) is less composable and usually reports relative-error bounds ("95 % confidence interval is ±1 %") rather than pass-through semantics.

Seen in

  • sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters — Canonical wiki introduction at the edge-routing altitude. Vercel replaces exact JSON-tree path lookup with a Bloom filter, trading a never-pays-off exact-answer guarantee (the structure was exact but the parse was slow) for a bounded-false-positive answer that's 200× faster at p99. Quote: "Bloom filters can return false positives, but never false negatives. For path lookups, this property is valuable. If the Bloom filter says a path does not exist, we can safely return a 404; if it says a path might exist, we fall back to checking the build outputs."

  • concepts/bloom-filter — Most prominent family member. Vercel's canonical case.

  • concepts/ddsketch-error-bounded-percentile — PlanetScale Insights: relative-rank-error sketch for per-query-pattern latency percentiles. Canonicalised as the quantile-altitude companion to Bloom filter's membership-altitude family.

  • concepts/diff-sketch — Sketches used for compact cross- system state summaries (migration parity, anti-entropy, replication lag).

Last updated · 476 distilled / 1,218 read