Skip to content

CONCEPT Cited by 2 sources

Probabilistic data structure

Definition

A probabilistic data structure trades exact correctness for sub-linear space or time by returning approximate answers with known error bounds. Unlike an exact structure (hash set, B-tree, sorted array), a probabilistic structure's responses may be wrong — but the error is quantifiable, tunable, and typically one-sided (false positives but no false negatives, or vice versa; see concepts/false-positive-vs-false-negative-asymmetry).

The space-for-accuracy trade-off is tuned by one or two parameters (error rate, bit-width, hash count) that relate the structure's size to its accuracy guarantees.

Canonical families

Family Answers Error shape Typical use
Bloom filter set membership false positives only 404 filters, negative caches, LSM pruning
Counting Bloom set membership + deletion false positives only eviction sets, streaming membership
Cuckoo filter set membership + deletion false positives only similar to Bloom, supports delete
Xor filter set membership (immutable) false positives only per-build artefacts, immutable sets
HyperLogLog approximate cardinality relative error ~1/√m unique-user counts, analytics
Count-min sketch approximate frequency one-sided over-estimate stream top-k, rate limiting
DDSketch quantiles / percentiles relative-rank error latency p50/p95/p99 at scale
KLL quantiles / percentiles bounded rank error DataSketches-native percentiles
t-digest quantiles / percentiles rank error bounded at tails same
Theta distinct count + set algebra one-sided, relative audience overlap, incrementality
MinHash / SimHash set similarity variance shrinks with hash count near-duplicate detection
Skip list ordered membership depth is randomised logarithmic ordered map
Reservoir sampling uniform sample of stream exact distribution streaming analytics
Approx top-K heavy hitters rare items may drop trending / leaderboards
Tuple distinct-count + per-key metric Theta + metric aggregation cardinality + revenue

Why use one

  • Scale: exact structures for billions of elements blow the memory budget of a single node. Probabilistic structures can fit a billion-element approximate answer in megabytes.
  • Streaming: one-pass, constant-memory updates over unbounded input. Exact structures require unbounded memory for unbounded input.
  • Hot-path: a probabilistic "almost certainly not" answer in sub-microseconds is cheaper than a "definitely not" answer in milliseconds. Downstream fallback handles the rare "might be" answers.
  • Aggregation / merging: many probabilistic structures (HLL, DDSketch, sketches generally) can be merged across shards to answer global questions from local summaries.

Design axes

Every probabilistic structure exposes a trade-off between space, accuracy (error rate or bound), and sometimes insert / query latency. The canonical triangle:

        Space
         /\
        /  \
       /    \
Accuracy ─── Speed
  • Fixed space: smaller error costs slower queries (more hash functions, deeper probes).
  • Fixed accuracy: less space costs slower / more-complex updates.
  • Fixed speed: less space means worse accuracy (shallower sketches).

The asymmetry that makes them useful

A probabilistic structure with one-sided error composes well with a downstream authoritative check: the structure returns "no" authoritatively (a true negative that cuts the chain) or "maybe" (consult the authority). This is the shape that powers:

  • Negative caching (Bloom filter → storage).
  • Column-store pruning (zone map + Bloom → block read).
  • Routing decisions (Bloom filter → 404 or storage).
  • Cache-aside (counting filter → cache check).

A structure with symmetric error (both over- and under- counting, like HLL in its naive form) is less composable and usually reports relative-error bounds ("95 % confidence interval is ±1 %") rather than pass-through semantics.

Seen in

  • sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters — Canonical wiki introduction at the edge-routing altitude. Vercel replaces exact JSON-tree path lookup with a Bloom filter, trading a never-pays-off exact-answer guarantee (the structure was exact but the parse was slow) for a bounded-false-positive answer that's 200× faster at p99. Quote: "Bloom filters can return false positives, but never false negatives. For path lookups, this property is valuable. If the Bloom filter says a path does not exist, we can safely return a 404; if it says a path might exist, we fall back to checking the build outputs."

  • concepts/bloom-filter — Most prominent family member. Vercel's canonical case.

  • Sketching as lossy compression for ML features. Zalando reframes probabilistic data structures as the JPEG-equivalent of feature storage: "They are essentially a lossy compression algorithm for your features. Just like JPEG compression for your images, it can compress input data at varying compression levels." A Bloom-filter-backed in-process feature store achieves ~30× memory reduction at essentially identical click-prediction AUC. The post explicitly names HyperLogLog, Count-Min Sketch, and Quotient Filters as theoretically applicable but unbenchmarked alternatives — Count-Min in particular would give count estimates per composite key instead of presence. Canonical articulation of the freshness axis (concepts/feature-store-freshness) as the fundamental blocker for in-process sketches: individual deletion is structurally impossible, and every serving node must absorb 100 % of write traffic unless you rebuild a distributed database.

  • concepts/ddsketch-error-bounded-percentile — PlanetScale Insights: relative-rank-error sketch for per-query-pattern latency percentiles. Canonicalised as the quantile-altitude companion to Bloom filter's membership-altitude family.

  • sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analyticsSketches as first-class, mergeable, storable SQL aggregates. Databricks exposes four Apache DataSketches families as Databricks SQL / DataFrame / Structured Streaming aggregates: KLL (quantiles), Theta (distinct + set algebra), approx top-K (heavy hitters), Tuple (distinct + metric aggregation). Canonicalises decision-support vs audit as the architectural test for when sketches apply, and mergeability as the property that turns them from "a faster percentile" into a storage primitive via patterns/precomputed-sketch-column-in-delta-table. Framing: "If knowing '~4.7M unique users ±1%' leads to the same decision as '4,712,389 unique users,' the approximate answer at a fraction of the cost is strictly better."

  • concepts/diff-sketch — Sketches used for compact cross- system state summaries (migration parity, anti-entropy, replication lag).

Last updated · 542 distilled / 1,571 read