
CONCEPT Cited by 1 source

Bloom filter

Definition

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It returns:

  • Definitely not in the set (a true negative), or
  • Possibly in the set (which may be a false positive).

It never returns definitely in the set and never returns a false negative. That asymmetry is the load-bearing property that makes Bloom filters useful as the pre-filter in front of a slower authoritative check.

Mechanism

A Bloom filter is specified by four parameters and one seed:

  • n — expected number of elements inserted.
  • p — desired false-positive rate (e.g. 1e-7 for one FP per 10 million negative queries).
  • m — size of the underlying bit array (derived from n and p).
  • k — number of independent hash functions (derived from m and n).
  • s — seed / salt used to vary the first hash function (the remaining k-1 hashes are typically derived by adding offset constants or composing with counter values).

The formulas:

m = -(n × ln p) / (ln 2)²
k = (m / n) × ln 2
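These two formulas can be checked directly. A quick sketch in Python (illustrative only; the source gives no reference implementation, and `bloom_params` is a name chosen here):

```python
import math

def bloom_params(n: int, p: float) -> tuple[int, int]:
    """Derive bit-array size m and hash count k from capacity n and FP target p."""
    m = math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k

# n=10, p=1e-7 are the parameters in the Vercel file format shown on this page.
print(bloom_params(10, 1e-7))  # (336, 23)
```

Note that these reproduce exactly the `m: 336, k: 23` in the Vercel example below.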

The bit array is initialised to all zeros.

Insert(x): compute k hashes of x, each modulo m, and set each corresponding bit in the array to 1.

Query(x): compute the same k hashes of x, modulo m, and check each bit.

  • If any bit is 0 → x is definitely not in the set.
  • If all bits are 1 → x is possibly in the set. Fall back to the authoritative check (e.g. a database lookup, a storage fetch, a slower data structure).
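Insert and Query can be sketched as follows. This is a minimal illustration, not Vercel's implementation: the source does not specify their hash scheme, so this sketch derives the k indices from one SHA-256 digest via double hashing (h1 + i·h2 mod m), a common trick.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int, seed: int = 0):
        self.m, self.k, self.seed = m, k, seed
        self.bits = bytearray((m + 7) // 8)  # bit array, initialised to all zeros

    def _positions(self, item: str):
        # k indices from one digest via double hashing; seed salts the hash.
        digest = hashlib.sha256(f"{self.seed}:{item}".encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # only ever sets bits

    def query(self, item: str) -> bool:
        # False -> definitely not present; True -> possibly present,
        # so fall through to the authoritative check.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

After `bf.insert("/docs")`, `bf.query("/docs")` is guaranteed `True` (no false negatives); a never-inserted path returns `False` except with probability ≈ p.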

Why the asymmetry

A false negative would require a bit that was set during some Insert(y) to have been cleared afterwards. Since insertion only sets bits (never clears them), and the query uses the same k hashes in the same positions, any member of the set must still have all k bits set at query time. So false negatives are impossible.

A false positive occurs when other elements' hashes happen to cover all k of x's hash positions, even though x was never inserted. The probability grows with the fill fraction of the bit array; for fixed m and n it is minimised at k = (m/n) × ln 2, below which too few bits are probed per query and above which the array over-saturates.
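That probability has a standard closed-form approximation (not spelled out in the source): p ≈ (1 − e^(−kn/m))^k, where 1 − e^(−kn/m) is the expected fraction of bits set after n insertions. At the optimal k this fraction is 1/2, giving p ≈ 2^(−k). A sketch:

```python
import math

def fp_rate(n: int, m: int, k: int) -> float:
    """Approximate false-positive rate after n insertions into m bits with k hashes."""
    fill = 1 - math.exp(-k * n / m)  # expected fraction of bits set to 1
    return fill ** k                 # chance that all k probed bits collide

# With the Vercel example parameters (n=10, m=336, k=23), this lands
# close to the 1e-7 design target.
print(f"{fp_rate(10, 336, 23):.1e}")  # ≈ 9.8e-08
```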

Canonical Vercel disclosure

Vercel's 2026-04-21 blog post canonicalised the full parameter vector in a production file format:

{"version":"test","bloom":{"n":10,"p":1e-7,"m":336,"k":23,"s":0}}
"0kxC4anU4awVOYSs54vsAL7gBNGK/PrLjKrAJRil64mMxmiig1S+jqyC"

Line 1 is the JSON object with parameters (n, p, m, k, s); line 2 is the Base64-encoded bit array. (Source: sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters.) Vercel uses this shape to pass the filter from their build service (which generates it from the full path list) to their routing service (which queries it on every request).
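A sketch of producing that two-line shape. Assumptions are flagged here: the hash function, bit ordering, and float formatting are unspecified in the source, so the output will not match Vercel's byte-for-byte (e.g. Python serialises `1e-7` as `1e-07`), but the structure is the same: JSON parameter line, then Base64 bit array.

```python
import base64
import json

def serialize(n, p, m, k, s, bits: bytearray) -> str:
    """Emit the two-line shape: JSON parameter object, then Base64 bit array."""
    header = json.dumps(
        {"version": "test", "bloom": {"n": n, "p": p, "m": m, "k": k, "s": s}},
        separators=(",", ":"),
    )
    return header + "\n" + base64.b64encode(bytes(bits)).decode("ascii")

# 336 bits -> 42 bytes -> 56 Base64 characters, matching the length of
# line 2 in the example above (here an all-zero bit array for illustration).
print(serialize(10, 1e-7, 336, 23, 0, bytearray(42)))
```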

Trade-offs vs alternatives

| Structure | Membership | Insertion | Deletion | False pos | False neg | Size |
|---|---|---|---|---|---|---|
| Hash set | exact | yes | yes | none | none | ~(n × avg_key_bytes) |
| B-tree / trie | exact | yes | yes | none | none | ~(n × avg_key_bytes) |
| Bloom filter | probabilistic | yes | no | tunable (p) | none | ~(−n × log₂ p / ln 2) bits |
| Counting Bloom filter | probabilistic | yes | yes (via counters) | tunable | none | ~4× Bloom bits |
| Cuckoo filter | probabilistic | yes | yes | tunable | none | slightly smaller than Bloom at same p |
| Xor filter | probabilistic | batch only | no | tunable | none | ~20 % smaller than Bloom |

Use a Bloom filter when:

  • You can tolerate false positives (the fallback is cheap).
  • You cannot tolerate false negatives (they're structural bugs downstream — missed 404s, SEO breakage, data-loss claims, etc.).
  • The authoritative set is too large to fit in memory but membership queries are on the hot path.
  • Insertions are streaming / incremental; deletions are rare or batch-rebuild is acceptable.

Use a Cuckoo filter instead when deletion is a first-class operation (e.g. maintaining an eviction set).

Use an xor filter instead when the set is immutable and you can batch-build (e.g. per-deployment artefacts like Vercel's path list; Vercel could migrate from Bloom to xor for another ~20 % size reduction).

Applications

  • Negative caches / 404 filters: return a fast negative without touching storage. Vercel's canonical case.
  • Column-store / LSM pruning: Apache Parquet, RocksDB, BigQuery, ClickHouse all use Bloom filters per SST / per row group to skip blocks that can't contain a predicate's target. See concepts/fragment-pruning.
  • Web caches: Google Chrome shipped a Bloom filter of known-malicious URLs so browsers could check locally before confirming with the server.
  • CDN / DNS negative caches: before hitting authoritative DNS, check a filter of known-nonexistent names.
  • Distributed systems anti-entropy: compact membership digests exchanged between replicas.
  • Deduplication: see Google's BigTable, Cassandra, HBase for per-SSTable Bloom filters.
  • Networking: routing-table approximations, content-addressable routing.

Size intuition

For n = 1 million paths and p = 1e-7:

m = -(10⁶ × ln(10⁻⁷)) / (ln 2)²
  = -(10⁶ × -16.12) / 0.4805
  ≈ 33.5 million bits
  ≈ 4.2 MB (uncompressed)
  k = (33.5e6 / 10⁶) × ln 2 ≈ 23 hash functions

For the same set as a JSON tree of paths (~20 chars average × 10⁶ paths): ~20 MB plus tree overhead. Bloom filter wins by 4–5× on size even before Base64 encoding overhead.
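The arithmetic above, checked end to end (the ~20-byte average path length is the document's own working assumption):

```python
import math

n, p = 10**6, 1e-7
m = -(n * math.log(p)) / math.log(2) ** 2   # bits
k = (m / n) * math.log(2)                   # hash functions
bloom_mb = m / 8 / 1e6
print(f"{m/1e6:.1f} Mbit = {bloom_mb:.1f} MB, k = {k:.0f}")  # 33.5 Mbit = 4.2 MB, k = 23

json_mb = 20 * n / 1e6   # ~20 bytes/path as flat strings, before tree overhead
print(f"{json_mb / bloom_mb:.1f}x smaller")  # 4.8x smaller
```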

Pitfalls

  • Over-sizing m wastes memory; under-sizing inflates p catastrophically (the FP rate grows faster than linearly past the design capacity).
  • Too few hash functions → each element is represented by too few bits, so fewer accidental collisions suffice to produce a false positive; higher FP rate.
  • Too many hash functions → the array saturates (many bits set to 1) faster than it should, collision rate spikes.
  • Inserting beyond capacity n: FP rate diverges from the design target, and at tight targets the divergence is severe — at 1.5 × n with a 1e-7 design, p can rise by two to three orders of magnitude. If the set is incremental and growing, scalable Bloom filters (chains of filters with increasing m) preserve target p as capacity grows.
  • Not hashing path-normalised inputs: querying /foo/ after inserting /foo produces a false negative from the caller's perspective (structural, not FP) if normalisation is inconsistent between insert and query.
  • No intrinsic support for deletion: see counting Bloom / cuckoo filters.
  • Cross-language compatibility: the hash choice, bit indexing, and seed semantics must all match byte-for-byte between producer and consumer. Vercel implemented matching Bloom-filter logic in both the build service's and the routing service's codebases rather than relying on a shared library (Source: sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters).
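The overfill pitfall can be quantified with the standard approximation p ≈ (1 − e^(−kn/m))^k. A sketch using a filter sized for 1 million paths at p = 1e-7 (real rates also depend on hash quality):

```python
import math

def fp_rate(n_inserted: int, m: int, k: int) -> float:
    # Each of the k probed bits is 1 with probability equal to the
    # expected fill fraction of the array after n_inserted insertions.
    return (1 - math.exp(-k * n_inserted / m)) ** k

# Sized for n = 1e6 at p = 1e-7 (m ≈ 33.5 Mbit, k = 23).
m = math.ceil(-(10**6 * math.log(1e-7)) / math.log(2) ** 2)
print(f"{fp_rate(10**6, m, 23):.1e}")      # ≈ 1.0e-07 at design capacity
print(f"{fp_rate(1_500_000, m, 23):.1e}")  # ≈ 3.8e-05 at 1.5x capacity
```

At 1.5× the design capacity the rate is hundreds of times worse than the target, which is why overfilling is listed as a pitfall rather than a graceful degradation.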

Seen in

  • sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters — Canonical wiki introduction. Vercel's global routing service replaces a 1.5+ MB JSON path-tree lookup with a Bloom filter to answer "does this path exist in this deployment?" before serving or 404-ing. Path-lookup p99 drops 200× (~100 ms → ~0.5 ms); p999 drops 100× (~250 ms → ~2.4 ms); routing-service heap / memory drops 15 %; aggregate TTFB p75–p99 improves 10 %. The Bloom filter ships to the routing service as JSONL: line 1 = parameters (n, p, m, k, s); line 2 = Base64-encoded bit array. Canonical false-positive / false-negative asymmetry worked example: "If the Bloom filter says a path does not exist, we can safely return a 404; if it says a path might exist, we fall back to checking the build outputs. ... We can't afford false negatives (returning 404 for valid pages), and Bloom filters guarantee this won't happen."

  • patterns/trimmed-automaton-predicate-filter — Datadog's Husky-columnar-predicate-pruning pattern; lists Bloom filters as the probabilistic-filter sibling of trimmed automata for skipping SST blocks in indexed log queries. Same no-false-negatives contract, different input shape (regex / FSA vs hashed equality).

  • patterns/two-stage-evaluation — General two-stage filter-then-verify pattern; Bloom filters named as the canonical stage-1 fast-negative for the stage-2 slower authoritative check.

  • patterns/shadow-migration — Bloom-filter-as-migration-parity check: both old and new systems generate a Bloom filter of their output set; parity check verifies both sides produced the same membership before cutover.

  • concepts/fragment-pruning — Columnar storage: per-column Bloom filters let the query engine skip row groups that can't contain any matching predicate value.
