CONCEPT Cited by 2 sources

Bloom filter¶

Definition¶

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It returns:

Definitely not in the set (a true negative), or
Possibly in the set (which may be a false positive).

It never returns definitely in the set and never returns a false negative. That asymmetry is the load-bearing property that makes Bloom filters useful as the pre-filter in front of a slower authoritative check.

Mechanism¶

A Bloom filter is specified by four parameters and one seed:

n — expected number of elements inserted.
p — desired false-positive rate (e.g. 1e-7 for one FP per 10 million negative queries).
m — size of the underlying bit array (derived from n and p).
k — number of independent hash functions (derived from m and n).
s — seed / salt used to vary the first hash function (the remaining k-1 hashes are typically derived by adding offset constants or composing with counter values).

The formulas:

m = -(n × ln p) / (ln 2)²
k = (m / n) × ln 2

The bit array is initialised to all zeros.

Insert(x): compute k hashes of x, each modulo m, and set each corresponding bit in the array to 1.

Query(x): compute the same k hashes of x, modulo m, and check each bit.

If any bit is 0 → x is definitely not in the set.
If all bits are 1 → x is possibly in the set. Fall back to the authoritative check (e.g. a database lookup, a storage fetch, a slower data structure).

Why the asymmetry¶

A false negative would require a bit that was set during some Insert(y) to have been cleared afterwards. Since insertion only sets bits (never clears them), and the query uses the same k hashes in the same positions, any member of the set must still have all k bits set at query time. So false negatives are impossible.

A false positive occurs when another element's k hashes happen to collide with x's k hash positions, even though x was never inserted. The probability of this grows with the load factor of the bit array and falls exponentially with k (up to the optimal k = (m/n) × ln 2).

Canonical Vercel disclosure¶

Vercel's 2026-04-21 blog post canonicalised the full parameter vector in a production file format:

{"version":"test","bloom":{"n":10,"p":1e-7,"m":336,"k":23,"s":0}}
"0kxC4anU4awVOYSs54vsAL7gBNGK/PrLjKrAJRil64mMxmiig1S+jqyC"

Line 1 is the JSON object with parameters (n, p, m, k, s); line 2 is the Base64-encoded bit array. (Source: sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters.) Vercel uses this shape to pass the filter from their build service (which generates it from the full path list) to their routing service (which queries it on every request).

Trade-offs vs alternatives¶

Structure	Membership	Insertion	Deletion	False pos	False neg	Size
Hash set	exact	yes	yes	none	none	~(n × avg_key_bytes)
B-tree / trie	exact	yes	yes	none	none	~(n × avg_key_bytes)
Bloom filter	probabilistic	yes	no	tunable (p)	none	~(-n × log₂p / ln 2) bits
Counting Bloom filter	probabilistic	yes	yes (via counters)	tunable	none	~4× bits
Cuckoo filter	probabilistic	yes	yes	tunable	none	slightly smaller at same p
Xor filter	probabilistic	batch only	no	tunable	none	~20 % smaller than Bloom

Use a Bloom filter when:

You can tolerate false positives (the fallback is cheap).
You cannot tolerate false negatives (they're structural bugs downstream — missed 404s, SEO breakage, data-loss claims, etc.).
The authoritative set is too large to fit in memory but membership queries are on the hot path.
Insertions are streaming / incremental; deletions are rare or batch-rebuild is acceptable.

Use a Cuckoo filter instead when deletion is a first-class operation (e.g. maintaining an eviction set).

Use a xor filter instead when the set is immutable and you can batch-build (e.g. per-deployment artefacts like Vercel's path list; Vercel could migrate from Bloom to xor for another 20 % size reduction).

Applications¶

Negative caches / 404 filters: return a fast negative without touching storage. Vercel's canonical case.
Column-store / LSM pruning: Apache Parquet, RocksDB, BigQuery, ClickHouse all use Bloom filters per SST / per row group to skip blocks that can't contain a predicate's target. See concepts/fragment-pruning.
Web caches: Google Chrome shipped a Bloom filter of known-malicious URLs so browsers could check locally before confirming with the server.
CDN / DNS negative caches: before hitting authoritative DNS, check a filter of known-nonexistent names.
Distributed systems anti-entropy: compact membership digests exchanged between replicas.
Deduplication: see Google's BigTable, Cassandra, HBase for per-SSTable Bloom filters.
Networking: routing-table approximations, content- addressable routing.

Size intuition¶

For n = 1 million paths and p = 1e-7:

m = -(10⁶ × ln(10⁻⁷)) / (ln 2)²
  = -(10⁶ × -16.12) / 0.4805
  ≈ 33.5 million bits
  ≈ 4.2 MB (uncompressed)
  k = (33.5e6 / 10⁶) × ln 2 ≈ 23 hash functions

For the same set as a JSON tree of paths (~20 chars average × 10⁶ paths): ~20 MB plus tree overhead. Bloom filter wins by 4–5× on size even before Base64 encoding overhead.

Pitfalls¶

Over-sizing m wastes memory; under-sizing inflates p catastrophically (the FP rate grows faster than linearly past the design capacity).
Too few hash functions → under-decoration of the bit array, higher FP rate.
Too many hash functions → the array saturates (many bits set to 1) faster than it should, collision rate spikes.
Inserting beyond capacity n: FP rate diverges from the design target. At 1.5 × n, p roughly doubles. If the set is incremental and growing, scalable Bloom filters (chains of filters with increasing m) preserve target p as capacity grows.
Not hashing path-normalised inputs: querying /foo/ after inserting /foo produces a false negative from the caller's perspective (structural, not FP) if normalisation is inconsistent between insert and query.
No intrinsic support for deletion: see counting Bloom / cuckoo filters.
Cross-language compatibility: the hash choice, bit indexing, and seed semantics must all match byte-for-byte between producer and consumer. Vercel implemented matching Bloom-filter logic in both the build service and routing service's codebases rather than relying on a shared library (Source: sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters).

Seen in¶

sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters — Canonical wiki introduction. Vercel's global routing service replaces a 1.5+ MB JSON path-tree lookup with a Bloom filter to answer "does this path exist in this deployment?" before serving or 404-ing. Path-lookup p99 drops 200× (~100 ms → ~0.5 ms); p999 drops 100× (~250 ms → ~2.4 ms); routing-service heap / memory drops 15 %; aggregate TTFB p75–p99 improves 10 %. The Bloom filter ships to the routing service as JSONL: line 1 = parameters (n, p, m, k, s); line 2 = Base64-encoded bit array. Canonical false-positive / false-negative asymmetry worked example: "If the Bloom filter says a path does not exist, we can safely return a 404; if it says a path might exist, we fall back to checking the build outputs. ... We can't afford false negatives (returning 404 for valid pages), and Bloom filters guarantee this won't happen."
sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads — Bloom filter as a read-path divert gate for partition splitting. Netflix TimeSeries Abstraction uses an in-memory Bloom filter on every read to decide whether to consult the wide_row metadata table (split exists) or read from the original partition (no split). The filter is loaded periodically with partition keys of COMPLETED splits. Bloom-filter check cost: single-digit microseconds, "making this diversion practically invisible to the callers." Memory footprint is monitored: "the filters fit comfortably in each server instance" due to compactness of partition keys plus low ratio of split partitions. Same false-positive / no-false- negative property is load-bearing here as in the Vercel case: a filter miss can safely bypass the metadata lookup; a false-positive only costs the metadata round trip (which returns no entry and the read proceeds against the original — see patterns/keep-original-partition-as-fallback-during-split). Canonicalised as patterns/bloom-filter-redirect-to-split-partition.
— Machine-learning feature store as a Bloom filter. Zalando's benchmark reframes the bloom filter as a lossy compressor for historical features: composite keys f"{user}^{article}" go into one in-process filter; at serving time the ranker enumerates all possible articles and probes the filter per user. 30× memory reduction at lossless AUC (470 MB sketch vs 15 GB conventional KV feature store; AUC 0.7997 vs 0.80); 330× reduction at 90 % of the historical-feature AUC uplift (40 MB, AUC 0.79). False positives flip binary features 0 → 1, absorbed by the classifier as input noise rather than misclassification — the concepts/false-positive-vs-false-negative-asymmetry property is load-bearing. Disqualifying limitations: every serving node must absorb 100 % of write traffic (no partitioning without rebuilding a distributed database), individual deletion is impossible (expiry forces full rebuild), and feature vocabularies must be bounded and enumerable. Introduces the sketching feature store as a third substrate adjacent to online and offline feature stores.
patterns/trimmed-automaton-predicate-filter — Datadog's Husky-columnar-predicate-pruning pattern; lists Bloom filters as the probabilistic-filter sibling of trimmed automata for skipping SST blocks in indexed log queries. Same no-false-negatives contract, different input shape (regex / FSA vs hashed equality).
patterns/two-stage-evaluation — General two-stage filter-then-verify pattern; Bloom filters named as the canonical stage-1 fast-negative for the stage-2 slower authoritative check.
patterns/shadow-migration — Bloom-filter-as-migration- parity check: both old and new systems generate a Bloom filter of their output set; parity check verifies both sides produced the same membership before cutover.
concepts/fragment-pruning — Columnar storage: per- column Bloom filters let the query engine skip row groups that can't contain any matching predicate value.