CONCEPT Cited by 2 sources
Bloom filter¶
Definition¶
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It returns:
- Definitely not in the set (a true negative), or
- Possibly in the set (which may be a false positive).
It never returns definitely in the set and never returns a false negative. That asymmetry is the load-bearing property that makes Bloom filters useful as the pre-filter in front of a slower authoritative check.
Mechanism¶
A Bloom filter is specified by four parameters and one seed:
n— expected number of elements inserted.p— desired false-positive rate (e.g.1e-7for one FP per 10 million negative queries).m— size of the underlying bit array (derived fromnandp).k— number of independent hash functions (derived frommandn).s— seed / salt used to vary the first hash function (the remaining k-1 hashes are typically derived by adding offset constants or composing with counter values).
The formulas:
The bit array is initialised to all zeros.
Insert(x): compute k hashes of x, each modulo m, and
set each corresponding bit in the array to 1.
Query(x): compute the same k hashes of x, modulo m,
and check each bit.
- If any bit is 0 →
xis definitely not in the set. - If all bits are 1 →
xis possibly in the set. Fall back to the authoritative check (e.g. a database lookup, a storage fetch, a slower data structure).
Why the asymmetry¶
A false negative would require a bit that was set during some
Insert(y) to have been cleared afterwards. Since insertion
only sets bits (never clears them), and the query uses the
same k hashes in the same positions, any member of the set
must still have all k bits set at query time. So false
negatives are impossible.
A false positive occurs when another element's k hashes
happen to collide with x's k hash positions, even though
x was never inserted. The probability of this grows with
the load factor of the bit array and falls exponentially
with k (up to the optimal k = (m/n) × ln 2).
Canonical Vercel disclosure¶
Vercel's 2026-04-21 blog post canonicalised the full parameter vector in a production file format:
{"version":"test","bloom":{"n":10,"p":1e-7,"m":336,"k":23,"s":0}}
"0kxC4anU4awVOYSs54vsAL7gBNGK/PrLjKrAJRil64mMxmiig1S+jqyC"
Line 1 is the JSON object with parameters (n, p, m,
k, s); line 2 is the Base64-encoded bit array. (Source:
sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters.)
Vercel uses this shape to pass the filter from their build
service (which generates it from the full path list) to their
routing service (which queries it on every request).
Trade-offs vs alternatives¶
| Structure | Membership | Insertion | Deletion | False pos | False neg | Size |
|---|---|---|---|---|---|---|
| Hash set | exact | yes | yes | none | none | ~(n × avg_key_bytes) |
| B-tree / trie | exact | yes | yes | none | none | ~(n × avg_key_bytes) |
| Bloom filter | probabilistic | yes | no | tunable (p) | none | ~(-n × log₂p / ln 2) bits |
| Counting Bloom filter | probabilistic | yes | yes (via counters) | tunable | none | ~4× bits |
| Cuckoo filter | probabilistic | yes | yes | tunable | none | slightly smaller at same p |
| Xor filter | probabilistic | batch only | no | tunable | none | ~20 % smaller than Bloom |
Use a Bloom filter when:
- You can tolerate false positives (the fallback is cheap).
- You cannot tolerate false negatives (they're structural bugs downstream — missed 404s, SEO breakage, data-loss claims, etc.).
- The authoritative set is too large to fit in memory but membership queries are on the hot path.
- Insertions are streaming / incremental; deletions are rare or batch-rebuild is acceptable.
Use a Cuckoo filter instead when deletion is a first-class operation (e.g. maintaining an eviction set).
Use a xor filter instead when the set is immutable and you can batch-build (e.g. per-deployment artefacts like Vercel's path list; Vercel could migrate from Bloom to xor for another 20 % size reduction).
Applications¶
- Negative caches / 404 filters: return a fast negative without touching storage. Vercel's canonical case.
- Column-store / LSM pruning: Apache Parquet, RocksDB, BigQuery, ClickHouse all use Bloom filters per SST / per row group to skip blocks that can't contain a predicate's target. See concepts/fragment-pruning.
- Web caches: Google Chrome shipped a Bloom filter of known-malicious URLs so browsers could check locally before confirming with the server.
- CDN / DNS negative caches: before hitting authoritative DNS, check a filter of known-nonexistent names.
- Distributed systems anti-entropy: compact membership digests exchanged between replicas.
- Deduplication: see Google's BigTable, Cassandra, HBase for per-SSTable Bloom filters.
- Networking: routing-table approximations, content- addressable routing.
Size intuition¶
For n = 1 million paths and p = 1e-7:
m = -(10⁶ × ln(10⁻⁷)) / (ln 2)²
= -(10⁶ × -16.12) / 0.4805
≈ 33.5 million bits
≈ 4.2 MB (uncompressed)
k = (33.5e6 / 10⁶) × ln 2 ≈ 23 hash functions
For the same set as a JSON tree of paths (~20 chars average × 10⁶ paths): ~20 MB plus tree overhead. Bloom filter wins by 4–5× on size even before Base64 encoding overhead.
Pitfalls¶
- Over-sizing
mwastes memory; under-sizing inflatespcatastrophically (the FP rate grows faster than linearly past the design capacity). - Too few hash functions → under-decoration of the bit array, higher FP rate.
- Too many hash functions → the array saturates (many bits set to 1) faster than it should, collision rate spikes.
- Inserting beyond capacity
n: FP rate diverges from the design target. At1.5 × n,proughly doubles. If the set is incremental and growing, scalable Bloom filters (chains of filters with increasingm) preserve targetpas capacity grows. - Not hashing path-normalised inputs: querying
/foo/after inserting/fooproduces a false negative from the caller's perspective (structural, not FP) if normalisation is inconsistent between insert and query. - No intrinsic support for deletion: see counting Bloom / cuckoo filters.
- Cross-language compatibility: the hash choice, bit indexing, and seed semantics must all match byte-for-byte between producer and consumer. Vercel implemented matching Bloom-filter logic in both the build service and routing service's codebases rather than relying on a shared library (Source: sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters).
Seen in¶
-
sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters — Canonical wiki introduction. Vercel's global routing service replaces a 1.5+ MB JSON path-tree lookup with a Bloom filter to answer "does this path exist in this deployment?" before serving or 404-ing. Path-lookup p99 drops 200× (~100 ms → ~0.5 ms); p999 drops 100× (~250 ms → ~2.4 ms); routing-service heap / memory drops 15 %; aggregate TTFB p75–p99 improves 10 %. The Bloom filter ships to the routing service as JSONL: line 1 = parameters (
n,p,m,k,s); line 2 = Base64-encoded bit array. Canonical false-positive / false-negative asymmetry worked example: "If the Bloom filter says a path does not exist, we can safely return a 404; if it says a path might exist, we fall back to checking the build outputs. ... We can't afford false negatives (returning 404 for valid pages), and Bloom filters guarantee this won't happen." -
sources/2026-06-03-netflix-dynamically-splitting-wide-partitions-in-cassandra-for-time-series-workloads — Bloom filter as a read-path divert gate for partition splitting. Netflix TimeSeries Abstraction uses an in-memory Bloom filter on every read to decide whether to consult the
wide_rowmetadata table (split exists) or read from the original partition (no split). The filter is loaded periodically with partition keys of COMPLETED splits. Bloom-filter check cost: single-digit microseconds, "making this diversion practically invisible to the callers." Memory footprint is monitored: "the filters fit comfortably in each server instance" due to compactness of partition keys plus low ratio of split partitions. Same false-positive / no-false- negative property is load-bearing here as in the Vercel case: a filter miss can safely bypass the metadata lookup; a false-positive only costs the metadata round trip (which returns no entry and the read proceeds against the original — see patterns/keep-original-partition-as-fallback-during-split). Canonicalised as patterns/bloom-filter-redirect-to-split-partition. -
— Machine-learning feature store as a Bloom filter. Zalando's benchmark reframes the bloom filter as a lossy compressor for historical features: composite keys
f"{user}^{article}"go into one in-process filter; at serving time the ranker enumerates all possible articles and probes the filter per user. 30× memory reduction at lossless AUC (470 MB sketch vs 15 GB conventional KV feature store; AUC 0.7997 vs 0.80); 330× reduction at 90 % of the historical-feature AUC uplift (40 MB, AUC 0.79). False positives flip binary features 0 → 1, absorbed by the classifier as input noise rather than misclassification — the concepts/false-positive-vs-false-negative-asymmetry property is load-bearing. Disqualifying limitations: every serving node must absorb 100 % of write traffic (no partitioning without rebuilding a distributed database), individual deletion is impossible (expiry forces full rebuild), and feature vocabularies must be bounded and enumerable. Introduces the sketching feature store as a third substrate adjacent to online and offline feature stores. -
patterns/trimmed-automaton-predicate-filter — Datadog's Husky-columnar-predicate-pruning pattern; lists Bloom filters as the probabilistic-filter sibling of trimmed automata for skipping SST blocks in indexed log queries. Same no-false-negatives contract, different input shape (regex / FSA vs hashed equality).
-
patterns/two-stage-evaluation — General two-stage filter-then-verify pattern; Bloom filters named as the canonical stage-1 fast-negative for the stage-2 slower authoritative check.
-
patterns/shadow-migration — Bloom-filter-as-migration- parity check: both old and new systems generate a Bloom filter of their output set; parity check verifies both sides produced the same membership before cutover.
-
concepts/fragment-pruning — Columnar storage: per- column Bloom filters let the query engine skip row groups that can't contain any matching predicate value.