Bloom filter¶
Definition¶
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It returns:
- Definitely not in the set (a true negative), or
- Possibly in the set (which may be a false positive).
It never returns definitely in the set and never returns a false negative. That asymmetry is the load-bearing property that makes Bloom filters useful as the pre-filter in front of a slower authoritative check.
Mechanism¶
A Bloom filter is specified by four parameters and one seed:
- `n` — expected number of elements inserted.
- `p` — desired false-positive rate (e.g. `1e-7` for one FP per 10 million negative queries).
- `m` — size of the underlying bit array (derived from `n` and `p`).
- `k` — number of independent hash functions (derived from `m` and `n`).
- `s` — seed / salt used to vary the first hash function (the remaining k−1 hashes are typically derived by adding offset constants or composing with counter values).
The formulas:

m = -(n × ln p) / (ln 2)²
k = (m / n) × ln 2
The bit array is initialised to all zeros.
Insert(x): compute k hashes of x, each modulo m, and set each corresponding bit in the array to 1.

Query(x): compute the same k hashes of x, modulo m, and check each bit:

- If any bit is 0 → x is definitely not in the set.
- If all bits are 1 → x is possibly in the set. Fall back to the authoritative check (e.g. a database lookup, a storage fetch, a slower data structure).
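The insert/query mechanics above fit in a few lines. The sketch below is a minimal illustration, not any production implementation; the double-hashing scheme over `blake2b` is an assumed choice (any pair of independent hashes works).

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: k bit positions per key via double hashing."""

    def __init__(self, m: int, k: int, seed: int = 0):
        self.m, self.k, self.seed = m, k, seed
        self.bits = bytearray((m + 7) // 8)  # bit array, initialised to all zeros

    def _indices(self, key: str):
        # Derive two 64-bit hashes, then index i = (h1 + i*h2) mod m.
        d = hashlib.blake2b(key.encode(), digest_size=16,
                            salt=self.seed.to_bytes(8, "little")).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:], "little") | 1  # force odd for better cycling
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def insert(self, key: str) -> None:
        for idx in self._indices(key):
            self.bits[idx // 8] |= 1 << (idx % 8)  # only ever sets bits

    def query(self, key: str) -> bool:
        # False → definitely absent; True → possibly present (verify elsewhere).
        return all(self.bits[idx // 8] >> (idx % 8) & 1
                   for idx in self._indices(key))
```

Because `insert` only sets bits and `query` reads the same positions, every inserted key queries `True`; only absent keys can (rarely) collide into a false positive.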
Why the asymmetry¶
A false negative would require a bit that was set during some
Insert(y) to have been cleared afterwards. Since insertion
only sets bits (never clears them), and the query uses the
same k hashes in the same positions, any member of the set
must still have all k bits set at query time. So false
negatives are impossible.
A false positive occurs when another element's k hashes
happen to collide with x's k hash positions, even though
x was never inserted. The probability of this grows with
the load factor of the bit array and falls exponentially
with k (up to the optimal k = (m/n) × ln 2).
Canonical Vercel disclosure¶
Vercel's 2026-04-21 blog post canonicalised the full parameter vector in a production file format:
{"version":"test","bloom":{"n":10,"p":1e-7,"m":336,"k":23,"s":0}}
"0kxC4anU4awVOYSs54vsAL7gBNGK/PrLjKrAJRil64mMxmiig1S+jqyC"
Line 1 is the JSON object with parameters (n, p, m,
k, s); line 2 is the Base64-encoded bit array. (Source:
sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters.)
Vercel uses this shape to pass the filter from their build
service (which generates it from the full path list) to their
routing service (which queries it on every request).
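A consumer of that two-line shape only needs to split, parse, and decode. The reader below is a sketch against the published example; the field names come from the snippet above, but the hash scheme needed to actually query the filter is not part of the disclosure, so this stops at decoding and sanity-checking the bit array.

```python
import base64
import json


def parse_bloom_jsonl(text: str):
    """Parse the two-line format: line 1 = JSON parameters, line 2 = Base64 bit array."""
    header_line, bits_line = text.strip().splitlines()
    params = json.loads(header_line)["bloom"]
    b64 = bits_line.strip().strip('"')      # the example shows the payload quoted
    b64 += "=" * (-len(b64) % 4)            # restore padding if it was stripped
    bits = base64.b64decode(b64)
    # The decoded array must hold exactly m bits.
    assert len(bits) * 8 == params["m"], "bit array length must match m"
    return params, bits
```

Fed the published example, this yields `m = 336`, `k = 23`, and a 42-byte array (42 × 8 = 336 bits, so the parameters and payload are self-consistent).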
Trade-offs vs alternatives¶
| Structure | Membership | Insertion | Deletion | False pos | False neg | Size |
|---|---|---|---|---|---|---|
| Hash set | exact | yes | yes | none | none | ~(n × avg_key_bytes) |
| B-tree / trie | exact | yes | yes | none | none | ~(n × avg_key_bytes) |
| Bloom filter | probabilistic | yes | no | tunable (p) | none | ~(-n × log₂p / ln 2) bits |
| Counting Bloom filter | probabilistic | yes | yes (via counters) | tunable | none | ~4× bits |
| Cuckoo filter | probabilistic | yes | yes | tunable | none | slightly smaller at same p |
| Xor filter | probabilistic | batch only | no | tunable | none | ~20 % smaller than Bloom |
Use a Bloom filter when:
- You can tolerate false positives (the fallback is cheap).
- You cannot tolerate false negatives (they're structural bugs downstream — missed 404s, SEO breakage, data-loss claims, etc.).
- The authoritative set is too large to fit in memory but membership queries are on the hot path.
- Insertions are streaming / incremental; deletions are rare or batch-rebuild is acceptable.
Use a Cuckoo filter instead when deletion is a first-class operation (e.g. maintaining an eviction set).
Use an xor filter instead when the set is immutable and you can batch-build (e.g. per-deployment artefacts like Vercel's path list; Vercel could migrate from Bloom to xor for another 20 % size reduction).
Applications¶
- Negative caches / 404 filters: return a fast negative without touching storage. Vercel's canonical case.
- Column-store / LSM pruning: Apache Parquet, RocksDB, BigQuery, ClickHouse all use Bloom filters per SST / per row group to skip blocks that can't contain a predicate's target. See concepts/fragment-pruning.
- Web caches: Google Chrome shipped a Bloom filter of known-malicious URLs so browsers could check locally before confirming with the server.
- CDN / DNS negative caches: before hitting authoritative DNS, check a filter of known-nonexistent names.
- Distributed systems anti-entropy: compact membership digests exchanged between replicas.
- Deduplication: see Google's BigTable, Cassandra, HBase for per-SSTable Bloom filters.
- Networking: routing-table approximations, content-addressable routing.
Size intuition¶
For n = 1 million paths and p = 1e-7:
m = -(10⁶ × ln(10⁻⁷)) / (ln 2)²
= -(10⁶ × -16.12) / 0.4805
≈ 33.5 million bits
≈ 4.2 MB (uncompressed)
k = (33.5e6 / 10⁶) × ln 2 ≈ 23 hash functions
For the same set as a JSON tree of paths (~20 chars average × 10⁶ paths): ~20 MB plus tree overhead. Bloom filter wins by 4–5× on size even before Base64 encoding overhead.
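The arithmetic above follows directly from the sizing formulas, and the same function reproduces the tiny `n = 10` example from the Vercel disclosure (`m = 336`, `k = 23`). A quick check:

```python
import math


def bloom_size(n: int, p: float) -> tuple[int, int]:
    """Optimal bit-array size m and hash count k for n elements at FP rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # m = -n ln p / (ln 2)^2
    k = round(m / n * math.log(2))                      # k = (m/n) ln 2
    return m, k


m, k = bloom_size(1_000_000, 1e-7)   # ≈ 33.5 million bits ≈ 4.2 MB, k = 23
```

Rounding conventions (ceil for `m`, nearest integer for `k`) are an assumption here, but they match both worked examples in this note.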
Pitfalls¶
- Over-sizing `m` wastes memory; under-sizing inflates `p` catastrophically (the FP rate grows faster than linearly past the design capacity).
- Too few hash functions → too few bits set per element, weaker discrimination, higher FP rate.
- Too many hash functions → the array saturates (most bits set to 1) faster than it should, and the collision rate spikes.
- Inserting beyond capacity `n`: the FP rate diverges from the design target; with the optimal `k`, even 1.5 × n of inserts can inflate `p` by orders of magnitude. If the set is incremental and growing, scalable Bloom filters (chains of filters with increasing `m`) preserve the target `p` as capacity grows.
- Not hashing path-normalised inputs: querying `/foo/` after inserting `/foo` produces a false negative from the caller's perspective (structural, not a true FP) if normalisation is inconsistent between insert and query.
- No intrinsic support for deletion: see counting Bloom / cuckoo filters.
- Cross-language compatibility: the hash choice, bit indexing, and seed semantics must all match byte-for-byte between producer and consumer. Vercel implemented matching Bloom-filter logic in both the build service and routing service's codebases rather than relying on a shared library (Source: sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters).
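The normalisation pitfall is cheap to guard against: canonicalise once, in one function, and call it on both the insert and the query path. The helper below is illustrative only (not Vercel's, and real routing needs more rules, e.g. percent-decoding and case policy); it just collapses the trailing-slash ambiguity named above.

```python
def canonical_path(path: str) -> str:
    """Illustrative canonicaliser shared by producer and consumer so that
    insert-time and query-time keys hash identically."""
    path = path.split("?", 1)[0].split("#", 1)[0]  # drop query string / fragment
    path = path.rstrip("/") or "/"                 # '/foo/' and '/foo' now agree
    return path
```

Whatever the rules are, the structural requirement is that the same function runs byte-for-byte identically on both sides of the filter, for the same reason the hash functions must.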
Seen in¶
- sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters — Canonical wiki introduction. Vercel's global routing service replaces a 1.5+ MB JSON path-tree lookup with a Bloom filter to answer "does this path exist in this deployment?" before serving or 404-ing. Path-lookup p99 drops 200× (~100 ms → ~0.5 ms); p999 drops 100× (~250 ms → ~2.4 ms); routing-service heap / memory drops 15 %; aggregate TTFB p75–p99 improves 10 %. The Bloom filter ships to the routing service as JSONL: line 1 = parameters (`n`, `p`, `m`, `k`, `s`); line 2 = Base64-encoded bit array. Canonical false-positive / false-negative asymmetry worked example: "If the Bloom filter says a path does not exist, we can safely return a 404; if it says a path might exist, we fall back to checking the build outputs. ... We can't afford false negatives (returning 404 for valid pages), and Bloom filters guarantee this won't happen."
- patterns/trimmed-automaton-predicate-filter — Datadog's Husky columnar-predicate-pruning pattern; lists Bloom filters as the probabilistic-filter sibling of trimmed automata for skipping SST blocks in indexed log queries. Same no-false-negatives contract, different input shape (regex / FSA vs hashed equality).
- patterns/two-stage-evaluation — General two-stage filter-then-verify pattern; Bloom filters named as the canonical stage-1 fast negative for the stage-2 slower authoritative check.
- patterns/shadow-migration — Bloom-filter-as-migration-parity check: both old and new systems generate a Bloom filter of their output set; a parity check verifies both sides produced the same membership before cutover.
- concepts/fragment-pruning — Columnar storage: per-column Bloom filters let the query engine skip row groups that can't contain any matching predicate value.