PATTERN
Bloom-filter membership test before storage fetch¶
Pattern¶
Put a Bloom-filter membership test on the hot request path in front of a slower authoritative store. The filter answers "does this key exist?" in sub-microsecond time with the error asymmetry "false positives but never false negatives":

- Filter says no → the key definitely doesn't exist → return a negative response (404, cache miss, absent) immediately, with no storage touch.
- Filter says maybe → the key might exist → fall back to the authoritative store, which confirms presence (hit) or absence (false positive; still 404).
The pattern composes with any authoritative store whose "does this exist" check is expensive: object storage, a database, a remote service, or a slower on-disk data structure.
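The decision flow above can be sketched in a few lines. `BloomStandIn` is an illustrative stand-in (an exact set plus one injected false positive) so the control flow is runnable without a real Bloom implementation; none of these names are Vercel's API.

```python
# Stand-in for a Bloom filter: an exact key set plus injected false
# positives, so both branches of the pattern can be exercised.
class BloomStandIn:
    def __init__(self, keys, false_positives=()):
        self._maybe = set(keys) | set(false_positives)

    def might_contain(self, key):
        # A real filter does k hash-then-bit tests; "no" is authoritative.
        return key in self._maybe

def lookup(bloom, store, key):
    if not bloom.might_contain(key):
        return None          # definite miss: no storage touch
    return store.get(key)    # hit, or false positive (still a miss)
```

The authoritative store resolves the "maybe" branch either way, which is why a non-zero false-positive rate costs latency on rare requests but never correctness.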
Why it works¶
Three load-bearing properties of Bloom filters make this pattern win:

- No false negatives: every real key is guaranteed to pass the filter, so the filter is safe to use as an authoritative "no" for non-existent keys; you won't wrongly 404 a real page.
- Bounded false-positive rate: configurable via the parameters n/p/m/k. At p = 1e-7, only one in ten million non-existent queries falls through to the fallback, and the fallback's 404 is correct and cheap relative to the benefit.
- Compact, constant-time queries: ~10 bits per element at p = 1%; ~34 bits at p = 1e-7. Query cost is O(k) hash-then-bit-test work, typically 10–30 operations total, independent of set size.
See concepts/false-positive-vs-false-negative-asymmetry for the design-level reasoning behind choosing a substrate whose error mode matches the cost asymmetry.
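The sizing claims above follow from the standard Bloom formulas. This sketch (parameter names n, p, m, k as in the text, nothing Vercel-specific) makes the arithmetic checkable:

```python
import math

# Standard Bloom sizing:
#   m = -n * ln(p) / (ln 2)^2   total bits for n keys at FP rate p
#   k = (m / n) * ln 2          optimal number of hash functions
def bloom_params(n, p):
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

m, k = bloom_params(1_000_000, 0.01)
bits_per_elem = m / 1_000_000   # ≈ 9.6 bits per element at p = 1%, with k = 7
```

At p = 1e-7 the same formulas give roughly 33.5 bits per element with k = 23, which is where the "10–30 operations per query" range comes from.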
Canonical Vercel application¶
Vercel's 2026-04-21 blog post canonicalises the pattern at the edge-routing altitude:
"Before serving a request, the routing service consults this JSON file and checks if the path exists. If it doesn't, we immediately return a 404. This early check ensures we only fetch documents from storage when we're sure they exist. This prevents unnecessary requests to storage and protects against enumeration attacks, where attackers try to discover hidden files by guessing URLs."
(Source: sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters.)
The pre-existing substrate was a JSON-tree of every deployed path, parsed per routing-service instance on deployment update. The replacement is a Bloom filter at the same architectural position with the same contract — but 200× faster at p99 and 70–80 % smaller.
Canonical operational numbers¶
| Metric | Before (JSON tree) | After (Bloom filter) |
|---|---|---|
| Lookup p99 | ~100 ms | ~0.5 ms |
| Lookup p999 | ~250 ms | ~2.4 ms |
| File size (large tenant) | 1.5+ MB | 70–80 % smaller |
| Routing service heap / memory | baseline | −15 % |
| Aggregate TTFB p75–p99 | baseline | −10 % |
Architecture¶
Request arrives
│
▼
┌──────────────────┐
│ Bloom filter │ all bits == 1 ?
│ for deployment ├────────┬─── yes ─┐
│ path set │ │ │
└──────────────────┘ │ ▼
│ ┌─────────────┐
│ │ Storage │
│ │ fetch │
│ └─────┬───────┘
│ │
│ ┌────┴────┐
│ ▼ ▼
│ found? not found?
│ │ │
│ │ ▼
│ │ 404 (false positive case)
│ │
│ ▼
│ serve content
│
▼ any bit == 0 ?
404 (definite not-found; no storage touch)
Implementation choices¶
Where to put the filter¶
- In the request handler (Vercel's case): filter lives in-process on the routing node. Pro: sub-microsecond queries. Con: per-node memory overhead × number of deployments.
- In a side-car / local cache: reduces per-handler memory. Adds IPC cost.
- In a remote cache (Redis, Memcached): filter per tenant; handlers query over the network. Adds RTT; negates much of the pattern's value.
Filter lifecycle¶
- Build-time: the filter is rebuilt from the full key set at every deployment / data-change event. Simple, authoritative. Vercel's approach.
- Runtime-incremental: the filter is updated in place as keys are inserted. Insertion carries no false-negative risk (it is just filter.insert(), safe), but deletion requires a counting filter or a periodic rebuild.
- Periodic rebuild: combines the best of both; incremental updates plus a periodic rebuild to bound saturation and the FP rate.
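The hybrid lifecycle can be sketched as a wrapper that inserts incrementally and rebuilds once the live key set outgrows the design capacity. The `build_filter` callable and the `insert` API are illustrative assumptions, not any particular library:

```python
# Sketch of incremental inserts plus rebuild-on-saturation. A periodic
# timer-driven rebuild would hang off the same `build_filter` hook.
class RebuildingFilter:
    def __init__(self, build_filter, capacity):
        self.build_filter = build_filter  # key set -> fresh filter
        self.capacity = capacity          # design n; FP rate holds below it
        self.keys = set()                 # authoritative key set for rebuilds
        self.filter = build_filter(self.keys)

    def insert(self, key):
        self.keys.add(key)
        if len(self.keys) > self.capacity:
            # Saturated: resize and rebuild to restore the design FP rate.
            self.capacity *= 2
            self.filter = self.build_filter(self.keys)
        else:
            # Safe: inserting into a Bloom filter cannot cause false negatives.
            self.filter.insert(key)
```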
Language / substrate coordination¶
If the filter producer and consumer run in different runtimes (Vercel's build service vs routing service), both sides must implement byte-for-byte-compatible Bloom-filter logic: same hash function, same bit-order, same parameter interpretation. Vercel explicitly chose to hand-implement matching logic in each codebase rather than rely on a shared library:
"The main challenge was coordinating Bloom filter implementations across two services written in different languages: Build service: Generates Bloom filter from all deployment paths. Routing service: Queries Bloom filter for incoming requests. Both needed identical Bloom filter logic to ensure compatibility. We implemented matching Bloom filter algorithms in both codebases."
Encoding and storage¶
Bloom filters are bit arrays, and naive bit-array storage
is awkward to embed in text-based configuration files
(JSON, YAML). The pattern composes naturally with
patterns/jsonl-parameters-plus-base64-payload: line 1 is
the filter parameters as JSON (n, p, m, k, s);
line 2 is the Base64-encoded bit array. Further composes with
concepts/base64-as-byte-buffer-inline for zero-allocation
per-query decode.
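A minimal encoder/decoder for that two-line layout might look like this (the field names follow the text; treating `s` as a hash seed is an assumption):

```python
import base64
import json

# Line 1: filter parameters as JSON. Line 2: Base64 of the raw bit array.
def encode(params: dict, bits: bytes) -> str:
    return json.dumps(params) + "\n" + base64.b64encode(bits).decode("ascii")

def decode(blob: str):
    header, payload = blob.split("\n", 1)
    return json.loads(header), base64.b64decode(payload)
```

The consumer parses one small JSON line, then decodes the payload straight into a byte buffer; no structured parsing ever touches the bit array itself.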
Adjacent altitudes of the pattern¶
The Vercel case is the edge-routing altitude. The same pattern at different altitudes:
- LSM / columnar storage pruning: RocksDB, Apache Parquet, ClickHouse. Per-SSTable / per-row-group Bloom filter skips blocks that can't contain a predicate's target. See patterns/trimmed-automaton-predicate-filter and concepts/fragment-pruning.
- Browser safe-browsing: Chrome shipped a Bloom filter of known-malicious URLs; browser checks the filter locally before confirming with the server.
- CDN negative caching: cache the fact that a URL doesn't exist (or has been deleted) so subsequent requests don't hit origin.
- Cache-aside for NoSQL: filter in-process, fallback to Cassandra / HBase.
- Distributed deduplication: filter in streaming pipelines to skip already-processed messages.
- DNS negative caching: filter of known-nonexistent names; check before authoritative DNS.
Anti-patterns¶
- Using the filter as authoritative for positive answers. False positives are structural: a non-zero p means some fraction of filter-says-yes queries will be wrong; always fall back to the authoritative store for positives.
- Misusing a counting filter where a plain Bloom suffices. Counting filters are 4× the size; only worth it if deletion is required.
- Sharing one filter across unrelated key sets. Per-tenant / per-deployment filters have a smaller n per filter and so a smaller m per filter; one giant filter is wasteful and leaks FP rate across tenants.
- Not tracking FP rate in production. The design p is the expected FP rate at design capacity. If the set grows past n, the rate spikes; add a metric that samples known-negative queries and measures the actual FP rate.
- Rebuilding the filter on the request path. Build-time only. If the filter must change per request, use a different substrate.
- Falling back to storage on every single request. Defeats the pattern. Ensure the filter's FP rate is low enough that the fallback is rare (at p = 1e-7, fallback is roughly one per 10 M negative queries; at p = 1%, one per 100 negative queries, usually too high for hot-path use).
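The known-negative sampling metric mentioned above can be sketched as a periodic probe: query the filter with keys that cannot exist (here, a reserved nonce prefix) and record how often it answers "maybe". The probe-path prefix and probe count are illustrative assumptions:

```python
import random

# Probe the filter's membership test with guaranteed-absent keys and
# report the observed false-positive fraction for a production metric.
def measure_fp_rate(might_contain, probes=10_000, rng=random):
    hits = sum(
        might_contain(f"/__fp-probe/{rng.getrandbits(64):016x}")
        for _ in range(probes)
    )
    return hits / probes
```

Comparing this observed rate against the design p flags a filter whose key set has grown past its design capacity n before users feel the extra storage fallbacks.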
Seen in¶
- sources/2026-04-21-vercel-how-we-made-global-routing-faster-with-bloom-filters: canonical wiki introduction at the edge-routing altitude. Vercel's global routing service uses a Bloom filter to answer "does this path exist in this deployment?" before fetching from storage. Path-lookup p99 drops 200× (~100 ms → ~0.5 ms); TTFB p75–p99 improves 10% across all routed requests; routing-service heap drops 15%.
- patterns/trimmed-automaton-predicate-filter: Datadog's Husky columnar-log predicate pruning uses Bloom filters plus trimmed automata for regex predicates. Same no-FN guarantee; different query shape.
- patterns/two-stage-evaluation: general pattern of a cheap stage-1 filter plus an authoritative stage-2 check. Bloom-filter-then-verify is the canonical probabilistic instantiation.