# Sketching feature store
## Definition
A sketching feature store is a machine-learning feature store whose backing storage is a probabilistic data structure (most commonly a Bloom filter) that lives in the serving process's memory rather than in an external key-value database. Features are encoded as composite keys (e.g. `f"{user_id}^{article_id}"`) and inserted into the sketch as set-membership entries. At query time the serving code enumerates the known feature value space, probes the sketch for each candidate, and returns the set of features for which the sketch answers "possibly present."
The substrate trades exact membership answers for sublinear space: a ~30× memory reduction at essentially indistinguishable model accuracy for a click-prediction workload, per Zalando's benchmark (Source: sources/2021-10-04-zalando-space-efficient-machine-learning-feature-stores-using-probabilistic-data-structures).
## Mechanism
```python
from typing import Set

class FeatureStore:
    # `store` is any sketch exposing add() and might_contain(),
    # e.g. a Bloom filter.
    def __init__(self, store: BloomFilter):
        self.store = store
        # Auxiliary set: every feature value ever seen, needed to
        # enumerate candidates at read time.
        self.possible_articles: Set[int] = set()

    def add(self, user_id: int, article_ids: Set[int]) -> None:
        for article_id in article_ids:
            self.possible_articles.add(article_id)
            composite_key = f'{user_id}^{article_id}'
            self.store.add(composite_key)

    def retrieve_articles(self, user_id: int) -> Set[int]:
        # Probe-then-filter: reconstruct the user's feature set by
        # testing every candidate article against the sketch.
        ret = set()
        for article_id in self.possible_articles:
            composite_key = f'{user_id}^{article_id}'
            if self.store.might_contain(composite_key):
                ret.add(article_id)
        return ret
```
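The `BloomFilter` dependency is left abstract above. A toy sketch of one (a simple SHA-256-derived double-hashing scheme, assumed here for illustration; not the benchmark's implementation) is enough to exercise the mechanism end to end:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: num_hashes SHA-256-derived positions in an m-bit array."""
    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f'{i}:{key}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Feature writes for user 42, then probe-then-filter retrieval over the
# known article vocabulary.
bf = BloomFilter()
possible_articles = {7, 13, 99}
for article_id in (7, 13):
    bf.add(f'42^{article_id}')

retrieved = {a for a in possible_articles if bf.might_contain(f'42^{a}')}
# retrieved is always a superset of {7, 13}: inserted keys can never
# probe False (no false negatives); 99 appears only on a false positive.
```

`retrieve_articles` above performs exactly this comprehension, one probe per vocabulary entry.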
Three load-bearing elements:
- Composite key encoding. The sketch is a set of (entity, feature-value) pairs, not a map from entity to feature vector. Any higher-arity feature (e.g. "bought X together with Y") encodes as `f"{user_id}^{X}^{Y}"` and costs one sketch entry.
- Auxiliary `possible_features` set. Reconstruction requires iterating all possible values of the feature axis (all article IDs, all categories, etc.). This set lives alongside the sketch in the same process; its size is independent of the number of users but proportional to the feature vocabulary.
- Probe-then-filter retrieval. The user's feature vector is reconstructed at read time by enumerating `possible_features` and emitting each one for which the sketch answers "possibly present." False positives promote features to 1 that should be 0; false negatives never occur.
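The higher-arity encoding in the first bullet can be made concrete. The canonical ordering of the pair is an assumption added here (the source does not specify one) so that (X, Y) and (Y, X) map to the same sketch entry:

```python
def pair_key(user_id: int, x: int, y: int) -> str:
    # Sort the pair so "bought X with Y" and "bought Y with X"
    # cost a single sketch entry rather than two.
    lo, hi = sorted((x, y))
    return f'{user_id}^{lo}^{hi}'
```

The key is then inserted and probed exactly like the unary `f"{user_id}^{article_id}"` case.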
## Three substrates of ML feature storage
| Substrate | Latency | Freshness | Space | Query flexibility | When to use |
|---|---|---|---|---|---|
| External KV store (online feature store) | 2–10 ms + tail | Seconds | Linear (all users × all features) | Full — any query, any update | Default for production feature stores; freshness matters |
| Offline store (S3/parquet) | Seconds–minutes | Hourly–daily | Linear, but cheap $/byte | Full — scans, filters, joins | Training, batch inference, debugging, audit |
| Sketching feature store (this page) | Nanoseconds–microseconds (in-process) | Low-frequency batch only | Sublinear (~0.3–3% of KV) | Membership only | Per-request many-lookups patterns where freshness isn't load-bearing |
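The "Sublinear" cell follows from standard Bloom filter sizing. A sketch using the textbook formulas, bits m = -n·ln(p)/(ln 2)² and hash count k = (m/n)·ln 2 (classical bounds, not figures measured in the benchmark):

```python
import math

def bloom_sizing(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Return (bytes, num_hashes) for n_items at a target false-positive rate."""
    bits = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
    num_hashes = max(1, round(bits / n_items * math.log(2)))
    return bits // 8, num_hashes

# Sizing the benchmark's 1.762 B feature values at a 1% false-positive
# rate needs roughly 2 GB under these formulas; the reported 470 MB and
# 40 MB footprints therefore imply tolerating a higher false-positive rate.
size_bytes, k = bloom_sizing(1_762_000_000, 0.01)
```

The formula makes the trade-off explicit: space is linear in the number of inserted entries but only logarithmic in the tolerated error, which is where the sublinearity relative to an exact KV store comes from.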
## Benchmarked trade-off shape
Zalando's click-prediction benchmark (5.7 M examples, 2.8 M with historical data, 1.762 B feature values post-extraction) produced three variants fed into a logistic-regression click classifier:
- No history (request features only): AUC ≈ 0.70.
- Uncompressed history (conventional KV feature store): AUC ≈ 0.80, ~15 GB state.
- Bloom-filter-backed sketch: AUC ≈ 0.7997 at 470 MB (3% of KV); AUC ≈ 0.79 at 40 MB (0.3% of KV, retaining 90% of the AUC uplift).
Throughput for all three variants: 20–22 k predictions/sec/core; the `possible_articles` iteration is not a measurable overhead at this vocabulary size. (Source: sources/2021-10-04-zalando-space-efficient-machine-learning-feature-stores-using-probabilistic-data-structures.)
## Why the noise is harmless
A Bloom filter false positive flips a binary feature to 1 when it should have been 0. Unlike mislabelled training data, this appears to the model as noise in the input features, a well-studied regime that most ML models, including the logistic-regression classifier in the benchmark, absorb gracefully. The concepts/false-positive-vs-false-negative-asymmetry is load-bearing: Bloom filters have no false negatives, so features that were present stay 1, and the classifier never loses a positive signal.
## Where the substrate breaks
- Unbounded feature vocabularies. If `possible_features` is not enumerable (e.g. bag-of-words on free text), reconstruction is impossible: you cannot probe all possible bigrams. Applicability is scoped to feature spaces with a known, bounded vocabulary.
- High update frequency. Because the full state must live in every serving node's RAM, every write must be applied on every node (see concepts/feature-store-freshness).
- Individual deletion. Sketches do not support deletion by construction; expiry requires full rebuild. "Remove events older than 30 days" forces low-frequency batch cadence.
- Complex queries. "Give me all events from day X" requires metadata that defeats the space win. The sketch only answers set-membership.
## Adjacent but distinct
- patterns/bloom-filter-membership-test-before-storage-fetch — the pre-filter pattern, where the Bloom filter fronts an authoritative store to prune 404 lookups. The sketching feature store is the degenerate case of that pattern where the filter is the store.
- concepts/online-vs-offline-feature-store — the mainstream two-substrate architecture this post proposes a third substrate for.
## Seen in
- sources/2021-10-04-zalando-space-efficient-machine-learning-feature-stores-using-probabilistic-data-structures — canonical benchmark establishing the ~30× memory reduction at near-lossless AUC.
## Related
- concepts/feature-store-freshness — the axis that disqualifies this substrate for high-frequency update workloads.
- concepts/bloom-filter — the canonical sketch used.
- concepts/probabilistic-data-structure — the broader family (HyperLogLog, Count-Min, Quotient Filters — all theoretically applicable, unbenchmarked in the source).
- concepts/online-vs-offline-feature-store — the conventional two-substrate architecture.
- concepts/false-positive-vs-false-negative-asymmetry — why the noise is tolerable.
- patterns/probabilistic-feature-store-over-kv — the named substitution pattern.
- companies/zalando