

Sketching feature store

Definition

A sketching feature store is a machine-learning feature store whose backing storage is a probabilistic data structure, most commonly a Bloom filter, that lives in the serving process's memory rather than in an external key-value database. Features are encoded as composite keys (e.g. f"{user_id}^{article_id}") and inserted into the sketch as set-membership entries. At query time the serving code enumerates the known feature value space, probes the sketch for each candidate, and returns the set of features for which the sketch answers "possibly present".

The substrate trades exactness (false positives are possible) for sublinear space: a ~30× memory reduction at essentially indistinguishable model accuracy on a click-prediction workload, per Zalando's benchmark (Source: sources/2021-10-04-zalando-space-efficient-machine-learning-feature-stores-using-probabilistic-data-structures).
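
For intuition on where the ~30× comes from, the standard Bloom filter sizing formulas (textbook results, not from the Zalando source) relate item count n, false-positive rate p, bit count m, and hash count k; space scales with n · log(1/p) and is independent of key length, which is the structural win over a KV store holding full composite keys and values:

import math


def bloom_sizing(n: int, p: float) -> tuple:
    # m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k


m, k = bloom_sizing(n=100_000_000, p=0.01)
print(m / 8 / 2**20, k)  # ~114 MiB and 7 hashes for 100 M keys at 1% FP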

Mechanism

from __future__ import annotations

from typing import Set


class FeatureStore:
    def __init__(self, store: BloomFilter):
        # Any Bloom filter exposing add() and might_contain() works here
        # (a minimal implementation follows below).
        self.store = store
        # Enumerable feature vocabulary; needed to reconstruct vectors at read time.
        self.possible_articles: Set[int] = set()

    def add(self, user_id: int, article_ids: Set[int]) -> None:
        for article_id in article_ids:
            self.possible_articles.add(article_id)
            # Composite key: one sketch entry per (entity, feature-value) pair.
            composite_key = f'{user_id}^{article_id}'
            self.store.add(composite_key)

    def retrieve_articles(self, user_id: int) -> Set[int]:
        # Probe-then-filter: enumerate the vocabulary, keep every candidate
        # the sketch answers "possibly present" for.
        ret = set()
        for article_id in self.possible_articles:
            composite_key = f'{user_id}^{article_id}'
            if self.store.might_contain(composite_key):
                ret.add(article_id)
        return ret
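
The source post doesn't pin down a particular Bloom filter implementation; below is a minimal, dependency-free sketch of the add/might_contain interface the class above assumes, followed by a usage line. The default sizing and hash count are illustrative, not tuned:

import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 8 * 40 * 1024 * 1024, num_hashes: int = 4):
        # ~40 MB of bits by default; pick num_bits/num_hashes from the
        # sizing formulas above for a target false-positive rate.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Double hashing: derive all probe positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:16], 'big') | 1  # odd step spreads probes
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Wiring it up; the retrieved set can only ever be a superset of what was added.
store = FeatureStore(BloomFilter())
store.add(user_id=42, article_ids={1001, 1002})
assert store.retrieve_articles(42) >= {1001, 1002}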

Three load-bearing elements:

  1. Composite key encoding. The sketch is a set of (entity, feature-value) pairs, not a map from entity to feature vector. Any higher-arity feature (e.g. "bought X together with Y") encodes as f"{user_id}^{X}^{Y}" and costs one sketch entry (see the example after this list).
  2. Auxiliary possible_features set (possible_articles in the code above). Reconstruction requires iterating all possible values of the feature axis (all article IDs, all categories, etc.). This set lives alongside the sketch in the same process; its size is independent of the number of users but proportional to the feature vocabulary.
  3. Probe-then-filter retrieval. The user's feature vector is reconstructed at read time by enumerating possible_features and emitting each value for which the sketch answers "possibly present." False positives flip features to 1 that should be 0; false negatives never occur.
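
For instance, a pairwise helper (hypothetical; not in the source code) costs one entry per pair. The flip side, per element 2, is that probing pairs back requires enumerating candidate pairs, so the auxiliary set grows quadratically with the vocabulary:

def add_bought_together(store: BloomFilter, user_id: int, x: int, y: int) -> None:
    # Order-normalize so (X, Y) and (Y, X) map to the same composite key.
    lo, hi = sorted((x, y))
    store.add(f'{user_id}^{lo}^{hi}')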

Three substrates of ML feature storage

External KV store (online feature store)
  • Latency: 2–10 ms plus tail latency
  • Freshness: seconds
  • Space: linear (all users × all features)
  • Query flexibility: full (any query, any update)
  • When to use: the default for production feature stores where freshness matters

Offline store (S3/Parquet)
  • Latency: seconds to minutes
  • Freshness: hourly to daily
  • Space: linear, but cheap per byte
  • Query flexibility: full (scans, filters, joins)
  • When to use: training, batch inference, debugging, audit

Sketching feature store (this page)
  • Latency: nanoseconds to microseconds (in-process)
  • Freshness: low-frequency batch only
  • Space: sublinear (~3–0.3% of KV)
  • Query flexibility: set membership only
  • When to use: per-request many-lookup patterns where freshness isn't load-bearing

Benchmarked trade-off shape

Zalando's click-prediction benchmark (5.7 M examples, 2.8 M with historical data, 1.762 B feature values post-extraction) compared three feature variants, each feeding a logistic-regression click classifier:

  • No history (request features only): AUC ≈ 0.70.
  • Uncompressed history (conventional KV feature store): AUC ≈ 0.80, ~15 GB state.
  • Bloom-filter-backed sketch: AUC ≈ 0.7997 at 470 MB (3% of KV); AUC ≈ 0.79 at 40 MB (0.3% of KV, retaining 90% of the AUC uplift).

Throughput for all three variants: 20–22 k predictions/sec/core; iterating possible_articles is not a measurable overhead at this vocabulary size. (Source: sources/2021-10-04-zalando-space-efficient-machine-learning-feature-stores-using-probabilistic-data-structures.)

Why the noise is harmless

A Bloom-filter false positive flips a binary feature from 0 to 1 when it should have stayed 0. Unlike mislabelled training data, this appears to the model as noise in the input features, a well-studied regime that most ML models, including the benchmark's logistic-regression classifier, absorb gracefully. The asymmetry (concepts/false-positive-vs-false-negative-asymmetry) is load-bearing: Bloom filters have no false negatives, so features that were present stay 1, and the classifier never loses a positive signal.
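
A minimal demonstration of the asymmetry using the classes above, with the filter deliberately undersized to force false positives; the retrieved set can only ever be a superset of the true set:

import random


true_articles = set(random.sample(range(100_000), 50))
fs = FeatureStore(BloomFilter(num_bits=512, num_hashes=4))  # tiny on purpose
fs.add(user_id=7, article_ids=true_articles)
fs.possible_articles.update(range(100_000))  # full vocabulary to probe

retrieved = fs.retrieve_articles(7)
assert true_articles <= retrieved       # no false negatives, ever
extra = retrieved - true_articles       # false positives: 0 -> 1 noise only
# At this sizing roughly 1% of probes come back as spurious positives.
print(f'{len(extra)} spurious features from {len(fs.possible_articles)} probes')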

Where the substrate breaks

  1. Unbounded feature vocabularies. If possible_features is not enumerable (e.g. bag-of-words on free text), reconstruction is not possible — you cannot probe all possible bigrams. Applicability is scoped to feature spaces with a known, bounded vocabulary.
  2. High update frequency. Because the full state must live in every serving node's RAM, every write must be applied on every node — see concepts/feature-store-freshness.
  3. Individual deletion. Sketches do not support deletion by construction; expiry requires a full rebuild. "Remove events older than 30 days" forces a low-frequency batch cadence (see the sketch after this list).
  4. Complex queries. "Give me all events from day X" requires metadata that defeats the space win. The sketch only answers set-membership.
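
A plausible shape for that rebuild cadence (an operational assumption, not described in the source): reconstruct the filter offline from the retained event window, then swap the serving process's reference, so expiry happens by omission:

from typing import Iterable, Tuple


def rebuild_store(events: Iterable[Tuple[int, int]],
                  num_bits: int, num_hashes: int = 4) -> FeatureStore:
    # events: (user_id, article_id) pairs within the retention window,
    # e.g. the last 30 days; anything older is simply never re-inserted.
    fresh = FeatureStore(BloomFilter(num_bits=num_bits, num_hashes=num_hashes))
    for user_id, article_id in events:
        fresh.add(user_id, {article_id})
    return fresh


# store = rebuild_store(last_30_days_events, num_bits=8 * 40 * 1024 * 1024)
# A single reference swap means readers see old or new state, never a mix.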
