
Zalando — Space-efficient machine-learning feature stores using probabilistic data structures: a benchmark

Summary

Zalando (2021-10-04) publishes a benchmark arguing that a large class of machine-learning feature stores — the external key-value stores (Redis-class) that enrich recommender-system requests with per-user historical features — can be replaced by an in-process feature store backed by a Bloom filter (or another sketching data structure) at roughly a 30× memory reduction for the same predictive accuracy. The win comes from trading exact recall of historical user-item interactions for probabilistic membership queries with tunable false-positive rate, which the downstream model absorbs as harmless input noise.

Concretely, on a real-life click-prediction dataset (5.7 M examples, 2.8 M with historical data, 1.762 B data points post-extraction), the team compared three variants: no history (request features only; AUC ~0.70), uncompressed history (conventional KV-style feature store; AUC ~0.80, ~15 GB state), and compressed history (bloom-filter-backed feature store; AUC ~0.7997 at 470 MB — 3% of the uncompressed size). Throughput was indistinguishable across variants (20–22 k predictions/sec/core). At a coarser compression setting, 90% of the AUC uplift from history was retained at 40 MB — 0.3% of the uncompressed footprint.

The post is balanced — it names three severe limitations that disqualify the substitution for many real feature stores: (1) callers must enumerate all possible feature values to reconstruct a user's interaction set (e.g. all known article IDs), which is prohibitive for unbounded feature spaces like bag-of-words; (2) updates are fundamentally hard because every node's in-process state must absorb 100% of event traffic (no partition-then-fan-out), and deletion of individual records is impossible — expiry requires full rebuild from source; (3) complex queries like "remove all events from day X" are infeasible without invasive metadata that defeats the space win. The conclusion is scoped: use the sketch-based feature store when the alternatives (external KV store, many lookups per request) are prohibitive; otherwise keep the conventional store.

Key takeaways

  1. The costs of an external feature store that the substitution prices out. An external KV store adds 2–10 ms plus tail variance per request (network hop), requires hosting, maintenance, and backfill complexity, is an extra dev dependency that forces pre-population for tests, and collapses under many-lookups-per-request patterns like "retrieve features for 1,000 candidate products in one scoring call". Per-request composite-key lookups ("how many times were products X and Y bought together?") amplify the cost further. (Source: verbatim from the five-bullet "challenges" list.)
  2. The substitution: an in-process Bloom filter indexed by composite keys. The benchmarked store inserts f"{user_id}^{article_id}" composite keys into a single Bloom filter per node. At query time, the caller iterates over the stored self.possible_articles set, calls store.might_contain(composite_key) for each one, and returns the surviving articles as the user's interaction set (a hedged reconstruction follows this list). "This may appear to be a very expensive thing to do, but in practice it is very cheap in relation to the total processing. In my simple benchmark, the difference was undetectable." Can be optimised with binary search or important-feature subsetting. (Source: the 11-line Python FeatureStore class sketch.)
  3. Quality↔size is a tunable dial via the Bloom filter false-positive rate. Bloom filters answer "is this key present?" with no false negatives — when they say absent, they are always correct — but with a tunable probability of false positives when they say present. A false positive sets a historical feature to 1 when it should have stayed 0, adding noise to the model's input, which the classifier absorbs as signal degradation rather than outright misclassification. Smaller state = higher false-positive rate = more noise = lower AUC (the standard sizing formula is worked through after this list). (Source: "at some probability, we will mistakenly set the feature value to 1, when in fact it should have been 0 (i.e. false positive). This adds noise to our model's input.")
  4. Benchmarked result: 30× memory reduction at effectively lossless AUC. At AUC ≈ 0.7997 (essentially matching the 15 GB uncompressed variant's 0.80), the bloom-filter-backed store used 470 MB (3% of the uncompressed footprint). The state can be pushed down to 40 MB (0.3% of uncompressed) while still retaining 90% of the AUC uplift history provides. Throughput across all three variants was 20–22 k predictions/sec/core on a 2018 MacBook — the overhead of iterating possible_articles was not measurable. (Source: the scatter plot + verbatim numerical summary.)
  5. Freshness is the fundamental blocker — not CPU or memory. A feature store's freshness (how quickly a just-occurred event appears in the store) is often the load-bearing product property, because recent events are high-signal for recommendations. Bloom filters do support incremental appends, but because the full state must live in every serving node's RAM, every write must be applied on every node — each node must handle 100% of write traffic. For real event streams (views, clicks at scale) this is not feasible. "Theoretically bloom filters could be distributed so that each node only needs to process a shard of the traffic — but at this point one would have converted one's real-time transaction server into a distributed database." See concepts/feature-store-freshness. (Source: entire "difficult to update" section.)
  6. Deletion is impossible — expiry requires a full rebuild. Sketching data structures cannot delete individual records by construction: a key's bits can be shared with other keys, so clearing them corrupts other entries (see the toy demonstration after this list). To expire "events older than 30 days" the entire state must be regenerated from the source dataset, excluding the expired events. This restricts refresh to a low-frequency batch cadence. "There are some sketching-data-structure variants that allow some degree of expiry (see e.g. Age-partitioned Bloom Filters), but there are no mature implementations available."
  7. Known-feature-set requirement limits applicability. Because reconstruction iterates over "all possible feature values", the feature space must be bounded and enumerable at serving time. This is cheap for "did the user look at article X?" (where the set of articles is known) but collapses for unbounded spaces like bag-of-words on review text. "In some situations it may be prohibitively expensive (e.g. imagine reconstructing bag-of-word encoding of past user reviews)."
  8. Scoped conclusion: replace the external store only when its costs become prohibitive. The post explicitly frames this as not a general replacement — the conventional online feature store stays the right answer when freshness or complex-query support matters. Use sketching feature stores when an external network hop is unaffordable, or when many feature lookups must be performed per request (ranking 1,000 candidates, composite-key interaction features). (Source: the "Conclusion" section's two-bullet scoping.)
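
A minimal reconstruction of the bloom-filter-backed store from takeaway 2, for orientation only: the post's own sketch is 11 lines of Python with a generic bloom_filter import; here the bloom_filter2 package (pip install bloom-filter2) stands in for it, and everything except possible_articles, might_contain, and the composite-key format is an assumed name.

    from bloom_filter2 import BloomFilter

    class FeatureStore:
        """All user-article interactions live in one in-process Bloom filter."""

        def __init__(self, possible_articles, max_elements, error_rate):
            # error_rate is the tunable false-positive probability:
            # smaller values cost more memory but add less input noise.
            self.possible_articles = possible_articles
            self._filter = BloomFilter(max_elements=max_elements,
                                       error_rate=error_rate)

        def add_interaction(self, user_id, article_id):
            # Composite key, as in the post: f"{user_id}^{article_id}".
            self._filter.add(f"{user_id}^{article_id}")

        def might_contain(self, key):
            return key in self._filter

        def interactions_for(self, user_id):
            # Reconstruct the user's interaction set by probing every
            # possible article -- cheap in practice per the benchmark,
            # but the reason the feature space must be enumerable.
            return {a for a in self.possible_articles
                    if self.might_contain(f"{user_id}^{a}")}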
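
The quality↔size dial in takeaway 3 follows from the standard Bloom filter sizing formula m = -n · ln(p) / (ln 2)² bits for n keys at false-positive rate p (a textbook identity, not a number from the post). A quick calculation with an illustrative key count shows how loosening p shrinks the state:

    import math

    def bloom_filter_bits(n_keys: int, fp_rate: float) -> int:
        """Optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2."""
        return math.ceil(-n_keys * math.log(fp_rate) / math.log(2) ** 2)

    # Illustrative only (not the post's key count): 100 M composite keys.
    for p in (0.01, 0.05, 0.10):
        mb = bloom_filter_bits(100_000_000, p) / 8 / 1_000_000
        print(f"fp_rate={p:.2f} -> ~{mb:,.0f} MB")   # ~120, ~78, ~60 MB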
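
Why deletion is impossible (takeaway 6): distinct keys can map to overlapping bit positions, so clearing one key's bits can silently evict others. A hypothetical toy filter, deliberately tiny and not from the post, makes the sharing visible:

    import hashlib

    M, K = 32, 3  # toy filter: 32 bits, 3 hash functions

    def positions(key):
        digest = hashlib.sha256(key.encode()).digest()
        return {int.from_bytes(digest[4*i:4*i+4], "big") % M for i in range(K)}

    key_a = "user1^article7"
    # Find another composite key that happens to share a bit with key_a.
    key_b = next(k for k in (f"user2^article{i}" for i in range(10_000))
                 if positions(k) & positions(key_a))

    bits = positions(key_a) | positions(key_b)   # insert both keys
    bits -= positions(key_a)                     # try to "delete" key_a
    print(positions(key_b) <= bits)              # False: key_b was evicted too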

Extracted systems

  • No Zalando-internal system is named — the post describes a benchmark harness on a real click-prediction dataset. The Python code sample uses a generic bloom_filter import (the bloom_filter2-style API); the benchmark itself was JVM-based and more general (arbitrary categorical features, not just article lookups).

Operational numbers

  • Training dataset: 5.7 M examples; 2.8 M with historical data; 1.762 B data points post-feature-extraction.
  • Uncompressed feature-store footprint: ~15 GB.
  • Bloom-filter feature-store footprint, lossless AUC: 470 MB (3% of uncompressed); AUC ≈ 0.7997 vs 0.80 uncompressed.
  • Bloom-filter feature-store footprint, 90% of AUC uplift: 40 MB (0.3% of uncompressed); AUC ≈ 0.79.
  • AUC floor (no-history baseline): ≈ 0.70.
  • Throughput across all variants: 20–22 k predictions/sec/core on a 2018 MacBook (no measurable overhead from possible_articles iteration).
  • Network-hop cost avoided by the substitution: 2–10 ms per request plus high tail variance on the conventional KV-store path.
  • Implementation language: benchmark was JVM-based (more general than the Python sketch in the post, which is illustrative only).

Caveats

  • Benchmark, not production deployment. The post is a benchmark against Zalando's canonical click-prediction dataset; there is no claim that Zalando replaced a production feature store with this substitution. The post reads as an in-house capability memo establishing where the sketch-based alternative dominates.
  • 2021-era post, one-author framing. Written by one author (Masato Asahara, ML Engineer); no co-author review visible; single benchmark environment (one 2018 Mac); no production latency SLOs referenced. Treat the numbers as directional, not a rigorous production benchmark.
  • Feature space must be enumerable. The known-possible-features requirement silently restricts this pattern to a subset of feature stores. Unbounded feature spaces (free-text, bag-of-words on reviews) do not fit.
  • Freshness cliff is a hard wall, not a soft limit. The post is blunt: once you shard bloom filters to handle write load, you've rebuilt a distributed database from scratch. The substitution only wins when a low-frequency batch rebuild of the state is acceptable.
  • Alternate sketches named but not benchmarked. The post lists HyperLogLog, Count-Min Sketch, and Quotient Filters as theoretical candidates, but only bloom filters are benchmarked. Count-Min Sketch in particular would give count estimates per composite key instead of mere presence — richer signal, though unbenchmarked here (a hypothetical sketch follows this list).
  • No comparison to lightweight alternatives. No comparison to, say, a perfect minimal hash on a stable key set, or to column-oriented compression of a user-item matrix. The comparison is strictly bloom-filter-backed in-process vs external KV store.
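
On the Count-Min Sketch caveat above: a hypothetical minimal implementation (not from the post, which does not benchmark it) shows the richer per-key signal, counts rather than presence, at a similarly bounded memory cost:

    import hashlib

    WIDTH, DEPTH = 2**20, 4  # hypothetical sizing

    class CountMinSketch:
        def __init__(self):
            self.table = [[0] * WIDTH for _ in range(DEPTH)]

        def _columns(self, key):
            digest = hashlib.sha256(key.encode()).digest()
            return [int.from_bytes(digest[4*i:4*i+4], "big") % WIDTH
                    for i in range(DEPTH)]

        def add(self, key, count=1):
            for row, col in zip(self.table, self._columns(key)):
                row[col] += count

        def estimate(self, key):
            # Never undercounts; collisions can only inflate the estimate.
            return min(row[col]
                       for row, col in zip(self.table, self._columns(key)))

    cms = CountMinSketch()
    for _ in range(3):
        cms.add("user1^article7")          # "bought together 3 times"
    print(cms.estimate("user1^article7"))  # >= 3; typically exactly 3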
