CONCEPT Cited by 1 source

On-demand feature compute

On-demand feature compute is the feature-store shape where feature values are not precomputed up-front; instead, the store responds to a request by resolving the feature dependency graph, enqueuing compute for the missing pieces, caching results as the compute lands, and letting the caller fetch asynchronously. It's the counterpoint to concepts/precomputed-predictions-api.

When to use it

Use on-demand when precompute is infeasible. Netflix's canonical case is Amber, their feature store for media assets:

"While Amber is a feature store, precomputing and storing all media features in advance would be infeasible. Instead, we compute and cache features in an on-demand basis." (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix)

Infeasibility at Netflix's scale comes from:

  • High feature cardinality — dozens or hundreds of ML-derived features per media asset.
  • Large catalog — precomputing every feature for every asset exceeds the storage / compute budget.
  • Changing model outputs — features must be recomputed whenever the model that produces them changes, so blanket precompute is invalidated constantly.

Mechanism

The typical shape (per Amber):

  1. Request arrives at the feature store for feature F on asset A.
  2. Store computes the feature dependency graph for F.
  3. For each missing node, the store sends an async request to a compute substrate (Metaflow Hosting at Netflix), which queues, schedules, and runs the feature-computation flow.
  4. Compute substrate caches the response.
  5. The caller fetches the result later, either by polling or by receiving a callback.
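
A minimal sketch of that flow, assuming an in-memory dependency map, cache, and queue. The class and method names are illustrative, not Amber's API; in Amber the queue and compute roles are played by Metaflow Hosting, and step 4 (the substrate writing results back into the cache) happens out of band:

```python
from dataclasses import dataclass, field


@dataclass
class OnDemandFeatureStore:
    """In-memory stand-in for the store; illustrative only, not Amber's API."""
    deps: dict                                   # feature -> upstream features (the dependency graph)
    cache: dict = field(default_factory=dict)    # (asset, feature) -> computed value (filled in step 4)
    queue: list = field(default_factory=list)    # async requests handed to the compute substrate

    def request(self, asset: str, feature: str):
        """Steps 1-3: resolve the dependency graph and enqueue every missing node."""
        missing = [
            node for node in self._topo_order(feature)
            if (asset, node) not in self.cache
        ]
        self.queue.extend((asset, node) for node in missing)   # step 3: async hand-off
        # Step 5: the caller gets the cached value now, or a 'pending' marker to retry later.
        return self.cache.get((asset, feature), ("pending", missing))

    def _topo_order(self, feature):
        """Dependencies first, so upstream features land in the cache before dependents."""
        order: list = []

        def visit(node):
            for dep in self.deps.get(node, []):
                visit(dep)
            if node not in order:
                order.append(node)

        visit(feature)
        return order
```

The first request for a cold (asset, feature) pair returns the pending marker and enqueues the missing nodes; once the substrate fills the cache, the same call returns the value directly.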

The key design choice: reuse an existing async-capable compute substrate instead of building a dedicated microservice. See patterns/async-queue-feature-on-demand.
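
From the caller's side, step 5 against such a substrate reduces to polling (or registering a callback). A minimal sketch, assuming a hypothetical substrate client whose submit() and result() methods are illustrative names, not a Metaflow Hosting API:

```python
import time


def fetch_feature(substrate, asset: str, feature: str, poll_interval_s: float = 30.0):
    """Step 5 from the caller's side: enqueue if needed, then poll until the result lands."""
    request_id = substrate.submit(asset, feature)   # no-op if already cached or in flight
    while True:
        result = substrate.result(request_id)       # None until the compute lands in the cache
        if result is not None:
            return result
        time.sleep(poll_interval_s)
```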

Latency vs. storage trade-off

On-demand accepts higher first-request latency (you pay for compute on the first call) in exchange for storage efficiency (you only store features that are actually requested). The cache that sits in front absorbs repeat requests.

In practice, feature requests concentrate on a small set of hot assets, so the cache hit rate is high after warm-up and the amortised latency stays close to a cache-lookup latency.
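
A back-of-envelope way to see it: amortised latency is roughly hit_rate × lookup + (1 − hit_rate) × compute, so it approaches a cache lookup only as the hit rate climbs. The numbers below are assumptions for illustration, not Amber measurements:

```python
lookup_ms = 10           # assumed cache-lookup latency
compute_ms = 60_000      # assumed first-request latency: queueing + running the feature flow

for hit_rate in (0.9, 0.99, 0.999):
    amortised_ms = hit_rate * lookup_ms + (1 - hit_rate) * compute_ms
    print(f"hit rate {hit_rate:.1%}: amortised ≈ {amortised_ms:,.1f} ms")
# hit rate 90.0%: amortised ≈ 6,009.0 ms
# hit rate 99.0%: amortised ≈ 609.9 ms
# hit rate 99.9%: amortised ≈ 70.0 ms
```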

Failure modes

  • Cold cache stampede on a new asset — many callers pile up on the first request. Mitigated by request collapsing at the feature-store layer (sketched after this list).
  • Dependency graph depth — deep graphs amplify first-request latency. Mitigated by eagerly caching common subgraphs.
  • Compute substrate backlog — async queue depth grows under load. Observability on the queue itself is required.
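
A sketch of the request-collapsing mitigation from the first bullet, assuming an asyncio layer in front of the store; CollapsingFeatureStore and compute_feature are illustrative names, with compute_feature standing in for the async hand-off to the compute substrate:

```python
import asyncio


class CollapsingFeatureStore:
    """Collapses concurrent requests for the same (asset, feature) into one compute."""

    def __init__(self, compute_feature):
        self._compute = compute_feature              # async fn: (asset, feature) -> value
        self._inflight: dict = {}                    # (asset, feature) -> in-flight task

    async def get(self, asset: str, feature: str):
        key = (asset, feature)
        task = self._inflight.get(key)
        if task is None:                             # first caller starts the compute ...
            task = asyncio.create_task(self._compute(asset, feature))
            self._inflight[key] = task
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        return await task                            # ... later callers await the same task
```

Concurrent callers asking for the same (asset, feature) all await the single in-flight task, so a new hot asset triggers one compute rather than one per caller.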

Seen in

  • sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix