
Precomputed predictions API

A precomputed predictions API is the ML-serving pattern where a scheduled batch job fills a low-latency key-value store with all (or most) predictions ahead of time, and a thin API service reads the KV at request time. It trades freshness for the lowest possible latency and operationally simple high availability at global scale (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).

When it's the right choice

Netflix's MLP team names the criterion directly:

"Not all API-based deployments require real-time evaluation, which we cover in the section below. We have a number of business-critical applications where some or all predictions can be precomputed, guaranteeing the lowest possible latency and operationally simple high availability at the global scale."

Use it when:

  • Predictions don't need to reflect per-request state (e.g. a score attached to a (title, region) pair).
  • The input space is small enough to enumerate — or the interesting subset is.
  • Freshness requirements are hours or a day, not seconds.

Don't use it when:

  • Input space is unbounded or per-user / per-session.
  • Freshness needs are real-time (see the real-time counterpart, Metaflow Hosting).

Shape

  scheduled       low-latency KV      thin API svc
    job        ────────────────►       (lookup +
  (compute)                             aggregate +
                                        return JSON)

The scheduled job writes results in parallel (Netflix uses Metaflow's foreach for the fan-out). At Netflix the KV backend is internal caching infrastructure; the generic pattern works identically on ElastiCache or DynamoDB.
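The precompute side can be sketched in plain Python. This is a hypothetical stand-in, not Netflix's code: `score`, `precompute`, and the dict-as-KV are illustrative names, and a thread pool stands in for the parallel fan-out that Metaflow's foreach provides in their setup.

```python
from concurrent.futures import ThreadPoolExecutor

def score(key):
    """Stand-in for the real model: score one (title, region) pair."""
    title, region = key
    return f"{title}:{region}", round(0.01 * len(title) + 0.1 * len(region), 2)

def precompute(keys, kv):
    # Fan out over the enumerable input space (Metaflow's foreach plays
    # this role at Netflix), then write every result into the KV store.
    with ThreadPoolExecutor() as pool:
        for composite_key, value in pool.map(score, keys):
            kv[composite_key] = value
    return kv

kv_store = {}  # stand-in for ElastiCache / DynamoDB / internal cache infra
precompute([("stranger-things", "US"), ("dark", "EU")], kv_store)
```

The essential property is that the key space is fully enumerated up front; the request path never computes a prediction.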

The API service doesn't just echo the cached value: it may aggregate across multiple keys, filter by request parameters, and return a JSON blob. Precomputation pushes the heavy model computation out of the request path; the API handles only the request-specific composition.
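A minimal sketch of that thin read path, assuming the composite `title:region` keys from above (the `serve` function and payload shape are illustrative, not a documented API):

```python
import json

def serve(kv, titles, region):
    # Lookup: fetch precomputed scores for the requested titles in one region.
    found = {t: kv[f"{t}:{region}"] for t in titles if f"{t}:{region}" in kv}
    # Aggregate + compose: request-specific work only, no model evaluation.
    payload = {
        "region": region,
        "scores": found,
        "mean": sum(found.values()) / len(found) if found else None,
    }
    return json.dumps(payload)

kv = {"dark:EU": 0.2, "lupin:EU": 0.4}
response = serve(kv, ["dark", "lupin", "missing-title"], "EU")
```

Everything request-dependent (which titles, which region, the aggregate) happens here; everything model-dependent happened in the batch job.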

Example: content-performance visualisation

Netflix's canonical case: a daily Metaflow job computes aggregate performance metrics in parallel, writes them via metaflow.Cache to an online KV store, and a metaflow.Hosting service looks up the cached values and returns JSON to a Streamlit dashboard where decision-makers dynamically change visualisation parameters.

Failure modes

  • Stale predictions during the window between scheduled runs — mitigated by shortening the cadence or running hot + cold jobs.
  • Cache-backing-store failure — global availability depends on cache availability; multi-region replication is typically required for "operationally simple high availability at global scale."
  • Uncovered keys — when a request hits a key the precompute missed. Either return a sentinel, fall through to an on-demand compute path (see concepts/on-demand-feature-compute), or fail closed.
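The three uncovered-key strategies from the last bullet can be expressed as one lookup function. A hypothetical sketch (`lookup` and its parameters are illustrative names, not from the source):

```python
def lookup(kv, key, fallback=None, fail_closed=False):
    # 1. Fast path: the precompute covered this key.
    if key in kv:
        return kv[key]
    # 2. Optional on-demand compute path for uncovered keys
    #    (the concepts/on-demand-feature-compute route).
    if fallback is not None:
        value = fallback(key)
        kv[key] = value  # backfill so the next request is a hit
        return value
    # 3. Fail closed, or return a sentinel the caller must handle.
    if fail_closed:
        raise KeyError(f"no precomputed prediction for {key!r}")
    return None  # sentinel
```

Which branch is right depends on whether a missing prediction is tolerable (sentinel), computable within the latency budget (fallback), or a correctness hazard (fail closed).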
