Precomputed predictions API
A precomputed predictions API is the ML-serving pattern where a scheduled batch job fills a low-latency key-value store with all (or most) predictions ahead of time, and a thin API service reads the KV at request time. It trades freshness for the lowest possible latency and operationally simple high availability at global scale (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix).
When it's the right choice
Netflix's MLP team names the criterion directly:
"Not all API-based deployments require real-time evaluation, which we cover in the section below. We have a number of business-critical applications where some or all predictions can be precomputed, guaranteeing the lowest possible latency and operationally simple high availability at the global scale."
Use it when:
- Predictions don't need to reflect per-request state (e.g. a score attached to a (title, region) pair).
- The input space is small enough to enumerate — or the interesting subset is (see the back-of-envelope sketch after these lists).
- Freshness requirements are hours or a day, not seconds.
Don't use it when:
- Input space is unbounded or per-user / per-session.
- Freshness needs are real-time (see the real-time counterpart, Metaflow Hosting).
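The enumerability criterion is checkable up front: materialise the key space, or the interesting subset, and confirm it fits a KV store comfortably. A back-of-envelope sketch for the (title, region) example, with made-up counts:

```python
# Back-of-envelope enumerability check for the (title, region) example;
# both counts are assumptions, not Netflix figures.
from itertools import product

titles = range(20_000)                    # catalogue size (assumption)
regions = ["US", "GB", "BR", "JP", "IN"]  # serving regions (assumption)

keys = list(product(titles, regions))
print(f"{len(keys):,} rows")  # 100,000 rows: easily precomputed daily
```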
Shape
```
scheduled job            low-latency            thin API svc
(compute)     ─────────► KV store    ─────────► (lookup +
                                                 aggregate +
                                                 return JSON)
```
The scheduled job writes results in parallel (using foreach at Netflix). The KV backend at Netflix is their internal caching infra; the generic pattern works identically on ElastiCache or DynamoDB.
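A minimal sketch of the write side under those assumptions, with DynamoDB standing in for the KV backend; the flow name, table name, key layout, and score_partition are illustrative stand-ins, not Netflix's internal client:

```python
# Sketch of the scheduled write side. DynamoDB stands in for the KV
# backend; score_partition, the table name, and the key layout are
# illustrative, not Netflix's internal cache client.
from decimal import Decimal

import boto3
from metaflow import FlowSpec, schedule, step


def score_partition(region):
    # Stand-in for real model inference over one partition of the key space.
    for title_id in range(1000):
        yield title_id, Decimal("0.5")


@schedule(daily=True)
class PrecomputePredictionsFlow(FlowSpec):

    @step
    def start(self):
        # Enumerate the key space up front, then fan out in parallel.
        self.regions = ["US", "GB", "BR", "JP"]
        self.next(self.score, foreach="regions")

    @step
    def score(self):
        region = self.input
        table = boto3.resource("dynamodb").Table("precomputed-predictions")
        with table.batch_writer() as writer:  # batched writes per branch
            for title_id, prediction in score_partition(region):
                writer.put_item(
                    Item={"region": region, "title_id": title_id,
                          "prediction": prediction}
                )
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    PrecomputePredictionsFlow()
```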
The API service doesn't just echo the cached value — it may aggregate across multiple keys, filter by request parameters, and return a JSON blob. Caching pushes the heavy computation out of the request path; the API handles only the request-specific composition.
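That composition layer can be a few lines. A sketch of the read side, assuming FastAPI and the DynamoDB table from the write-side sketch above; the endpoint shape and names are hypothetical:

```python
# Sketch of the thin read side: one KV lookup, request-specific
# composition, JSON out. Table and key names match the write-side
# sketch above and are hypothetical.
import boto3
from boto3.dynamodb.conditions import Key
from fastapi import FastAPI

app = FastAPI()
table = boto3.resource("dynamodb").Table("precomputed-predictions")


@app.get("/predictions/{region}")
def region_top(region: str, top_n: int = 10):
    # Lookup: all precomputed rows for this region.
    rows = table.query(KeyConditionExpression=Key("region").eq(region))["Items"]
    # Request-specific composition only; no model runs here.
    rows.sort(key=lambda r: r["prediction"], reverse=True)
    return {"region": region, "top": rows[:top_n]}
```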
Example: content-performance visualisation
Netflix's canonical case: a daily Metaflow job computes aggregate performance metrics in parallel, writes them via metaflow.Cache to an online KV store, and a metaflow.Hosting service looks values up and returns JSON to a Streamlit dashboard where decision-makers dynamically change visualisation parameters.
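A sketch of the dashboard end of that chain, assuming the hypothetical JSON endpoint from the read-side sketch above; the service URL is a placeholder:

```python
# Streamlit sketch of the dashboard end of the chain. The service URL is
# a placeholder. The heavy computation already happened in the daily job,
# so changing a parameter only triggers a cheap KV-backed lookup.
import pandas as pd
import requests
import streamlit as st

region = st.selectbox("Region", ["US", "GB", "BR", "JP"])
top_n = st.slider("Titles to show", 5, 50, 10)

data = requests.get(
    f"https://predictions.example.internal/predictions/{region}",
    params={"top_n": top_n},
    timeout=2,
).json()

df = pd.DataFrame(data["top"])
df["prediction"] = df["prediction"].astype(float)
st.bar_chart(df.set_index("title_id")["prediction"])
```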
Failure modes
- Stale predictions during the window between scheduled runs — mitigated by shortening the cadence or running hot + cold jobs.
- Cache-backing-store failure — global availability depends on cache availability; multi-region replication is typically required for "operationally simple high availability at global scale."
- Uncovered keys — when a request hits a key the precompute missed. Either return a sentinel, fall through to an on-demand compute path (see concepts/on-demand-feature-compute), or fail closed; all three policies are sketched below.
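All three uncovered-key policies fit in one lookup helper. A sketch against the hypothetical table used in the earlier snippets; the fallback callable stands in for an on-demand compute path:

```python
# Sketch of the three uncovered-key policies against the table above.
# The fallback callable stands in for an on-demand compute path.
import boto3

table = boto3.resource("dynamodb").Table("precomputed-predictions")
SENTINEL = {"prediction": None, "status": "not-precomputed"}


def lookup(region, title_id, fallback=None, fail_closed=False):
    item = table.get_item(Key={"region": region, "title_id": title_id}).get("Item")
    if item is not None:
        return item
    if fallback is not None:
        # Fall through to on-demand compute (concepts/on-demand-feature-compute).
        return fallback(region, title_id)
    if fail_closed:
        raise KeyError(f"no precomputed prediction for {region}/{title_id}")
    return SENTINEL  # fail open with an explicit sentinel
```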