
PATTERN

Precompute then API-serve

Compute predictions in a scheduled batch job. Store them in a low-latency KV store. Expose a thin API that does lookup + request-specific composition. This is the default ML-serving shape when the input space is bounded or hot-skewed and predictions don't need to reflect per-request state.

Problem

You want to serve ML predictions with:

  • Low latency — single-digit ms p50.
  • High availability at global scale — operationally simple, not dependent on a brittle online inference fleet.
  • Manageable cost — don't pay inference compute on every request.

But predictions are expensive to compute and you don't need real-time freshness.

Solution

    schedule         precomputed results         thin API svc
       │                                         (lookup + per-
       ▼                                          request compose)
    ┌───────────────┐       write       ┌───────────────┐
    │  Scheduled    │──────────────────▶│  KV store     │
    │  batch job    │                   │  (cache tier) │
    │  (parallel    │                   └───────┬───────┘
    │   compute)    │                           │ read
    └───────────────┘                           ▼
                                        ┌───────────────┐
                                        │  API service  │──▶ client
                                        │  (thin,       │    (JSON)
                                        │   stateless)  │
                                        └───────────────┘

Three moving pieces:

  1. Scheduled batch job runs on a fixed cadence (typically daily). It computes predictions in parallel (foreach at Netflix) and writes the full result set to a KV store; see the batch-job sketch after this list.
  2. KV store holds the full prediction space. Backing technology is irrelevant — Netflix uses internal caching infra, but the same pattern works on ElastiCache or DynamoDB.
  3. Thin API service looks up values in the KV, does per-request composition (aggregation, filtering, formatting), and returns JSON. No ML compute in the request path.
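
A minimal sketch of piece 1, assuming Metaflow's foreach for the parallel fan-out and DynamoDB as the KV store; the flow, table, and placeholder values are illustrative, not Netflix's internal setup.

```python
# Sketch of the scheduled batch job: fan out over the bounded input space with
# foreach, compute each prediction in parallel, and bulk-write the results to
# a KV store. DynamoDB and all names here are illustrative assumptions.
import json

import boto3
from metaflow import FlowSpec, schedule, step


@schedule(daily=True)  # takes effect once the flow is deployed to a production scheduler
class PrecomputeFlow(FlowSpec):

    @step
    def start(self):
        # Enumerate the bounded input space, e.g. one unit per title or segment.
        self.segments = ["segment-a", "segment-b"]
        self.next(self.compute, foreach="segments")

    @step
    def compute(self):
        # self.input is one segment; the expensive model call lives here.
        self.segment_id = self.input
        self.prediction = {"day_1": 0.0, "day_28": 0.0}  # placeholder scores
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect the parallel results and write the full prediction space.
        table = boto3.resource("dynamodb").Table("predictions")
        with table.batch_writer() as writer:
            for inp in inputs:
                writer.put_item(Item={
                    "segment_id": inp.segment_id,
                    # Stored as a JSON blob so the API can decode it cheaply.
                    "prediction": json.dumps(inp.prediction),
                })
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    PrecomputeFlow()
```

The @schedule decorator only has an effect once the flow is deployed to a production orchestrator (Argo Workflows or AWS Step Functions); run locally it behaves as an ordinary Metaflow flow. A run that fails partway leaves the previous values in place for keys it never reached, which is the stale-data failure mode discussed below.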

The API service doesn't just echo a cached value. It handles the request-specific logic that wasn't worth precomputing — which parameters to aggregate, which dimension to slice on, which format to emit.
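
As a rough sketch of that thin layer, here is what lookup plus composition can look like with FastAPI in front of the same hypothetical DynamoDB table; the route, query parameter, and filtering logic are assumptions for illustration.

```python
# Sketch of the thin, stateless API service: KV lookup + per-request
# composition, with no ML compute in the request path.
# FastAPI, the table name, and the response shape are assumptions.
import json

import boto3
from fastapi import FastAPI, HTTPException

app = FastAPI()
table = boto3.resource("dynamodb").Table("predictions")


@app.get("/metrics/{segment_id}")
def get_metrics(segment_id: str, min_score: float = 0.0):
    # 1. Lookup: a single low-latency read of the precomputed record.
    item = table.get_item(Key={"segment_id": segment_id}).get("Item")
    if item is None:
        raise HTTPException(status_code=404, detail="no precomputed result")

    prediction = json.loads(item["prediction"])

    # 2. Per-request composition: the cheap request-specific logic that was
    #    not worth precomputing (here, filtering and reshaping the payload).
    scores = {name: score for name, score in prediction.items() if score >= min_score}
    return {"segment_id": segment_id, "scores": scores}
```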

Canonical example — content-performance visualisation

Netflix's instance (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix):

  • Daily Metaflow job computes aggregate content-performance metrics in parallel via foreach.
  • Writes results via metaflow.Cache to an online KV store.
  • A Streamlit app hosts visualisation + interactivity.
  • On any user interaction, Streamlit sends a message to a simple metaflow.Hosting service that looks up values in the cache, performs request-specific computation, and returns JSON.

The Netflix MLP team calls this an "officially supported pattern" for such applications.
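
To make the interaction loop concrete, here is a minimal sketch of the Streamlit side calling a lookup service over plain HTTP on each widget change; the service URL and response shape are assumptions, not the internal metaflow.Hosting interface.

```python
# Sketch of the visualisation front end: every widget interaction reruns the
# script, which issues one cheap, cache-backed HTTP call and renders the JSON.
# The service URL and response shape are illustrative assumptions.
import pandas as pd
import requests
import streamlit as st

SERVICE_URL = "http://lookup-service.internal"  # hypothetical endpoint

segment = st.selectbox("Segment", ["segment-a", "segment-b"])
min_score = st.slider("Minimum score", 0.0, 1.0, 0.0)

resp = requests.get(
    f"{SERVICE_URL}/metrics/{segment}",
    params={"min_score": min_score},
    timeout=2,
)
resp.raise_for_status()

scores = resp.json()["scores"]
st.bar_chart(pd.Series(scores, name="score"))
```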

Trade-offs

Axis                    Precompute-then-API-serve        Real-time serving
Latency                 single-digit ms (cache lookup)   tens to hundreds of ms (inference)
Freshness               cadence-bound (hours, days)      per-request
Failure domain          cache availability               model fleet availability
Cost                    batch compute + cache storage    per-request compute
Operational complexity  scheduled job + cache            online fleet + autoscaler

Use real-time (Metaflow Hosting) when per-request state matters; use precompute for the rest.

Failure modes

  • Stale results between runs — shorten cadence or run multiple staggered jobs.
  • Missing keys — fall through to a real-time path or return a sentinel (see the sketch after this list).
  • Cache failure — everything stops. Multi-region replication of the cache is required for global HA.
  • Compute-job failures — stale data served until next run. SLOs on the job's success rate matter.
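
For the missing-key case, a sketch of the fallback logic, assuming the same hypothetical table; realtime_score is a stand-in for whatever online inference path exists, and the sentinel is the cheaper alternative when none does.

```python
# Sketch of missing-key handling: serve the precomputed value when it exists,
# otherwise either fall through to a (slow) real-time path or return a sentinel.
# The table name and realtime_score() are illustrative assumptions.
import json

import boto3

table = boto3.resource("dynamodb").Table("predictions")
SENTINEL = {"status": "unavailable"}  # explicit "no data yet" marker for callers


def lookup_prediction(segment_id: str, allow_realtime: bool = False) -> dict:
    item = table.get_item(Key={"segment_id": segment_id}).get("Item")
    if item is not None:
        return json.loads(item["prediction"])
    if allow_realtime:
        # Pay the inference cost only for the rare cold key.
        return realtime_score(segment_id)
    return SENTINEL


def realtime_score(segment_id: str) -> dict:
    # Placeholder for an online inference call (e.g. a hosted model endpoint).
    raise NotImplementedError
```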

Seen in

  • sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix (content-performance visualisation)
