
PATTERN

Precompute then API-serve

Compute predictions in a scheduled batch job. Store them in a low-latency KV store. Expose a thin API that does lookup + request-specific composition. This is the default ML-serving shape when the input space is bounded or hot-skewed and predictions don't need to reflect per-request state.

Problem

You want to serve ML predictions with:

  • Low latency — single-digit ms p50.
  • High availability at global scale — operationally simple, not dependent on a brittle online inference fleet.
  • Manageable cost — don't pay inference compute on every request.

But predictions are expensive to compute and you don't need real-time freshness.

Solution

    schedule         precomputed results         thin API svc
       │                                         (lookup + per-
       ▼                                          request compose)
    ┌───────────────┐       write       ┌───────────────┐
    │  Scheduled    │──────────────────▶│  KV store     │
    │  batch job    │                   │  (cache tier) │
    │  (parallel    │                   └───────┬───────┘
    │   compute)    │                           │ read
    └───────────────┘                           ▼
                                        ┌───────────────┐
                                        │  API service  │──▶ client
                                        │  (thin,       │    (JSON)
                                        │   stateless)  │
                                        └───────────────┘

Three moving pieces:

  1. Scheduled batch job runs on a fixed cadence (typically daily). It computes predictions in parallel (foreach at Netflix) and writes the full result set to a KV store; see the batch-job sketch after this list.
  2. KV store holds the full prediction space. Backing technology is irrelevant — Netflix uses internal caching infra, but the same pattern works on ElastiCache or DynamoDB.
  3. Thin API service looks up values in the KV, does per-request composition (aggregation, filtering, formatting), and returns JSON. No ML compute in the request path.
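
A minimal sketch of piece 1, assuming Metaflow's foreach for the parallel fan-out and DynamoDB as the KV store; the flow, table, and placeholder values are illustrative, not Netflix's internal setup.

```python
# Sketch of the scheduled batch job: fan out over the bounded input space with
# foreach, compute each prediction in parallel, and bulk-write the results to
# a KV store. DynamoDB and all names here are illustrative assumptions.
import json

import boto3
from metaflow import FlowSpec, schedule, step


@schedule(daily=True)  # takes effect once the flow is deployed to a production scheduler
class PrecomputeFlow(FlowSpec):

    @step
    def start(self):
        # Enumerate the bounded input space, e.g. one unit per title or segment.
        self.segments = ["segment-a", "segment-b"]
        self.next(self.compute, foreach="segments")

    @step
    def compute(self):
        # self.input is one segment; the expensive model call lives here.
        self.segment_id = self.input
        self.prediction = {"day_1": 0.0, "day_28": 0.0}  # placeholder scores
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect the parallel results and write the full prediction space.
        table = boto3.resource("dynamodb").Table("predictions")
        with table.batch_writer() as writer:
            for inp in inputs:
                writer.put_item(Item={
                    "segment_id": inp.segment_id,
                    # Stored as a JSON blob so the API can decode it cheaply.
                    "prediction": json.dumps(inp.prediction),
                })
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    PrecomputeFlow()
```

The @schedule decorator only has an effect once the flow is deployed to a production orchestrator (Argo Workflows or AWS Step Functions); run locally it behaves as an ordinary Metaflow flow. A run that fails partway leaves the previous values in place for keys it never reached, which is the stale-data failure mode discussed below.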

The API service doesn't just echo a cached value. It handles the request-specific logic that wasn't worth precomputing — which parameters to aggregate, which dimension to slice on, which format to emit.
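
As a rough sketch of that thin layer, here is what lookup plus composition can look like with FastAPI in front of the same hypothetical DynamoDB table; the route, query parameter, and filtering logic are assumptions for illustration.

```python
# Sketch of the thin, stateless API service: KV lookup + per-request
# composition, with no ML compute in the request path.
# FastAPI, the table name, and the response shape are assumptions.
import json

import boto3
from fastapi import FastAPI, HTTPException

app = FastAPI()
table = boto3.resource("dynamodb").Table("predictions")


@app.get("/metrics/{segment_id}")
def get_metrics(segment_id: str, min_score: float = 0.0):
    # 1. Lookup: a single low-latency read of the precomputed record.
    item = table.get_item(Key={"segment_id": segment_id}).get("Item")
    if item is None:
        raise HTTPException(status_code=404, detail="no precomputed result")

    prediction = json.loads(item["prediction"])

    # 2. Per-request composition: the cheap request-specific logic that was
    #    not worth precomputing (here, filtering and reshaping the payload).
    scores = {name: score for name, score in prediction.items() if score >= min_score}
    return {"segment_id": segment_id, "scores": scores}
```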

Canonical example — content-performance visualisation

Netflix's instance (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix):

  • Daily Metaflow job computes aggregate content-performance metrics in parallel via foreach.
  • Writes results via metaflow.Cache to an online KV store.
  • A Streamlit app hosts visualisation + interactivity.
  • On any user interaction, Streamlit sends a message to a simple metaflow.Hosting service that looks up values in the cache, performs request-specific computation, and returns JSON.

The Netflix MLP team calls this an "officially supported pattern" for such applications.
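
To make the interaction loop concrete, here is a minimal sketch of the Streamlit side calling a lookup service over plain HTTP on each widget change; the service URL and response shape are assumptions, not the internal metaflow.Hosting interface.

```python
# Sketch of the visualisation front end: every widget interaction reruns the
# script, which issues one cheap, cache-backed HTTP call and renders the JSON.
# The service URL and response shape are illustrative assumptions.
import pandas as pd
import requests
import streamlit as st

SERVICE_URL = "http://lookup-service.internal"  # hypothetical endpoint

segment = st.selectbox("Segment", ["segment-a", "segment-b"])
min_score = st.slider("Minimum score", 0.0, 1.0, 0.0)

resp = requests.get(
    f"{SERVICE_URL}/metrics/{segment}",
    params={"min_score": min_score},
    timeout=2,
)
resp.raise_for_status()

scores = resp.json()["scores"]
st.bar_chart(pd.Series(scores, name="score"))
```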

Trade-offs

Axis                    Precompute-then-API-serve        Real-time serving
Latency                 single-digit ms (cache lookup)   tens to hundreds of ms (inference)
Freshness               cadence-bound (hours, days)      per-request
Failure domain          cache availability               model fleet availability
Cost                    batch compute + cache storage    per-request compute
Operational complexity  scheduled job + cache            online fleet + autoscaler

Use real-time (Metaflow Hosting) when per-request state matters; use precompute for the rest.

Failure modes

  • Stale results between runs — shorten cadence or run multiple staggered jobs.
  • Missing keys — fall through to a real-time path or return a sentinel (see the sketch after this list).
  • Cache failure — everything stops. Multi-region replication of the cache is required for global HA.
  • Compute-job failures — stale data served until next run. SLOs on the job's success rate matter.
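
For the missing-key case, a sketch of the fallback logic, assuming the same hypothetical table; realtime_score is a stand-in for whatever online inference path exists, and the sentinel is the cheaper alternative when none does.

```python
# Sketch of missing-key handling: serve the precomputed value when it exists,
# otherwise either fall through to a (slow) real-time path or return a sentinel.
# The table name and realtime_score() are illustrative assumptions.
import json

import boto3

table = boto3.resource("dynamodb").Table("predictions")
SENTINEL = {"status": "unavailable"}  # explicit "no data yet" marker for callers


def lookup_prediction(segment_id: str, allow_realtime: bool = False) -> dict:
    item = table.get_item(Key={"segment_id": segment_id}).get("Item")
    if item is not None:
        return json.loads(item["prediction"])
    if allow_realtime:
        # Pay the inference cost only for the rare cold key.
        return realtime_score(segment_id)
    return SENTINEL


def realtime_score(segment_id: str) -> dict:
    # Placeholder for an online inference call (e.g. a hosted model endpoint).
    raise NotImplementedError
```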

Seen in

  • sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix (content-performance visualisation)
