PATTERN
Precompute then API-serve¶
Compute predictions in a scheduled batch job. Store them in a low-latency KV store. Expose a thin API that does lookup + request-specific composition. This is the default ML-serving shape when the input space is bounded or hot-skewed and predictions don't need to reflect per-request state.
Problem¶
You want to serve ML predictions with:
- Low latency — single-digit ms p50.
- High availability at global scale — operationally simple, not dependent on a brittle online inference fleet.
- Manageable cost — don't pay inference compute on every request.
But predictions are expensive to compute and you don't need real-time freshness.
Solution¶
      schedule
         │
         ▼
┌───────────────┐   write    ┌───────────────┐
│ Scheduled     │───────────▶│ KV store      │
│ batch job     │ precomputed│ (cache tier)  │
│ (parallel     │   results  └───────┬───────┘
│  compute)     │                    │ read
└───────────────┘                    ▼
                             ┌───────────────┐
                             │ API service   │──▶ client
                             │ (thin,        │    (JSON)
                             │  stateless:   │
                             │  lookup + per-│
                             │  request      │
                             │  compose)     │
                             └───────────────┘
Three moving pieces:
- Scheduled batch job runs on a fixed cadence (typically daily). Computes predictions in parallel (`foreach` at Netflix) and writes the full result set to a KV store (see the sketch after this list).
- KV store holds the full prediction space. Backing technology is interchangeable: Netflix uses internal caching infra, but the same pattern works on ElastiCache or DynamoDB.
- Thin API service looks up values in the KV, does per-request composition (aggregation, filtering, formatting), and returns JSON. No ML compute in the request path.
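A minimal sketch of the batch side, assuming open-source Metaflow with Redis standing in for Netflix's internal cache; the flow name, keyspace, and placeholder model are illustrative, not the actual Netflix job.

```python
from metaflow import FlowSpec, schedule, step


@schedule(daily=True)  # fixed cadence: recompute the full prediction space daily
class PrecomputePredictionsFlow(FlowSpec):

    @step
    def start(self):
        # Bounded input space: enumerate every key we will ever be asked for.
        self.entity_ids = list(range(10_000))
        self.next(self.predict, foreach="entity_ids")

    @step
    def predict(self):
        # One parallel task per entity; replace the placeholder with a real model call.
        self.entity_id = self.input
        self.prediction = {"score": self.entity_id / 10_000}
        self.next(self.join)

    @step
    def join(self, inputs):
        # Fan-in: write the complete result set to the KV store in one pass.
        import json

        import redis  # assumption: any low-latency KV works here

        kv = redis.Redis(host="cache.internal", port=6379)
        with kv.pipeline() as pipe:
            for inp in inputs:
                pipe.set(f"pred:{inp.entity_id}", json.dumps(inp.prediction))
            pipe.execute()
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    PrecomputePredictionsFlow()
```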
The API service doesn't just echo a cached value. It handles the request-specific logic that wasn't worth precomputing — which parameters to aggregate, which dimension to slice on, which format to emit.
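For illustration, a hypothetical thin service in FastAPI reading the keyspace written above; the endpoint shape, `fmt` parameter, and host names are assumptions, not Netflix's API.

```python
import json

import redis
from fastapi import FastAPI, HTTPException

app = FastAPI()
kv = redis.Redis(host="cache.internal", port=6379)


@app.get("/predictions/{entity_id}")
def get_prediction(entity_id: int, fmt: str = "full"):
    raw = kv.get(f"pred:{entity_id}")
    if raw is None:
        # No inline model fallback in this sketch; missing keys surface as 404s.
        raise HTTPException(status_code=404, detail="no precomputed prediction")
    pred = json.loads(raw)
    # Request-specific composition: slicing and formatting, never ML compute.
    if fmt == "score_only":
        return {"score": pred["score"]}
    return {"entity_id": entity_id, **pred}
```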
Canonical example — content-performance visualisation¶
Netflix's instance (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix):
- Daily Metaflow job computes aggregate content-performance metrics in parallel via `foreach`.
- Writes results via `metaflow.Cache` to an online KV store.
- A Streamlit app hosts visualisation + interactivity.
- On any user interaction, Streamlit sends a message to a simple `metaflow.Hosting` service that looks up values in the cache, performs request-specific computation, and returns JSON.
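The interaction loop, roughly; `metaflow.Cache` and `metaflow.Hosting` are Netflix-internal, so this sketch substitutes the hypothetical HTTP endpoint shown earlier.

```python
import requests
import streamlit as st

st.title("Content performance")  # illustrative, not Netflix's actual app
entity_id = st.number_input("Entity id", min_value=0, value=0, step=1)

if st.button("Fetch metrics"):
    # Each interaction is a lookup + compose round-trip, not an inference call.
    resp = requests.get(f"http://serving.internal/predictions/{int(entity_id)}")
    st.json(resp.json())
```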
The Netflix MLP team calls this an "officially supported pattern" for such applications.
Trade-offs¶
| Axis | Precompute-then-API-serve | Real-time serving |
|---|---|---|
| Latency | single-digit ms (cache lookup) | tens to hundreds of ms (inference) |
| Freshness | cadence-bound (hours, days) | per-request |
| Failure domain | cache availability | model fleet availability |
| Cost | batch compute + cache storage | per-request compute |
| Operational complexity | scheduled job + cache | online fleet + autoscaler |
Use real-time (Metaflow Hosting) when per-request state matters; use precompute for the rest.
Failure modes¶
- Stale results between runs — shorten cadence or run multiple staggered jobs.
- Missing keys — fall through to a real-time path or return a sentinel (see the sketch after this list).
- Cache failure — everything stops. Multi-region replication of the cache is required for global HA.
- Compute-job failures — stale data served until next run. SLOs on the job's success rate matter.
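A minimal sketch of the missing-key fall-through, under the same hypothetical keyspace; `realtime_scorer` is an illustrative stand-in for an online inference path.

```python
import json

SENTINEL = {"score": None, "stale": True}  # explicit "no prediction" marker


def lookup_or_fallback(kv, entity_id, realtime_scorer=None):
    raw = kv.get(f"pred:{entity_id}")
    if raw is not None:
        return json.loads(raw)
    if realtime_scorer is not None:
        # Pay per-request inference cost only on cache misses.
        return realtime_scorer(entity_id)
    return SENTINEL
```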
Seen in¶
- sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix — canonical Netflix instance (Metaflow Cache + Metaflow Hosting + Streamlit for content-performance viz).