Netflix Model Serving Platform¶
Netflix Model Serving Platform is Netflix's centralized ML-model-serving infrastructure: a single platform through which all member-facing ML-powered experiences (title recommendations, fraud detection, commerce personalization, etc.) request model execution. As of 2025 it serves "hundreds of model types and versions, netting 1 million requests per second" across over 30 client services.
This is the parent system page. The 2026-05-01 post describes one layer of the stack — the routing layer. Future posts in the series will cover inference, feature fetching, and their interaction with routing.
Core design principles (2026-05-01 post)¶
- Model innovation independent of client apps. Clients integrate once per new use case; thereafter, model iterations (including intermediate A/B experiments) are mostly opaque. Model selection for a user's A/B-allocation, fact fetching, and logging happen inside the platform.
- Decouple clients from model sharding. Models run on compute clusters (shards), each with its own VIP address. Factors like traffic, SLAs, model architecture, CPU/memory affect the model-to-cluster mapping; clients are shielded from the resulting VIP churn via the VIP decoupling contract.
- Flexible traffic routing rules. A/B experiments, gradual traffic shifts, and client overrides are all expressible as configuration, not code.
Model definition at Netflix¶
At Netflix, "a model encapsulates pre- and post-processing, feature computation logic, and an optional ML-trained component, all packaged in a standard format suitable for use across multiple contexts." The platform runs the end-to-end execution, which the post calls model serving to distinguish from pure inference:
- Model inference: `infer(features) -> score`.
- Model serving: a workflow whose inputs are standard request context (userId, country, device) plus domain context (titles to rank, a payment transaction); the workflow internally fetches facts from adjacent microservices and from Netflix's ML fact store during execution.
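The distinction can be sketched in code. This is a minimal, hypothetical illustration (all names, features, and weights are invented, not Netflix's API): inference is a stateless scoring call, while serving is a workflow that starts from request context and does its own fact fetching, pre-processing, and post-processing around that call.

```python
from dataclasses import dataclass

# Pure inference: a stateless scoring call (hypothetical signature).
def infer(features: dict) -> float:
    # Stand-in model: score is a weighted sum of two fabricated features.
    return 0.7 * features.get("watch_minutes", 0.0) + 0.3 * features.get("recency", 0.0)

# Serving starts from standard request context, not pre-computed features.
@dataclass
class RequestContext:
    user_id: str
    country: str
    device: str

def serve(ctx: RequestContext, candidate_titles: list[str]) -> list[str]:
    scored = []
    for title in candidate_titles:
        # In production these facts would be fetched from adjacent
        # microservices and the ML fact store; here they are fabricated
        # deterministically so the sketch is runnable.
        features = {"watch_minutes": float(len(title)), "recency": 1.0}
        scored.append((infer(features), title))
    # Post-processing: rank candidates by score, highest first.
    return [title for _, title in sorted(scored, reverse=True)]
```

The point of the sketch is that the client hands over only context (`RequestContext` plus candidates); everything between that and the ranked output is platform-owned.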
Objective abstraction¶
The platform contract is expressed as an Objective — a platform-defined enumeration naming a business use case. See concepts/objective-abstraction.
Examples named in the 2026-05-01 post:
- ContinueWatchingRanking — input: userId, country, deviceId; output: ranked list of titleIds for the Continue Watching row.
- Payment Fraud Detection — input: userId, country, payment transaction details; output: fraud probability.
Objectives (a) identify the use case, (b) decouple clients from concrete models, and (c) drive platform-side routing and model-selection decisions.
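A minimal sketch of the abstraction, assuming a simple enum-plus-lookup shape (the enum values follow the post's examples, but the model names and resolution table are invented): the client names only the Objective, and which concrete model runs is a platform-side decision.

```python
from enum import Enum

# Hypothetical Objective enum; member values echo the post's examples.
class Objective(Enum):
    CONTINUE_WATCHING_RANKING = "ContinueWatchingRanking"
    PAYMENT_FRAUD_DETECTION = "PaymentFraudDetection"

# Hypothetical platform-side table: swapping a model version here
# requires no change on the client side of the contract.
OBJECTIVE_TO_MODEL = {
    Objective.CONTINUE_WATCHING_RANKING: "cw_ranker_v12",
    Objective.PAYMENT_FRAUD_DETECTION: "fraud_gbm_v7",
}

def resolve_model(objective: Objective) -> str:
    return OBJECTIVE_TO_MODEL[objective]
```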
Routing: Switchboard → Lightbulb evolution¶
The 2026-05-01 post is primarily about the routing layer. Two successive shapes are described:
- Switchboard — custom centralized gRPC proxy in the critical path of every inference request. Single point of contact for clients, Objective → model → VIP resolution, A/B test integration, shadow mode, canary, rollback. Scaled to 1M req/sec but cost 10–20ms of serialization tax and was a shared dependency whose failure could degrade many experiences.
- Lightbulb + Envoy — split the routing-metadata decision (Lightbulb) from the connection-routing decision (Envoy, data plane). Keeps the researcher-facing flexibility; removes the proxy from the payload path.
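The split above can be sketched as two separate calls. This is a hedged illustration under invented names (the function names, VIPs, and cell-hashing scheme are assumptions, not Netflix's implementation): the control-plane call returns only routing metadata, and the payload then travels a direct data-plane hop rather than through a central proxy.

```python
# Control plane (Lightbulb role): Objective + A/B cell -> cluster VIP.
# No request payload flows through this call.
def lightbulb_resolve(objective: str, user_id: str) -> dict:
    # Toy deterministic cell allocation; production would query the
    # experimentation platform for the user's real cell.
    cell = "cell_b" if sum(map(ord, user_id)) % 2 else "cell_a"
    vip = {"cell_a": "vip-cw-ranker-a", "cell_b": "vip-cw-ranker-b"}[cell]
    return {"cell": cell, "vip": vip}

# Data plane (Envoy role): the payload goes straight to the resolved VIP,
# so the routing service never pays the serialization tax on the body.
def send_inference(vip: str, payload: bytes) -> str:
    return f"sent {len(payload)} bytes to {vip}"
```

In the Switchboard era both halves happened inside one proxy on the critical path; the Lightbulb + Envoy shape keeps the metadata decision centralized while moving the bytes off it.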
The pattern is canonicalised as patterns/separate-routing-from-model-selection.
Configuration: Switchboard Rules via Gutenberg¶
Model researchers author traffic-routing rules as JavaScript configurations (so-called Switchboard Rules), which compile to versioned JSON rule sets published via Netflix's Gutenberg dataset pub/sub system. Both the routing service (Switchboard historically; Lightbulb today) and the serving cluster hosts subscribe to the same rule stream, giving experiments a release cycle that is decoupled from code deploys.
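A sketch of what a compiled rule set and its consumer might look like. The JSON shape, field names, and weights here are hypothetical (the post does not publish the real schema); the sketch only shows the pattern of a versioned, weighted Objective-to-model rule that a subscriber evaluates per request.

```python
import json

# Hypothetical compiled rule set, as a Gutenberg subscriber might
# receive it: a versioned JSON document with weighted model routes
# for one Objective (here expressing a 90/10 canary split).
RULES_JSON = json.dumps({
    "version": 42,
    "objective": "ContinueWatchingRanking",
    "routes": [
        {"model": "cw_ranker_v12", "weight": 90},
        {"model": "cw_ranker_v13_canary", "weight": 10},
    ],
})

def pick_route(rules_json: str, bucket: int) -> str:
    """Pick a model for a request hashed into bucket 0-99."""
    rules = json.loads(rules_json)
    cumulative = 0
    for route in rules["routes"]:
        cumulative += route["weight"]
        if bucket < cumulative:
            return route["model"]
    return rules["routes"][-1]["model"]
```

Because the rule set is data rather than code, shifting the canary from 10% to 50% is a config publish, not a deploy, which is the point of the pattern named below.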
This is the patterns/config-separated-from-code-via-pubsub pattern; see also concepts/async-kafka-publication-for-telemetry for sibling pub/sub approaches at different altitudes.
What's not yet canonicalised (series to come)¶
The 2026-05-01 post explicitly names as future-post topics:
- Inference internals — what happens inside the serving cluster host after routing resolves the VIP.
- Feature fetching — how the fact store + facts API + online feature sources compose with the serving workflow.
- Interaction with routing — how routing, inference, and feature fetching combine end-to-end in the production critical path.
Relationship to other Netflix platforms¶
- Data Gateway + KV DAL — sibling abstraction posture at the storage tier; both expose a platform-owned contract that shields clients from backend churn while centralizing operator-side flexibility.
- Post-Training Framework — training-side counterpart; post-training adopts a thin-library-on-OSS posture, while routing adopts an Envoy-as-data-plane posture. Both avoid reinventing the generic substrate.
- MediaFM — a production consumer of the model-serving platform for multimodal media understanding.
- Netflix Experimentation Platform (referenced, not yet ingested on the wiki) — queried per-request to resolve a userId's cell allocation so the Objective → model → cell rule chain can fire.
Seen in¶
- sources/2026-05-01-netflix-state-of-routing-in-model-serving — first canonical wiki description of Netflix's model-serving platform. Articulates the 1M req/sec scale, the >30 client services, the Objective abstraction, the Switchboard → Lightbulb evolution, and the three platform principles.
Related¶
- systems/netflix-switchboard
- systems/netflix-lightbulb
- systems/netflix-gutenberg
- systems/envoy
- concepts/objective-abstraction
- concepts/model-serving-vs-model-inference
- concepts/vip-address-decoupling
- patterns/separate-routing-from-model-selection
- patterns/centralized-routing-proxy-for-ml-serving
- patterns/config-separated-from-code-via-pubsub
- companies/netflix