Netflix Model Serving Platform

Netflix Model Serving Platform is Netflix's centralized ML-model-serving infrastructure: a single platform through which all member-facing ML-powered experiences (title recommendations, fraud detection, commerce personalization, etc.) request model execution. As of 2025 it serves "hundreds of model types and versions, netting 1 million requests per second" across over 30 client services.

This is the parent system page. The 2026-05-01 post describes one layer of the stack — the routing layer. Future posts in the series will cover inference, feature fetching, and their interaction with routing.

Core design principles (2026-05-01 post)

  1. Model innovation independent of client apps. Clients integrate once per new use case; thereafter, model iterations (including intermediate A/B experiments) are largely invisible to the client. Model selection based on a user's A/B allocation, fact fetching, and logging all happen inside the platform.
  2. Decouple clients from model sharding. Models run on compute clusters (shards), each with its own VIP address. Traffic volume, SLAs, model architecture, and CPU/memory requirements all influence the model-to-cluster mapping; clients are shielded from the resulting VIP churn by the VIP decoupling contract.
  3. Flexible traffic routing rules. A/B experiments, gradual traffic shifts, and client overrides are all expressible as configuration, not code.

Model definition at Netflix

At Netflix, "a model encapsulates pre- and post-processing, feature computation logic, and an optional ML-trained component, all packaged in a standard format suitable for use across multiple contexts." The platform runs the end-to-end execution, which the post calls model serving to distinguish from pure inference:

  • Model inference: infer(features) -> score.
  • Model serving: a workflow whose inputs are standard request context (userId, country, device) plus domain context (e.g., titles to rank, or a payment transaction); during execution the workflow internally fetches facts from adjacent microservices and from Netflix's ML fact store (see the sketch after this list).
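
The contract difference can be made concrete with a small sketch. The TypeScript below is illustrative only: inferScore, FactStoreClient, and serveContinueWatching are hypothetical names, and the post does not publish the platform's actual interfaces. It simply shows that serving wraps fact fetching, inference, and post-processing behind one workflow.

  // Hypothetical sketch: "inference" is a pure function over features, while
  // "serving" is a workflow that fetches facts and post-processes around it.
  type Features = Record<string, number | string | boolean>;

  // Model inference: infer(features) -> score.
  declare function inferScore(features: Features): number;

  interface RequestContext {
    userId: string;
    country: string;
    device: string;
  }

  // Fact fetching from adjacent microservices / the ML fact store, stubbed out.
  interface FactStoreClient {
    fetchFacts(userId: string, keys: string[]): Promise<Features>;
  }

  // Model serving: request context + domain context in, ranked titleIds out.
  async function serveContinueWatching(
    ctx: RequestContext,
    candidateTitleIds: string[],   // domain context: the titles to rank
    facts: FactStoreClient,
  ): Promise<string[]> {
    // Fact fetching happens inside the platform, not in the client.
    const memberFacts = await facts.fetchFacts(ctx.userId, ["viewing_history", "device_profile"]);
    // Score each candidate with the trained component, then post-process (rank).
    const scored = candidateTitleIds.map((titleId) => ({
      titleId,
      score: inferScore({ ...memberFacts, titleId, country: ctx.country }),
    }));
    return scored.sort((a, b) => b.score - a.score).map((s) => s.titleId);
  }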

Objective abstraction

The platform contract is expressed as an Objective — a platform-defined enumeration naming a business use case. See concepts/objective-abstraction.

Examples named in the 2026-05-01 post:

  • ContinueWatchingRanking — input: userId, country, deviceId; output: ranked list of titleIds for the Continue Watching row.
  • Payment Fraud Detection — input: userId, country, payment transaction details; output: fraud probability.

Objectives (a) identify the use case, (b) decouple clients from concrete models, and (c) drive platform-side routing / model-selection decisions.
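
A minimal sketch of what calling the platform by Objective might look like. The enum members mirror the examples above, but the request shape, field names, and the serve function are assumptions, not the published API.

  // Hypothetical sketch of the Objective contract: the client names a business
  // use case; which model, version, and cluster actually run is a platform
  // decision hidden behind the Objective.
  enum Objective {
    ContinueWatchingRanking = "CONTINUE_WATCHING_RANKING",
    PaymentFraudDetection = "PAYMENT_FRAUD_DETECTION",
  }

  interface ServingRequest {
    objective: Objective;
    requestContext: { userId: string; country: string; deviceId?: string };
    domainContext: Record<string, unknown>; // titleIds to rank, payment transaction details, etc.
  }

  // The client never references a model name, model version, or VIP.
  declare function serve(request: ServingRequest): Promise<unknown>;

  // Usage: rank the Continue Watching row for a member.
  void serve({
    objective: Objective.ContinueWatchingRanking,
    requestContext: { userId: "member-123", country: "US", deviceId: "tv-01" },
    domainContext: { titleIds: ["t1", "t2", "t3"] },
  });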

Routing: Switchboard → Lightbulb evolution

The 2026-05-01 post is primarily about the routing layer. Two successive shapes are described:

  • Switchboard — a custom centralized gRPC proxy in the critical path of every inference request. It was the single point of contact for clients, handling Objective → model → VIP resolution, A/B test integration, shadow mode, canary, and rollback. It scaled to 1M req/sec but added a 10–20ms serialization tax and was a shared dependency whose failure could degrade many experiences.
  • Lightbulb + Envoy — splits the routing-metadata decision (Lightbulb, control plane) from the connection-routing decision (Envoy, data plane). This keeps the researcher-facing flexibility while removing the proxy from the payload path.

The pattern is canonicalised as patterns/separate-routing-from-model-selection.
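
A sketch of that split, under stated assumptions: the post does not publish the wire contract, so resolveRoute, the header names, and the plain HTTP call standing in for the real transport are all illustrative. The point is that only small routing metadata crosses the control-plane hop; the inference payload goes straight to the serving cluster.

  // Control plane (Lightbulb-style): a lightweight metadata query. No model
  // payload crosses this hop. Names and fields are hypothetical.
  interface RouteDecision {
    targetCluster: string;  // the shard/VIP chosen for this objective + member
    modelVersion: string;   // resolved from A/B allocation and routing rules
  }
  declare function resolveRoute(objective: string, userId: string): Promise<RouteDecision>;

  // Data plane (Envoy-style): the payload travels directly to the serving
  // cluster, routed on metadata, with no central proxy in the payload path.
  async function score(objective: string, userId: string, payload: unknown) {
    const route = await resolveRoute(objective, userId);
    return fetch("https://model-serving.internal/score", {
      method: "POST",
      headers: {
        "x-serving-cluster": route.targetCluster,   // hypothetical routing header
        "x-model-version": route.modelVersion,
        "content-type": "application/json",
      },
      body: JSON.stringify(payload),
    });
  }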

Configuration: Switchboard Rules via Gutenberg

Model researchers author traffic-routing rules as JavaScript configurations (Switchboard Rules), which compile to versioned JSON rule sets published via Netflix's Gutenberg dataset pub/sub system. Both the routing service (historically Switchboard; today Lightbulb) and the serving-cluster hosts subscribe to the same rule stream, so experiments can be rolled out and rolled back on their own release cycle, decoupled from code deploys.
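
The post describes this mechanism but not the rule schema, so the shape below is an assumption; it only illustrates how A/B cells, gradual traffic shifts, and a default model could be expressed as data rather than code.

  // Illustrative shape of one compiled rule set for a single Objective. Every
  // field name, model id, and cluster name here is hypothetical.
  const continueWatchingRules = {
    objective: "ContinueWatchingRanking",
    version: 42,                       // Gutenberg publishes versioned rule sets
    rules: [
      // A/B experiment: route members allocated to a given cell to the new model.
      { match: { abTest: "cw_ranker_v9", cell: 2 }, model: "cw-ranker-v9", cluster: "cw-shard-b" },
      // Gradual traffic shift: 5% of remaining traffic goes to a candidate model.
      { match: { trafficPercent: 5 }, model: "cw-ranker-v8-1", cluster: "cw-shard-a" },
      // Default: everyone else stays on the current production model.
      { match: {}, model: "cw-ranker-v8", cluster: "cw-shard-a" },
    ],
  };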

This is the patterns/config-separated-from-code-via-pubsub pattern; see also concepts/async-kafka-publication-for-telemetry for sibling pub/sub approaches at different altitudes.

What's not yet canonicalised (series to come)

The 2026-05-01 post explicitly names the following as topics for future posts:

  • Inference internals — what happens inside the serving cluster host after routing resolves the VIP.
  • Feature fetching — how the fact store + facts API + online feature sources compose with the serving workflow.
  • Interaction with routing — how routing, inference, and feature fetching combine end-to-end in the production critical path.

Relationship to other Netflix platforms

  • Data Gateway + KV DAL — sibling abstraction posture at the storage tier; both expose a platform-owned contract that shields clients from backend churn while centralizing operator-side flexibility.
  • Post-Training Framework — the training-side counterpart; post-training takes a thin-library-on-OSS posture, while routing takes an Envoy-as-data-plane posture. Both avoid reinventing the generic substrate.
  • MediaFM — a production consumer of the model-serving platform for multimodal media understanding.
  • Netflix Experimentation Platform (referenced, not yet ingested on the wiki) — queried per-request to resolve a userId's cell allocation so the Objective → model → cell rule chain can fire.
