Netflix — State of Routing in Model Serving

Summary

The 2026-05-01 Netflix TechBlog post by Nipun Kumar, Rajat Shah, and Peter Chng is the first in a multi-part series on Netflix's centralized ML model-serving platform — the infrastructure behind personalized experiences (title recommendations, fraud detection, commerce) that serves hundreds of model types and versions at ~1 million requests per second as of 2025. The post zooms in on the routing layer: how the platform decides which model instance, on which cluster shard, should serve each user and use case, while preserving a simple abstraction for both clients and researchers. It describes an architectural evolution from Switchboard (a custom centralized gRPC proxy sitting in the critical path of every inference request) to Lightbulb (a metadata-only service that informs Envoy-based routing at the data plane). Three load-bearing abstractions are named: (1) the Objective enum — a business-use-case identifier (e.g. ContinueWatchingRanking) that decouples clients from concrete model IDs; (2) Switchboard Rules — a JavaScript-authored rule set (published as JSON via Gutenberg pub/sub) that binds Objectives to models, A/B test cells, canary splits, and shadow-mode routing without code deploys; (3) the separation of model selection from request routing — Lightbulb resolves request context → routingKey + ObjectiveConfig, and Envoy maps routingKey → cluster VIP at near-zero overhead. The arc is motivated by three scaling pains of Switchboard-in-the-path: a single point of failure shared by ≥30 client services, a 10–20ms serialization tax per request (plus tail-latency amplification), and tenant-isolation gaps that complicated separating real from artificial traffic for training-data logging.

Key takeaways

  1. Model serving ≠ model inference at Netflix. Netflix defines a "model" as an end-to-end workflow — pre-processing, feature computation, optional ML-trained component, post-processing — packaged in a standard format. "Model inference typically focuses only on an infer(features) -> score capability", whereas "we refer to the end-to-end execution of this workflow as model serving". This distinction is load-bearing for the routing design: the platform routes workflows (which fetch facts from other microservices mid-execution), not just stateless scoring functions. The concepts/model-serving-vs-model-inference distinction is canonicalized on the wiki by this post (Source).
  2. The Objective enum is the contract between clients and the platform. Every request must carry an Objective — a platform-defined enumeration like ContinueWatchingRanking or PaymentFraudDetection — and clients provide only standard request context (userId, country, device) plus domain context (titles to rank, transaction details). The platform owns model selection, experiment allocation, feature fetching, and VIP routing. Clients never learn which concrete model ran. This enables "almost all model iterations, including intermediate model A/B experiments, [to be] mostly opaque to the calling apps" — a one-time integration effort, thereafter decoupled (Source).
  3. Switchboard's first principles: three named properties. The post articulates three platform principles: (a) model innovation independent of client apps — clients don't redeploy when a model A/B kicks off; (b) decouple clients from model sharding — VIP addresses change frequently as models move between cluster shards based on traffic / SLA / architecture / CPU-memory fit, and clients never see this; (c) flexible traffic routing rules — support experiment-driven, gradual-shift, and client-override routing. These three properties together define what generic proxies (AWS API Gateway, stand-alone service-mesh proxies) could not provide (Source).
  4. Off-the-shelf proxies were explicitly rejected. Netflix states "standard out-of-the-box API Gateway solutions did not meet all our requirements" — specifically citing (i) required first-class integration with Netflix's Experimentation Platform, (ii) gRPC endpoint exposure, (iii) rich domain-context routing, and (iv) model-specific lifecycle stages (shadow mode, canaries, rollbacks). The build-vs-buy decision is explicit; Switchboard is the built answer (Source).
  5. Switchboard Rules decouple config from code via pub/sub. Researchers author rules as a JavaScript configuration (example: a function defineAB12345Rule() that binds Objectives.ContinueWatchingRanking to a cell-to-model map for A/B test #12345), which compiles to a JSON rule set published via Gutenberg — Netflix's pub/sub system that provides versioning, dynamic loading, and rollback (a sketch of the rule shape follows this list). Both Switchboard and the serving cluster hosts subscribe to the same rule stream. This independent release cycle for rules, separate from code deploys, is the load-bearing mechanism that lets A/B experiments kick off without platform releases (Source; also anchored by Netflix's Gutenberg post, not ingested on this wiki).
  6. Three Switchboard pains forced the architecture shift. At scale, Switchboard in the critical path became a liability on three named axes: (a) single point of failure — a Switchboard outage would "degrade or disable multiple ML-powered experiences" across ≥30 client services; (b) added latency — "Switchboard in the request path adds between 10–20ms of latency due to serialization-deserialization operations, depending on payload size", plus exposes requests to tail-latency amplification; (c) reduced client flexibility — Switchboard "obscures visibility into client request origins from the serving clusters", making real-vs-artificial traffic separation for training-data logging costly. Large payloads were a specific contributor to (b) because the proxy had to deserialize + re-serialize to make a routing decision (Source).
  7. Lightbulb splits routing metadata from routing itself. Lightbulb consumes minimal request context, resolves it to a routingKey (placed in HTTP headers) and an ObjectiveConfig (model ID + request-specific parameters, appended to the request body). Envoy — which Netflix already uses for all egress communication between apps — maps routingKey → cluster VIP at the data plane with minimal overhead. Netflix keeps Switchboard's abstraction (one integration point, Objective as the contract, context-aware routing, flexible experimentation config) while removing the in-path proxy. "Having a single service in the active request path introduces another failure mode and limits fallback flexibility. While routing rules change infrequently, maintaining consistency comes at the cost of increased availability risks." (Source).
  8. Three specific design choices distinguish Lightbulb from Switchboard. (a) Remove the routing service from the direct request path — solves the SPOF + latency problems by moving routing decisions to Envoy, an already-deployed data-plane component. (b) Separate model inputs from request metadata — large payloads were a latency contributor because Switchboard had to deserialize+re-serialize them to route; Lightbulb puts routing info in headers (small) and keeps model inputs in the body (passed through by Envoy, never re-parsed). (c) Provide better isolation for the routing layer — Lightbulb's per-request resolver work can be sharded / isolated per tenant without forcing all tenants through one proxy cluster (Source).
  9. Model-selection is still centralized; only routing is distributed. Lightbulb "resolves the request context to determine a routingKey configuration along with the ObjectiveConfig — this is where we place the model id along with other request-specific configurations required for model execution. This is done to separate the config resolution associated with the request from the placement and routing information needed to reach it on the inference cluster." The Objective → model → VIP chain is split: Lightbulb owns Objective→model (the research-facing decision), Envoy owns model→VIP (the platform-facing decision) (Source).
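
To make the rule shape concrete, here is a minimal TypeScript sketch of a Switchboard Rule in the spirit of the defineAB12345Rule() example from takeaway 5. The post shows only the rule's intent (an Objective bound to a cell-to-model map for A/B test #12345, authored in JavaScript and compiled to JSON for Gutenberg); the enum, interface, and field names below are illustrative assumptions, not Netflix's actual API.

```typescript
// Objectives are a platform-defined enum; clients address use cases, never models.
// (Assumed shape; the post names the enum values but not the type definition.)
enum Objectives {
  ContinueWatchingRanking = "ContinueWatchingRanking",
  PaymentFraudDetection = "PaymentFraudDetection",
}

// Assumed rule shape: one Objective, one A/B test, one cell-to-model binding.
interface SwitchboardRule {
  objective: Objectives;
  abTestId: number;
  // A/B cell number -> concrete model id served to users allocated to that cell.
  cellToModel: Record<number, string>;
}

// Researcher-authored rule for A/B test #12345, mirroring the example in the post:
// Cell 1 keeps the productized model; Cells 2 and 3 get candidate models.
function defineAB12345Rule(): SwitchboardRule {
  return {
    objective: Objectives.ContinueWatchingRanking,
    abTestId: 12345,
    cellToModel: {
      1: "netflix-continue-watching-model-default",
      2: "netflix-continue-watching-model-cell-2",
      3: "netflix-continue-watching-model-cell-3",
    },
  };
}

// The rule set compiles to JSON and is published via Gutenberg; both the routing
// layer and the serving-cluster hosts subscribe to the same versioned stream.
const ruleSetJson = JSON.stringify([defineAB12345Rule()], null, 2);
```

Because the rules travel as data rather than code, kicking off a new A/B experiment is a rule publish, not a platform deploy.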

Architectural numbers + operational notes (from source)

  • Scale (2025): "hundreds of model types and versions, netting 1 million requests per second."
  • Client surface: "over 30 service clients" integrated with Switchboard.
  • Switchboard-added latency: "between 10–20ms of latency due to serialization-deserialization operations, depending on payload size"; additional tail-latency amplification from in-path proxy exposure.
  • Example Objectives named: ContinueWatchingRanking (personalized Continue Watching row on the Netflix homepage; input = UserId + Country + DeviceId; output = ranked list of titleIds); Payment Fraud Detection (input = UserId + Country + transaction details; output = fraud probability).
  • Routing-rule example: A/B test #12345 on ContinueWatchingRanking — Cell 1 → netflix-continue-watching-model-default (productized), Cell 2 → netflix-continue-watching-model-cell-2 (candidate), Cell 3 → netflix-continue-watching-model-cell-3 (candidate).
  • Rule format: JavaScript → JSON (via Gutenberg).
  • Rule consumer discipline: Both Switchboard and serving cluster hosts subscribe to the same Switchboard Rules stream; race conditions are prevented by a synchronization flow the post shows only as a diagram (not textually reproduced here).
  • Lightbulb data-plane substrate: Envoy — "already used for all egress communication between apps at Netflix".
  • Ingredient split: routingKey → HTTP headers (Envoy routing input, small); ObjectiveConfig → request body (model-execution parameters, kept out of headers to avoid header bloat); sketched after this list.
  • Nothing disclosed about: concrete Envoy cluster/endpoint count, control-plane architecture for Envoy (is it xDS from Lightbulb, or a separate control plane?), Lightbulb replication / sharding, Switchboard cluster size / regional topology, Gutenberg-side publish cadence, concrete SLAs for serving clusters, feature-fetching data path from the fact store / adjacent microservices, shadow-mode traffic-duplication mechanism post-Lightbulb, specific percentages of traffic split in canary flows.
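
The "Ingredient split" bullet above is the core of the Lightbulb design, so here is a hedged TypeScript sketch of what that resolution step might look like. The post names the concepts (a small routingKey placed in HTTP headers for Envoy, an ObjectiveConfig carrying the model ID appended to the body) but no API; every type name and helper below is an assumption.

```typescript
// Minimal request context Lightbulb consumes (assumed field set, per the post's examples).
interface RequestContext {
  objective: string;  // e.g. "ContinueWatchingRanking"
  userId: string;
  country: string;
  deviceId: string;
}

// Model id plus request-specific execution parameters, appended to the request body.
interface ObjectiveConfig {
  modelId: string;
  parameters: Record<string, unknown>;
}

// routingKey is small and header-sized; the (potentially large) model inputs stay
// in the body, which Envoy forwards without ever parsing it.
interface ResolvedRouting {
  routingKey: string;
  objectiveConfig: ObjectiveConfig;
}

// Lightbulb-style resolution: metadata only. Envoy later maps routingKey -> cluster VIP
// from headers alone, so no in-path deserialization of the payload is needed.
function resolveRouting(ctx: RequestContext, cellAllocation: number): ResolvedRouting {
  const modelId = lookupModelForCell(ctx.objective, cellAllocation);
  return {
    routingKey: `${ctx.objective}/${modelId}`, // assumed key format
    objectiveConfig: { modelId, parameters: {} },
  };
}

// Stand-in for Switchboard Rules evaluation plus Experimentation Platform cell allocation.
function lookupModelForCell(objective: string, cell: number): string {
  const rules: Record<string, Record<number, string>> = {
    ContinueWatchingRanking: {
      1: "netflix-continue-watching-model-default",
      2: "netflix-continue-watching-model-cell-2",
      3: "netflix-continue-watching-model-cell-3",
    },
  };
  return rules[objective]?.[cell] ?? "netflix-continue-watching-model-default";
}
```

The design choice this illustrates: since routing reads only the headers, large model-input payloads no longer contribute to routing latency — the cost the 10–20ms Switchboard serialization number measured.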

Systems extracted

New wiki pages:

  • systems/netflix-model-serving-platform — Netflix's centralized ML model-serving platform; 1M req/sec, hundreds of model types/versions, >30 client service integrations. Owns model-to-cluster sharding, Objective contract, experimentation integration. First canonical wiki home.
  • systems/netflix-switchboard — the pre-2026 custom gRPC proxy acting as a mandatory interface for all client → model inference traffic. Flexible but in-path: 10–20ms serialization tax, SPOF for ≥30 clients, tenant-isolation gaps. Built to expose rich A/B-test-aware, Objective-addressable routing that off-the-shelf API gateways couldn't. Superseded (not replaced) by Lightbulb + Envoy. First canonical wiki home.
  • systems/netflix-lightbulb — the 2026 metadata-resolver replacing Switchboard's in-path role. Takes minimal request context → produces routingKey (headers for Envoy) + ObjectiveConfig (body for model execution). Out of the critical path for payload bytes; Envoy handles the actual connect. First canonical wiki home.
  • systems/netflix-gutenberg — Netflix's dataset pub/sub system. Substrate for Switchboard Rules (versioned, dynamically loadable, rollback-capable JSON rule set subscribed by both Switchboard/Lightbulb and serving hosts). Stub page; canonical source is Netflix's earlier How Netflix Microservices Tackle Dataset Pub-Sub post (not ingested).

Extended (cross-link added):

  • systems/envoy — Netflix's migration to Envoy-as-model-serving-data-plane adds a seventh role to Envoy's wiki taxonomy: ML-inference-routing data plane. Sibling to sidecar-mesh / edge-proxy / EDS-client / JWT-validator / egress-SSRF-guard / µVM-orchestrator-ingress. Envoy routes on headers set by Lightbulb (routingKey → cluster VIP); body payload passes through unparsed. Netflix's broader Envoy-for-all-egress stance (zero-configuration service mesh with on-demand cluster discovery) is the substrate this new role builds on.

Concepts extracted

New wiki pages:

  • concepts/objective-abstraction — a use-case-as-enum identifier (e.g. ContinueWatchingRanking) that decouples clients from concrete model IDs, lets A/B experiments swap models opaquely, and centralizes model selection in the serving platform. The contract between clients and Netflix's model-serving platform; load-bearing for client-app / model-iteration decoupling. A client-facing sketch follows this list.
  • concepts/model-serving-vs-model-inference — Netflix's explicit distinction: inference = infer(features) -> score; serving = full workflow including pre/post-processing, feature computation, and optional ML-trained component. The routing design operates at serving granularity, not inference granularity.
  • concepts/vip-address-decoupling — clients addressing a logical use case (Objective) rather than a concrete VIP. Model-to-cluster-shard mapping changes based on traffic / SLA / model architecture / resource fit, and those VIP changes are absorbed by the routing layer rather than propagated to clients.
  • concepts/serialization-tax-in-proxy-path — the latency cost of an in-path proxy that must deserialize + re-serialize each request to make a routing decision. Netflix quantifies this at 10–20ms for Switchboard, depending on payload size. Motivates the Lightbulb split: routing metadata small + in headers (Envoy passes through); payload body large + untouched.
  • concepts/tenant-isolation-in-routing-layer — running multiple use cases through a single routing cluster creates cross-tenant blast-radius and SLA-heterogeneity problems: a surge in one tenant's traffic can cascade errors to others; diverse latency requirements are hard to honour uniformly. Lightbulb's per-request resolver can be sharded per tenant; Envoy as data plane isolates at the connection/cluster level.
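
As a companion to the concepts/objective-abstraction entry above, a minimal TypeScript sketch of the client-facing contract: the client names an Objective and supplies standard plus domain context, and which model runs stays opaque. The post describes a gRPC interface; the request/response shapes and the serveObjective name below are illustrative stand-ins, not the real API.

```typescript
// Client-side view of a ContinueWatchingRanking request (assumed shape).
interface RankingRequest {
  objective: "ContinueWatchingRanking"; // the Objective value is the whole contract
  userId: string;                       // standard request context
  country: string;
  deviceId: string;
  candidateTitleIds: number[];          // domain context: the titles to rank
}

interface RankingResponse {
  rankedTitleIds: number[];             // output; the model that produced it is never exposed
}

// Placeholder for the platform's gRPC endpoint: model selection, A/B cell allocation,
// feature fetching, and VIP routing all happen behind this single call.
async function serveObjective(req: RankingRequest): Promise<RankingResponse> {
  // Returning the candidates unranked keeps the sketch self-contained and runnable.
  return { rankedTitleIds: [...req.candidateTitleIds] };
}
```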

Patterns extracted

New wiki pages:

  • patterns/separate-routing-from-model-selection — split the Objective→model decision (needs experimentation-platform context, rule evaluation, researcher-authored configs) from the model→VIP decision (needs low-overhead header-to-cluster mapping). Lightbulb owns the former; Envoy owns the latter. Keeps the researcher-facing flexibility while removing the in-path proxy's SPOF + serialization tax.
  • patterns/centralized-routing-proxy-for-ml-serving — the Switchboard shape: one mandatory gRPC proxy fronting all inference traffic, providing Objective→model resolution + A/B-test-aware routing + shadow-mode + canary. Canonical in the bootstrap phase of a multi-tenant model-serving platform when researcher-config velocity matters more than latency; the pattern's load-bearing weakness (SPOF + serialization tax at 1M req/sec scale) is documented alongside.
  • patterns/config-separated-from-code-via-pubsub — publish configuration (here: routing rules) as versioned JSON via a pub/sub substrate; both the routing service and the downstream consumers (serving hosts) subscribe. Independent release cycle for config decoupled from code deploys; versioning + dynamic loading + rollback inherited from the pub/sub layer. Netflix's Gutenberg is the canonical substrate. A subscriber sketch follows below.
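
A small subscriber-side sketch of the config-via-pub/sub pattern, under assumed names: the post credits Gutenberg with versioning, dynamic loading, and rollback but shows no client API, so the RuleSetVersion shape and onRuleSetPublished callback here are hypothetical.

```typescript
// Compiled rule set as delivered by the pub/sub layer (assumed shape).
interface RuleSetVersion {
  version: number;    // monotonically increasing publish version (assumption)
  rulesJson: string;  // the JavaScript-authored rules, compiled to JSON
}

// Both the routing layer and the serving hosts keep the latest rule set in memory
// and swap it atomically when a new version arrives: no code deploy, no restart.
let activeRules: RuleSetVersion = { version: 0, rulesJson: "[]" };

function onRuleSetPublished(next: RuleSetVersion): void {
  if (next.version <= activeRules.version) {
    return; // stale or replayed publish; version checks keep consumers in sync
  }
  activeRules = next; // dynamic load
}

// Rollback is inherited from the pub/sub layer (e.g. re-delivering a known-good
// version); how Gutenberg exposes this is not described in the post.
```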

Caveats + what's missing from the post

  • This is part 1 of a multi-part series. "In future posts in this series, we'll dive deeper into other aspects of our ML serving platform, including inference and feature fetching, and how they interact with the routing architecture described here." Inference internals + feature fetching + fact-store integration are explicitly deferred; the current post is routing-only.
  • Lightbulb control-plane details not disclosed. The post says "Envoy … can route requests to different clusters (VIPs) based on the configurable Routing Rules published from our control plane" but does not name the control plane (xDS-based? Gutenberg-based? Lightbulb-provided?) or describe propagation-latency from rule publish to Envoy-cluster-update.
  • No migration numbers. The shift from Switchboard to Lightbulb is presented as a design decision; no numbers on how much of the 1M req/sec traffic is on Lightbulb vs Switchboard today, nor how long the migration took.
  • No post-migration latency disclosure. The 10–20ms Switchboard number is named, but Lightbulb-plus-Envoy's p50/p99 latency is not quantified.
  • Shadow-mode mechanism post-split is unspecified. Switchboard-era shadow mode "[routed] production traffic to a new model version without affecting the user experience". Whether Lightbulb/Envoy shadow-mode works the same way (duplicate the request at Envoy; compare responses off-path) or uses a different mechanism is not described.
  • Race-condition flow for rule sync is referenced but not textually reproduced. "To prevent race conditions and ensure proper sync of the dynamic Switchboard Rules configuration, the following flow is considered" — followed by a diagram the post does not describe in prose. The discipline is named (both Switchboard and serving hosts subscribe to the same rule stream); the details are not disclosed.
  • Envoy is used but Envoy isn't the whole story. Envoy provides routing + cluster selection; it does not provide Objective-resolution or A/B-cell-allocation. Lightbulb is the missing piece. Netflix is explicit that generic proxies alone did not meet their requirements — this is not a "we used Envoy" post, it is a "we built Lightbulb to augment Envoy" post.
  • Experimentation Platform integration details not disclosed. The post references Netflix's Experimentation Platform and mentions "query the Experimentation Platform for the userId's cell allocation". Whether that query is synchronous per-request (latency-adding) or cached / pre-resolved is not stated.
  • Reliability / availability numbers for the new architecture not disclosed. The shift is motivated by reliability concerns but no post-migration availability / blast-radius reduction numbers are cited.
  • Not ingested on this wiki: Netflix's fact store post, Gutenberg pub/sub post, on-demand service-mesh cluster-discovery post, Experimentation Platform post. Cross-refs are hyperlinks only; no canonicalization of those systems' internals beyond what this post references.

Source
