
CONCEPT Cited by 1 source

Model serving vs model inference

Netflix explicitly distinguishes model serving from model inference in its ML-platform architecture. The distinction is small in sentence form but load-bearing for the platform's routing, feature-fetching, and abstraction design.

  • Model inference: infer(features) -> score. Takes a feature vector; returns a prediction. Stateless; no side effects; no fact lookups; no pre- or post-processing.
  • Model serving: end-to-end workflow execution. A model at Netflix "encapsulates pre- and post-processing, feature computation logic, and an optional ML-trained component, all packaged in a standard format suitable for use across multiple contexts." Serving is the full execution of that workflow — it fetches facts from other microservices mid-execution, computes features, optionally runs the ML-trained step, and produces a business output.
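
A minimal sketch of the two shapes in Python. The class names, the fact_store client, and the ranking example below are illustrative assumptions, not Netflix's actual interfaces; only the infer(features) -> score signature comes from the post.

    from typing import Any, Protocol


    class InferenceModel(Protocol):
        """Model inference: stateless scoring of a pre-computed feature vector."""

        def infer(self, features: list[float]) -> float:
            ...  # no fact lookups, no pre- or post-processing


    class RankingWorkflow:
        """Model serving: the whole workflow, packaged as one artifact."""

        def __init__(self, fact_store: Any, scorer: InferenceModel):
            self.fact_store = fact_store  # assumed client for the ML fact store
            self.scorer = scorer          # the optional ML-trained component

        def serve(self, ctx: dict[str, Any]) -> list[str]:
            # Fetch facts from other services mid-execution.
            facts = self.fact_store.get_facts(ctx["userId"], ctx["titles"])
            # Feature computation lives inside the workflow, not in the client.
            features = {t: self._featurize(ctx, facts, t) for t in ctx["titles"]}
            # Optional ML-trained step: the pure infer(features) -> score call.
            scores = {t: self.scorer.infer(f) for t, f in features.items()}
            # Post-processing: return a business output (a ranking), not raw scores.
            return sorted(ctx["titles"], key=lambda t: scores[t], reverse=True)

        def _featurize(self, ctx, facts, title) -> list[float]:
            raise NotImplementedError  # placeholder feature logic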

Why the distinction matters for architecture

Netflix's routing and API abstractions (see concepts/objective-abstraction and systems/netflix-model-serving-platform) "operate at the level of workflows, not just individual scoring functions." Concrete consequences:

  • Input shape. A serving request carries business context (userId, country, device + titles to rank) — not a pre-computed feature vector.
  • Backend dependencies. Serving workflows call adjacent microservices and Netflix's ML fact store during execution for feature data.
  • Packaging. A "model" on the platform is a self-contained workflow artifact, not a weights file + inference runtime.
  • Routing design. The routing layer (Switchboard / Lightbulb) needs to know about the workflow boundary, not the infer() call — e.g. shadow-mode duplication happens at the workflow request level, not the tensor level.
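
The post does not describe Switchboard / Lightbulb internals; the sketch below only illustrates the boundary claim: shadow traffic is duplicated as a full serving request, so the shadow model runs its own fact fetches and feature computation. All names are assumptions.

    import logging
    from typing import Any

    log = logging.getLogger("workflow-router")


    class ShadowingRouter:
        """Routes at the workflow-request boundary, not the infer() call."""

        def __init__(self, primary, shadow=None):
            self.primary = primary  # live serving workflow
            self.shadow = shadow    # candidate workflow receiving duplicated traffic

        def route(self, ctx: dict[str, Any]) -> Any:
            if self.shadow is not None:
                try:
                    # Duplicate the whole request: the shadow does its own
                    # fact fetches, feature computation, and scoring.
                    log.info("shadow output: %r", self.shadow.serve(ctx))
                except Exception:
                    log.exception("shadow workflow failed; ignored")
            # Callers only ever see the primary workflow's output.
            return self.primary.serve(ctx)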

Offline vs online

The distinction also affects Netflix's training-serving coherence: "during offline training, Netflix's ML fact store provides snapshots for bulk access to facilitate feature computation." The same workflow code consumes online facts during serving and offline snapshots during training. This is structurally similar to the concepts/unified-feature-preprocessing-training-vs-serving pattern, but Netflix's take is one level up: the whole workflow is the unit of consistency, not just the feature preprocessing.
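
One way to read that structurally, as a sketch: the workflow's feature code is written against a single fact-source interface, and the platform binds it to the online fact store when serving and to bulk snapshots when training. The interface and class names here are assumptions, not the fact store's real API.

    from typing import Any, Protocol


    class FactSource(Protocol):
        """What the workflow's feature computation reads, online or offline."""

        def get_facts(self, user_id: str, title_ids: list[str]) -> dict[str, Any]:
            ...


    class OnlineFactSource:
        """Serving: facts fetched from the ML fact store / adjacent microservices."""

        def __init__(self, client: Any):
            self.client = client  # assumed fact-store client

        def get_facts(self, user_id, title_ids):
            return self.client.fetch(user_id=user_id, titles=title_ids)


    class SnapshotFactSource:
        """Training: the same interface backed by bulk offline snapshots."""

        def __init__(self, snapshot: dict[str, dict[str, Any]]):
            self.snapshot = snapshot  # e.g. loaded from a fact-store snapshot

        def get_facts(self, user_id, title_ids):
            row = self.snapshot.get(user_id, {})
            return {t: row.get(t) for t in title_ids}

Because the whole workflow is written against FactSource, pre-processing, feature computation, and post-processing run identically offline and online, which is what makes the workflow, rather than just the preprocessing, the unit of consistency.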

Contrast with "inference server" shape

Many ML platforms (TensorFlow Serving, TorchServe, Triton Inference Server, SageMaker endpoints) are inference servers — they host infer(features) -> score. The client (or a feature-fetch layer outside the server) is responsible for assembling features. Netflix instead internalises the workflow, so clients only provide request context.
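
A sketch of what that difference looks like from the client side. The TF Serving URL shape (/v1/models/<name>:predict with an "instances" payload) is that project's REST convention; the serving-platform endpoint, the http session, and assemble_features are hypothetical.

    from typing import Any


    def assemble_features(user_id: str, title_id: str) -> list[float]:
        """Hypothetical client-side feature assembly: the client (or a separate
        feature-fetch layer) does its own fact lookups and joins."""
        raise NotImplementedError


    def call_inference_server(http: Any, user_id: str, title_id: str):
        # Inference-server shape: the client owns the feature vector.
        features = assemble_features(user_id, title_id)
        return http.post("/v1/models/ranker:predict", json={"instances": [features]})


    def call_serving_platform(http: Any, user_id: str, country: str,
                              device: str, titles: list[str]):
        # Workflow-serving shape: the client only supplies business context.
        ctx = {"userId": user_id, "country": country, "device": device, "titles": titles}
        return http.post("/serve/homepage-ranking", json=ctx)  # endpoint is illustrative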

Implications of Netflix's shape:

  • Pro: Clients never duplicate feature-fetch logic across use cases. Model researchers own feature freshness / correctness end-to-end.
  • Pro: Models can evolve their feature dependencies without client changes (add a new fact dependency; the serving platform just starts calling the adjacent microservice).
  • Con: Serving hosts need outbound network and credentials to reach many microservices + the fact store — a heavier runtime than a pure tensor-server.
  • Con: Workflow language / packaging standardisation is now a platform problem (the post mentions "packaged in a standard format" but doesn't describe the format).

Seen in

  • sources/2026-05-01-netflix-state-of-routing-in-model-serving — first canonical wiki articulation of Netflix's serving-vs-inference distinction. The post says explicitly: "model inference typically focuses only on an infer(features) -> score capability, [whereas] models at Netflix act as self-contained workflows that transform inputs to outputs." Flagged as load-bearing for the routing + abstraction design that follows.