SageMaker Inference Pipeline Model

SageMaker Inference Pipeline Model is a SageMaker model-composition primitive that chains two or more Docker containers inside a single SageMaker endpoint. A request flows through the containers sequentially on the same instance: the output of one container becomes the input of the next, with no network hop to an external service between stages.
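In the AWS SDK this composition is expressed as a single CreateModel request whose Containers list holds the chain in execution order, instead of the usual single PrimaryContainer. A minimal sketch of that request; the image URIs, role ARN, and S3 paths are placeholders, not Zalando's actual resources:

```python
# Sketch of a boto3 CreateModel request for an inference pipeline.
# The Containers list (rather than PrimaryContainer) chains the
# stages in execution order. All ARNs, URIs, and names below are
# placeholders for illustration.
create_model_request = {
    "ModelName": "example-inference-pipeline",
    "ExecutionRoleArn": "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    "Containers": [
        {   # Stage 1: the same scikit-learn image used during training
            "Image": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/sklearn-preprocessor:latest",
            "ModelDataUrl": "s3://example-bucket/preprocessor/model.tar.gz",
        },
        {   # Stage 2: any SageMaker-compatible framework container
            "Image": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/xgboost-model:latest",
            "ModelDataUrl": "s3://example-bucket/model/model.tar.gz",
        },
    ],
}

# With real resources this would be submitted via:
# import boto3
# boto3.client("sagemaker").create_model(**create_model_request)
```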

The canonical shape, as described in sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference:

[ incoming JSON request ]
┌───────────────────────────────┐
│ scikit-learn container        │  ← feature preprocessing
│ (JSON → feature vector)       │     (shared with training)
└───────────────────────────────┘
┌───────────────────────────────┐
│ main model container          │  ← XGBoost / PyTorch / TF
│ (feature vector → prediction) │
└───────────────────────────────┘
[ response JSON ]
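The handoff in the diagram can be sketched as plain function composition. This is a toy simulation of the container chain, not SageMaker code; the handler names, feature fields, and linear "model" are illustrative only:

```python
import json

def preprocessing_container(body: str) -> str:
    """Stage 1: JSON request -> serialized feature vector."""
    req = json.loads(body)
    features = [float(req["price"]), float(req["days_live"])]
    return json.dumps(features)

def model_container(body: str) -> str:
    """Stage 2: feature vector -> prediction (toy linear model)."""
    features = json.loads(body)
    score = 0.5 * features[0] + 0.1 * features[1]
    return json.dumps({"prediction": score})

def pipeline_endpoint(body: str) -> str:
    # SageMaker invokes the containers in order on the same instance;
    # each stage's output is the next stage's input.
    out = body
    for stage in (preprocessing_container, model_container):
        out = stage(out)
    return out

print(pipeline_endpoint('{"price": 10, "days_live": 5}'))
# → {"prediction": 5.5}
```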

Why it matters architecturally

  • Unifies feature preprocessing across training and serving. Because the scikit-learn container is exactly the same image used as a batch-transform step during training, the training vs serving preprocessing-parity problem is solved by construction, not by discipline (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).

  • Keeps the preprocessing hop local. In-process chaining on the same instance avoids the latency and failure-mode budget of a real network hop between preprocessing and model serving.

  • Lets the "main model" be any SageMaker-compatible container. Zalando explicitly names XGBoost, PyTorch, and TensorFlow as interchangeable second-stage containers. This is how the framework-flexibility requirement in the Zalando migration is met without giving up a shared preprocessing layer (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).

  • Lightweight for scale-up. Zalando's post reports that the inference-pipeline-model containers are "lightweight and optimized for serving" and scale up sufficiently fast to resolve the legacy system's slow-startup pain point.
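The parity-by-construction point can be illustrated without SageMaker: fit the preprocessing state once, serialize it, and load the identical artifact in both the training-time batch path and the serving path. A toy stand-in (a JSON-serialized centering transform, not Zalando's actual preprocessor):

```python
import json

# "Fit" preprocessing state once (toy stand-in for a fitted
# scikit-learn transformer): per-feature means for centering.
train_rows = [[1.0, 10.0], [3.0, 30.0]]
state = {"means": [sum(col) / len(col) for col in zip(*train_rows)]}

# Serialize the artifact once; both paths load this same blob.
artifact = json.dumps(state)

def preprocess(row, artifact):
    means = json.loads(artifact)["means"]
    return [x - m for x, m in zip(row, means)]

# The training-time batch transform and the serving-time request path
# consume the identical artifact, so their outputs agree by
# construction rather than by developer discipline.
batch_out = [preprocess(r, artifact) for r in train_rows]
online_out = preprocess([1.0, 10.0], artifact)
assert online_out == batch_out[0]  # parity: [-1.0, -10.0]
```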

Relationship to peer primitives

  • Distinct from Multi-Model Endpoints (one endpoint, many models multiplexed via model-artefact load-on-demand) — an inference pipeline model is one model exposed via a chain of containers, not many models multiplexed on one endpoint.
  • Distinct from SageMaker Batch Transform — inference pipelines serve per-request; batch transform scores a whole S3 dataset offline.
