SageMaker Inference Pipeline Model¶
SageMaker Inference Pipeline Model is a SageMaker model composition primitive that chains two or more Docker containers inside a single SageMaker endpoint. A request flows through the containers sequentially on the same instance — the output of one becomes the input of the next — with no inter-container network hop to an external service.
The canonical shape disclosed in sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference:
[ incoming JSON request ]
│
▼
┌──────────────────────────────┐
│ scikit-learn container │ ← feature preprocessing
│ (JSON → feature vector) │ (shared with training)
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ main model container │ ← XGBoost / PyTorch / TF
│ (feature vector → prediction)│
└──────────────────────────────┘
│
▼
[ response JSON ]
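The data flow above can be sketched as a toy chain in plain Python. This is an illustration of the composition, not SageMaker code: `preprocess`, `predict`, and the field names in the payload are hypothetical stand-ins for the two containers.

```python
import json

def preprocess(request_body: str) -> list[float]:
    """Stage 1 -- stands in for the scikit-learn container:
    parse the incoming JSON request into a feature vector."""
    payload = json.loads(request_body)
    return [float(payload["amount"]), float(payload["age"])]

def predict(features: list[float]) -> dict:
    """Stage 2 -- stands in for the main model container
    (XGBoost / PyTorch / TF): score the feature vector."""
    score = min(1.0, 0.001 * features[0] + 0.01 * features[1])
    return {"risk_score": round(score, 3)}

def pipeline_invoke(request_body: str) -> str:
    """The endpoint: each stage's output feeds the next stage,
    all on the same instance, with no external hop in between."""
    return json.dumps(predict(preprocess(request_body)))

print(pipeline_invoke('{"amount": 120, "age": 30}'))
# → {"risk_score": 0.42}
```

In the real primitive, each stage is a separate container image and SageMaker handles the hand-off between them; the composition shape is the same.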
Why it matters architecturally¶
- Unifies feature preprocessing across training and serving. Because the scikit-learn container is exactly the same image used as a batch-transform step during training, the training-vs-serving preprocessing-parity problem is solved by construction, not by discipline (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).
- Keeps the preprocessing hop local. Chaining on the same instance avoids the latency and failure-mode budget of a real network hop between preprocessing and model serving.
- Lets the "main model" be any SageMaker-compatible container. Zalando explicitly names XGBoost, PyTorch, and TensorFlow as interchangeable second-stage containers. This is how the framework-flexibility requirement in the Zalando migration is met without giving up a shared preprocessing layer (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).
- Lightweight for scale-up. Zalando's post reports that the inference-pipeline-model containers are "lightweight and optimized for serving" and scale up fast enough to resolve the legacy system's slow-startup pain point.
Relationship to peer primitives¶
- Distinct from Multi-Model Endpoints (one endpoint, many models multiplexed via model-artefact load-on-demand) — an inference pipeline model is one model exposed via a chain of containers, not many models multiplexed on one endpoint.
- Distinct from SageMaker Batch Transform — inference pipelines serve per-request; batch transform scores a whole S3 dataset offline.
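The first distinction can be made concrete with a toy contrast in plain Python (not SageMaker APIs; stage and model names are hypothetical): a pipeline model is one logical model built by composing stages, while a multi-model endpoint multiplexes many independent models behind one endpoint.

```python
def stage_a(x: int) -> int:
    return x + 1          # preprocessing stage

def stage_b(x: int) -> int:
    return x * 2          # main-model stage

def pipeline_endpoint(x: int) -> int:
    """Inference pipeline: ONE model exposed as a chain of stages."""
    return stage_b(stage_a(x))

# Multi-model endpoint: MANY models behind one endpoint, one selected
# per request by name -- dispatch, not chaining.
models = {"model-a": lambda x: x + 1,
          "model-b": lambda x: x * 2}

def multi_model_endpoint(target_model: str, x: int) -> int:
    return models[target_model](x)

print(pipeline_endpoint(3))                # → 8, both stages run: (3 + 1) * 2
print(multi_model_endpoint("model-b", 3))  # → 6, only model-b runs
```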
Seen in¶
- sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference — canonical first wiki disclosure. Two-container pipeline (scikit-learn preprocessing + XGBoost/PyTorch main model) serving Zalando Payments' deferred-payment risk-scoring models. One endpoint per model, accepting the model-per-endpoint isolation trade-off. Load-tested at 200–1000 RPS across m5.large / m5.4xlarge / m5.12xlarge.