SageMaker Inference Pipeline Model¶
SageMaker Inference Pipeline Model is a SageMaker model composition primitive that chains two or more Docker containers inside a single SageMaker endpoint. A request flows through the containers sequentially on the same instance — the output of one becomes the input of the next — with no inter-container network hop to an external service.
The canonical shape disclosed in sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference:
[ incoming JSON request ]
│
▼
┌──────────────────────────────┐
│ scikit-learn container │ ← feature preprocessing
│ (JSON → feature vector) │ (shared with training)
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ main model container │ ← XGBoost / PyTorch / TF
│ (feature vector → prediction)│
└──────────────────────────────┘
│
▼
[ response JSON ]
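The data flow above can be sketched as a toy chain in plain Python. This is an illustration of the composition, not SageMaker code: `preprocess`, `predict`, and the field names in the payload are hypothetical stand-ins for the two containers.

```python
import json

def preprocess(request_body: str) -> list[float]:
    """Stage 1 -- stands in for the scikit-learn container:
    parse the incoming JSON request into a feature vector."""
    payload = json.loads(request_body)
    return [float(payload["amount"]), float(payload["age"])]

def predict(features: list[float]) -> dict:
    """Stage 2 -- stands in for the main model container
    (XGBoost / PyTorch / TF): score the feature vector."""
    score = min(1.0, 0.001 * features[0] + 0.01 * features[1])
    return {"risk_score": round(score, 3)}

def pipeline_invoke(request_body: str) -> str:
    """The endpoint: each stage's output feeds the next stage,
    all on the same instance, with no external hop in between."""
    return json.dumps(predict(preprocess(request_body)))

print(pipeline_invoke('{"amount": 120, "age": 30}'))
# → {"risk_score": 0.42}
```

In the real primitive, each stage is a separate container image and SageMaker handles the hand-off between them; the composition shape is the same.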
Why it matters architecturally¶
- Unifies feature preprocessing across training and serving. Because the scikit-learn container is exactly the same image used as a batch-transform step during training, the training-vs-serving preprocessing-parity problem is solved by construction, not by discipline (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).
- Keeps the preprocessing hop local. Chaining on the same instance avoids the latency and failure-mode budget of a real network hop between preprocessing and model serving.
- Lets the "main model" be any SageMaker-compatible container. Zalando explicitly names XGBoost, PyTorch, and TensorFlow as interchangeable second-stage containers. This is how the framework-flexibility requirement in the Zalando migration is met without giving up a shared preprocessing layer (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference).
- Lightweight for scale-up. Zalando's post reports that the inference-pipeline-model containers are "lightweight and optimized for serving" and scale up fast enough to resolve the legacy system's slow-startup pain point.
Relationship to peer primitives¶
- Distinct from Multi-Model Endpoints (one endpoint, many models multiplexed via model-artefact load-on-demand) — an inference pipeline model is one model exposed via a chain of containers, not many models multiplexed on one endpoint.
- Distinct from SageMaker Batch Transform — inference pipelines serve per-request; batch transform scores a whole S3 dataset offline.
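The first distinction can be made concrete with a toy contrast in plain Python (not SageMaker APIs; stage and model names are hypothetical): a pipeline model is one logical model built by composing stages, while a multi-model endpoint multiplexes many independent models behind one endpoint.

```python
def stage_a(x: int) -> int:
    return x + 1          # preprocessing stage

def stage_b(x: int) -> int:
    return x * 2          # main-model stage

def pipeline_endpoint(x: int) -> int:
    """Inference pipeline: ONE model exposed as a chain of stages."""
    return stage_b(stage_a(x))

# Multi-model endpoint: MANY models behind one endpoint, one selected
# per request by name -- dispatch, not chaining.
models = {"model-a": lambda x: x + 1,
          "model-b": lambda x: x * 2}

def multi_model_endpoint(target_model: str, x: int) -> int:
    return models[target_model](x)

print(pipeline_endpoint(3))                # → 8, both stages run: (3 + 1) * 2
print(multi_model_endpoint("model-b", 3))  # → 6, only model-b runs
```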
Seen in¶
- sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference — canonical first wiki disclosure. Two-container pipeline (scikit-learn preprocessing + XGBoost/PyTorch main model) serving Zalando Payments' deferred-payment risk-scoring models. One endpoint per model, accepting the model-per-endpoint isolation trade-off. Load-tested at 200–1000 RPS across m5.large / m5.4xlarge / m5.12xlarge.