CONCEPT Cited by 1 source
Inference Pipeline Model¶
Definition¶
An inference pipeline model is a model-serving composition where a single endpoint / service exposes a chain of containers (or processing stages) that run in-process on the same instance, each transforming the request on the way to the model and (optionally) on the way out. The chain is a linear pipeline, not a DAG; each stage's output is the next stage's input; there is no network hop between stages.
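The linear-chain semantics can be sketched in a few lines of plain Python (stage functions are hypothetical illustrations, not any SageMaker API): each stage's output is the next stage's input, and composition happens in-process.

```python
from functools import reduce

def chain(stages):
    """Compose stages into one linear pipeline: each stage's output
    is the next stage's input; no network hop between stages."""
    def pipeline(request):
        return reduce(lambda payload, stage: stage(payload), stages, request)
    return pipeline

# Hypothetical stages: parse -> featurize -> predict
parse = lambda raw: {"amount": float(raw["amount"])}
featurize = lambda rec: [rec["amount"], rec["amount"] > 100]
predict = lambda feats: {"score": 0.9 if feats[1] else 0.1}

endpoint = chain([parse, featurize, predict])
endpoint({"amount": "250.0"})  # -> {"score": 0.9}
```

Because the chain is linear rather than a DAG, `chain` needs no routing logic: the stage list fully determines the data flow.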
Canonical shape¶
From Zalando's 2021 disclosure of a SageMaker inference pipeline model for deferred-payment fraud scoring:
[ request JSON ]
→ scikit-learn container (feature extraction)
→ main model container (XGBoost / PyTorch / TF prediction)
→ [ response ]
Both containers are Docker images deployed together behind one SageMaker endpoint instance via the SageMaker Inference Pipeline Model primitive. From the caller's point of view, it is one endpoint with one request/response shape.
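A rough in-process simulation of that two-stage contract, assuming stages hand each other serialized request/response bodies the way chained containers do (all names and the scoring rule are illustrative, not the SageMaker API or Zalando's model):

```python
import json

def sklearn_stage(body: bytes) -> bytes:
    """Stage 1: feature extraction, standing in for the scikit-learn container."""
    record = json.loads(body)
    features = [record.get("order_value", 0.0),
                1.0 if record.get("is_new_customer") else 0.0]
    return json.dumps({"features": features}).encode()

def model_stage(body: bytes) -> bytes:
    """Stage 2: prediction, standing in for the XGBoost/PyTorch/TF container."""
    features = json.loads(body)["features"]
    score = min(1.0, 0.2 + 0.5 * features[1])  # toy scoring rule
    return json.dumps({"risk_score": score}).encode()

def endpoint(body: bytes) -> bytes:
    # One endpoint, one request/response shape; stages chained in-process.
    return model_stage(sklearn_stage(body))
```

The caller only ever sees `endpoint`: the intermediate `{"features": ...}` payload is an internal contract between the two stages.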
Why it exists as a serving primitive¶
- Preprocessing needs to happen somewhere in the request path. Production feature extraction (parsing request JSON, computing derived features, handling missing values, one-hot encoding) is an intrinsic part of serving; it has to run before the model.
- External preprocessing service → extra network hop + extra deploy + extra failure mode. A separate microservice adds an SLO to meet, an independent deployment surface, and a cold-start dimension.
- Preprocessing inside the model container → framework lock-in. If preprocessing is bundled into the XGBoost/PyTorch/TF image, changing the main framework means rewriting preprocessing too.
- Inference pipeline model splits the two concerns without adding a network hop. Feature extraction lives in its own (scikit-learn) container; the model container stays framework-appropriate; the stages chain in-process on one instance.
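The framework-swap property above can be made concrete with toy stand-ins (neither function is a real framework; the coefficients are arbitrary): only the model stage is replaced, while the feature-extraction stage is untouched.

```python
def extract_features(request: dict) -> list:
    # Stage 1: fixed feature extraction; never rewritten on a model swap.
    return [request["amount"], len(request["items"])]

def xgboost_like(features: list) -> float:
    return 0.1 * features[0] + 0.2 * features[1]   # stand-in for an XGBoost model

def pytorch_like(features: list) -> float:
    return 0.05 * features[0] + 0.3 * features[1]  # stand-in for a PyTorch model

def serve(request: dict, model_stage) -> float:
    # Swapping model_stage leaves extract_features (and its container) untouched.
    return model_stage(extract_features(request))

req = {"amount": 10.0, "items": ["a", "b"]}
serve(req, xgboost_like)  # later: serve(req, pytorch_like)
```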
Trade-offs¶
- (+) Preprocessing parity with training by construction — the same container image is used as the preprocessing step in a batch transform during training, and as stage 1 of serving. See concepts/unified-feature-preprocessing-training-vs-serving.
- (+) Main model container can be swapped (XGBoost → PyTorch → TensorFlow) without rewriting feature extraction.
- (+) No network hop / independent-service failure mode between preprocessing and model.
- (–) Both containers share one instance's compute budget. CPU / memory sizing must accommodate both stages' combined peak demand.
- (–) Stage parallelism inside one request is limited by the container chain — it's a linear pipeline.
- (–) Scaling is coupled: you scale preprocessing + model together per-endpoint; you cannot scale preprocessing independently.
Contrast with peer serving shapes¶
- External preprocessing microservice — preprocessing has its own endpoint and autoscaling; adds a network hop.
- Monolithic container with preprocessing + model baked in — one image, framework-locked; harder to swap model frameworks.
- Multi-model endpoint — one endpoint serving many models via artefact load-on-demand; orthogonal to the inference-pipeline-model shape (an MME can itself expose inference pipelines per model).
Seen in¶
- sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference — Zalando Payments' deferred-payment risk-scoring pipeline, first wiki canonical instance. scikit-learn preprocessing container + XGBoost / PyTorch / TensorFlow main-model container co-located on each SageMaker endpoint. Zalando calls out that the same scikit-learn image is the training-time preprocessing step (batch transform job), resolving the preprocessing-parity problem by image equality rather than code discipline.