PATTERN
Unified Feature Extraction for Training and Serving¶
Intent¶
Write feature extraction once and use the same artefact in both training and serving. Avoid the canonical bug where training preprocessing and serving preprocessing are two separate implementations that drift apart, silently producing a different feature distribution at serve time than the model was trained on.
Context¶
Production ML pipelines need feature extraction in two places:
- Offline, during training — derive features from a large historical dataset to produce the training set the model learns from.
- Online, during serving — derive features from incoming request JSON to produce the inference input the model scores.
The temptation is to implement these twice: training preprocessing in a notebook or Spark job, serving preprocessing in a microservice, often in a different language. The two implementations will drift.
Solution¶
Package feature extraction as a single container image (commonly a scikit-learn image) and deploy it in both paths:
- Training path — run it as a batch-transform step over the historical raw dataset; the output becomes the training dataset.
- Serving path — run it as the first stage of a SageMaker Inference Pipeline Model; its output feeds the main-model container's inference.
Because both paths invoke the same container image, they are identical by construction. No code discipline is required.
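The core of the pattern can be sketched in a few lines: one feature-extraction function is the single artefact, and both the batch path and the request path are thin wrappers around it. The feature names and logic below are illustrative, not Zalando's actual features:

```python
import json

# The single shared artefact: one feature-extraction function.
# Feature logic here is a toy example, not from the source article.
def extract_features(record: dict) -> list:
    amount = float(record.get("amount", 0.0))
    n_prior_orders = int(record.get("n_prior_orders", 0))
    return [
        amount,
        1.0 if amount > 100.0 else 0.0,  # example bucket boundary
        float(n_prior_orders),
    ]

# Training path: batch transform over the historical raw dataset.
def batch_transform(raw_records: list) -> list:
    return [extract_features(r) for r in raw_records]

# Serving path: per-request handler, stage 1 of the inference pipeline.
def handle_request(body: str) -> list:
    return extract_features(json.loads(body))
```

Because both wrappers call the same function from the same artefact, a record scored online is guaranteed to produce the same feature vector it would have produced in the training set.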
Canonical instance¶
Zalando Payments' 2021 deferred-payment risk-scoring pipeline (sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference):
Training path (Step Functions workflow):
raw JSON training set
→ scikit-learn container (SageMaker batch transform)
→ feature training set
→ SageMaker training job (XGBoost)
→ model artefact
Serving path (SageMaker endpoint):
request JSON
→ scikit-learn container (pipeline stage 1) ← same image
→ feature vector
→ XGBoost container (pipeline stage 2)
→ prediction
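Conceptually, the inference pipeline chains containers by piping each stage's response body into the next stage's request body inside one endpoint. A minimal sketch of that contract, with toy feature and model logic standing in for the real HTTP containers (not SageMaker's actual implementation):

```python
import json

# Stage 1: the scikit-learn image's invocation handler (toy feature logic).
def preprocessor_invocations(body: bytes) -> bytes:
    record = json.loads(body)
    features = [float(record["amount"])]
    return json.dumps(features).encode()

# Stage 2: the XGBoost image's invocation handler (toy stand-in model).
def model_invocations(body: bytes) -> bytes:
    features = json.loads(body)
    score = 1.0 if features[0] > 100 else 0.0
    return json.dumps({"score": score}).encode()

# The platform pipes stage 1's output into stage 2's input.
def pipeline(request: bytes) -> bytes:
    return model_invocations(preprocessor_invocations(request))
```

The same stage-1 handler, run in batch mode over the historical dataset, produces the training set, which is what makes the two paths identical.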
Stated goal, verbatim:
"The preprocessing applied to incoming requests in production must be identical to that applied to the training data. We want to avoid implementing this logic twice for both cases." (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference)
Consequences¶
Pros:
- Preprocessing drift between training and serving becomes impossible by construction.
- Feature-engineering changes (new feature, bucket boundary change, NaN-handling tweak) ship via one artefact update that both paths pick up.
- Main-model container is free to swap frameworks (XGBoost → PyTorch → TF) without touching feature code.
Cons:
- Requires a platform with a multi-container / pipeline abstraction (SageMaker Inference Pipeline Model; or equivalent in a custom serving stack).
- The container must perform well in both modes: batch throughput over the full training dataset and per-request latency online.
- The preprocessing layer becomes its own artefact: one container image to build and test, with its own CI pipeline and release surface to maintain.
Variants¶
- Feature store (Lyft, Uber) — a store-owned derivation engine that exposes the same feature API for training batch jobs and online serving; richer because it also caches values, not just shares the code. See systems/lyft-feature-store.
- Shared library instead of shared container — both training and serving import a common Python package. Weaker hermetic guarantee: differing runtime / dependency versions can still drift.
- DSL compiled to both paths — declare features in a spec language that generates both the training pipeline and the serving pipeline. Requires a platform team capable of building + maintaining the compiler.
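For the shared-library variant, the drift risk can be narrowed (though not eliminated) by pinning the package and its transitive dependencies identically in both environments. A sketch with hypothetical package names:

```
# requirements.txt, shared verbatim by the training job and the serving image
# (package names and versions are illustrative)
feature-extraction-lib==2.3.1
scikit-learn==1.4.2   # pin transitive deps too; minor-version skew can change behaviour
numpy==1.26.4
```

Even with identical pins, base-image or Python-version differences can still diverge, which is why the shared-container form of the pattern gives the stronger hermetic guarantee.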
Seen in¶
- sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference — Zalando Payments, 2021. scikit-learn container used as a SageMaker batch-transform step during training and as stage 1 of a SageMaker Inference Pipeline Model at serving time. Canonical first wiki instance.