Unified Feature Preprocessing across Training vs Serving¶
Definition¶
A design requirement in production ML: the preprocessing applied to incoming requests in production must be identical to that applied to the training data, so that a model trained on feature vectors derived one way doesn't silently see a different distribution of feature vectors derived another way at serving time. The usual fix is a single implementation of preprocessing, used in both training and serving, rather than two separately written pipelines that are merely claimed to be equivalent.
Why this matters¶
- Two codebases drift. If feature extraction is written once in a training script (Python / pandas / numpy) and again in a serving path (often a different language or framework), any bug fix or schema evolution has to be applied in two places. The two inevitably drift, producing online/offline discrepancy and silently degraded model accuracy in production.
- Silent discrepancy is the worst failure mode. Unlike a crash, feature-preprocessing drift doesn't error — it just produces predictions off the intended distribution. Quality metrics degrade gradually, and root-cause attribution is hard.
- ML platforms know this. Zalando names it as a hard requirement in the 2021 deferred-payment pipeline rewrite: "The preprocessing applied to incoming requests in production must be identical to that applied to the training data. We want to avoid implementing this logic twice for both cases." (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference)
Unification strategies¶
Strategy A — shared container image (Zalando pattern)¶
Write preprocessing once as a Docker container (e.g. a scikit-learn image). Use it:
- At training time as a batch-transform preprocessing step over the raw training dataset → feature dataset → model training.
- At serving time as the first stage of a SageMaker Inference Pipeline Model that takes request JSON in and emits a feature vector to the next container (main model).
The image is binary-identical in both paths; drift is impossible by construction. Canonical instance: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference.
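The core of the pattern can be sketched in a few lines — a toy, stdlib-only illustration of the single-implementation idea, not the actual Zalando/SageMaker code (all field names and helper functions here are hypothetical): one `preprocess` function is the only code that turns a raw record into a feature vector, and both the training-time batch step and the serving-time request handler call it.

```python
import json

# Hypothetical single preprocessing implementation: this function is the
# ONLY code that maps a raw record to a feature vector, in both paths.
def preprocess(record: dict) -> list:
    amount = float(record.get("order_amount", 0.0))
    n_items = float(record.get("item_count", 0.0))
    is_new = 1.0 if record.get("customer_is_new") else 0.0
    return [amount, n_items, is_new]

# Training path: batch-transform raw training rows into the feature dataset.
def batch_transform(raw_rows: list) -> list:
    return [preprocess(r) for r in raw_rows]

# Serving path: stage 1 of the inference pipeline — request JSON in,
# feature vector out, handed to the next container (the model).
def handle_request(body: str) -> list:
    return preprocess(json.loads(body))
```

Because both entry points dispatch to the same function (in the Zalando pattern, the same container image), a row seen at training time and the same row arriving as a request JSON produce bit-identical feature vectors.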
Strategy B — shared feature-store abstraction¶
A feature store (Lyft, Uber, etc.) exposes the same feature lookup/compute API to both training batch jobs and online serving; derivations are declared once and executed by the store in the right mode. This is a different shape from Strategy A: the store materializes and serves the feature values themselves, rather than just sharing the preprocessing code.
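A highly simplified sketch of the declare-once idea (real feature stores register features through a DSL or SDK and manage offline/online storage; the registry, field names, and function below are all hypothetical): the derivation lives in one place, and only the data source and timestamp differ between batch and online execution.

```python
from datetime import datetime

# Hypothetical feature registry: each feature's derivation is declared once;
# the store executes it in batch mode (training) or online mode (serving).
FEATURE_DEFS = {
    "days_since_signup": lambda row, now: (now - row["signup_date"]).days,
}

def compute_features(row: dict, now: datetime, names: list) -> dict:
    # Same derivation code in both modes; only the inputs differ:
    # batch replays historical rows at their event timestamps,
    # online looks up the latest row and uses the current time.
    return {n: FEATURE_DEFS[n](row, now) for n in names}
```

The point is that nobody ever writes `days_since_signup` twice; training and serving disagree only if the store itself is buggy, not because two teams maintained two derivations.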
Strategy C — shared library / SDK¶
Both training and serving import the same preprocessing library (e.g. a features_v3.py module). This is less hermetic than a shared image (different language runtimes or dependency versions can still diverge), but cheap for a single-language shop.
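A minimal sketch of the shared-module shape (the module and function names are hypothetical, standing in for the features_v3.py example above): the bucketing boundaries live in exactly one place, and both entry points import them.

```python
# shared_features.py — the one preprocessing module; train.py and serve.py
# both do `from shared_features import bucketize_amount`, so a boundary
# change propagates to both paths in a single commit.
def bucketize_amount(amount: float) -> int:
    """Map an order amount to a price bucket; boundaries are defined here once."""
    for i, upper in enumerate((10.0, 50.0, 200.0)):
        if amount < upper:
            return i
    return 3  # top bucket: amount >= 200.0
```

The residual risk the text names still applies: the training and serving deployments can pin different versions of the package containing this module, which is exactly the hermeticity a shared image buys back.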
Failure modes the unification prevents¶
- Different library versions — training uses scikit-learn==1.0, serving uses scikit-learn==0.24; one-hot-encoding column order changes.
- Different language — training in Python / pandas, serving in Go / Java reimplements bucketing; a float bucket boundary is off by one.
- Different default handling — training fills NaN with 0, serving crashes or fills with the mean.
- Different feature order — the training schema is reordered; serving still ships the old order; the model interprets features in the wrong positions.
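The default-handling case illustrates why these failures are silent — a toy reproduction (both fill functions are hypothetical stand-ins for separately written pipelines), where neither path raises an error yet the two produce different feature values for the same input:

```python
import math

# Two separately written missing-value fills, as in the failure mode above.
def training_fill(x: float) -> float:
    # training pipeline: NaN -> 0
    return 0.0 if math.isnan(x) else x

def serving_fill(x: float, mean: float) -> float:
    # serving pipeline: NaN -> column mean
    return mean if math.isnan(x) else x
```

For a present value the two agree, so tests on clean data pass; only a missing value exposes the drift, and then only as a shifted feature, never as an exception.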
Seen in¶
- sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference — Strategy A canonical instance. scikit-learn container used simultaneously as a SageMaker batch-transform preprocessing step during training and as stage 1 of the serving-time inference pipeline model. Zalando names the requirement verbatim and names the SageMaker Inference Pipeline Model as the mechanism that satisfies it.