
CONCEPT

Unified Feature Preprocessing across Training vs Serving

Definition

A design requirement in production ML: the preprocessing applied to incoming requests in production must be identical to the preprocessing applied to the training data, so that a model trained on feature vectors derived one way doesn't silently receive a different distribution of feature vectors derived another way at serving time. The usual fix is a single implementation of preprocessing, used in both training and serving, rather than two separately written pipelines that are merely claimed to be equivalent.

Why this matters

  1. Two codebases drift. If feature extraction is written once in a training script (Python / pandas / numpy) and again in a serving path (often a different language or framework), any bug-fix or schema evolution has to be applied in two places. The two inevitably drift, producing online/offline discrepancy and silently degraded model accuracy in production.

  2. Silent discrepancy is the worst failure mode. Unlike a crash, feature-preprocessing drift doesn't error — it just produces predictions off the intended distribution. Quality metrics degrade gradually; root-cause attribution is hard.

  3. ML platforms know this. Zalando names it as a hard requirement in the 2021 deferred-payment pipeline rewrite:

"The preprocessing applied to incoming requests in production must be identical to that applied to the training data. We want to avoid implementing this logic twice for both cases." (Source: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference)

Unification strategies

Strategy A — shared container image (Zalando pattern)

Write preprocessing once as a Docker container (e.g. a scikit-learn image). Use it:

  • At training time as a batch-transform preprocessing step over the raw training dataset → feature dataset → model training.
  • At serving time as the first stage of a SageMaker Inference Pipeline Model that takes request JSON in and emits a feature vector to the next container (main model).

The image is binary-identical in both paths; drift is impossible by construction. Canonical instance: sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference.
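A minimal sketch of the shared-implementation idea behind this pattern, independent of SageMaker specifics (function and field names here are illustrative, not Zalando's): one transform function, packaged once, invoked by both a batch entrypoint over the raw training dataset and a request handler at serving time.

```python
import json


def transform_record(record: dict) -> list:
    """The single preprocessing implementation: raw record -> feature vector."""
    return [
        float(record.get("order_amount", 0.0)),
        1.0 if record.get("is_returning_customer") else 0.0,
        float(record.get("days_since_signup", 0.0)),
    ]


def batch_entrypoint(raw_lines: list) -> list:
    """Training path: batch-transform over the raw dataset (one JSON record per line)."""
    return [transform_record(json.loads(line)) for line in raw_lines]


def serve_handler(request_body: str) -> str:
    """Serving path: first stage of the inference pipeline -- request JSON in,
    feature vector out to the next container (the main model)."""
    features = transform_record(json.loads(request_body))
    return json.dumps({"features": features})
```

Because both entrypoints ship in the same image and call the same `transform_record`, there is no second implementation that could drift.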

Strategy B — shared feature-store abstraction

A feature store (Lyft, Uber, etc.) exposes the same feature lookup/compute API to both training batch jobs and online serving; derivations are declared once and executed by the store in the appropriate mode. This is a different shape from Strategy A: the store shares the computed feature values themselves, not just the code that computes them.
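A toy sketch of the feature-store shape, to make the "declare once, execute in either mode" contract concrete. Everything here is illustrative; real stores (Feast, Uber's Michelangelo, Lyft's system) add point-in-time-correct joins, TTLs, and backfills.

```python
import math


class TinyFeatureStore:
    """Toy feature store: derivations registered once, served in two modes."""

    def __init__(self):
        self._derivations = {}  # feature name -> fn(raw_row) -> value
        self._online = {}       # entity_id -> {feature name: value}

    def register(self, name):
        """Decorator: declare a feature derivation exactly once."""
        def wrap(fn):
            self._derivations[name] = fn
            return fn
        return wrap

    def materialize_batch(self, raw_rows):
        """Training path: compute features over historical rows,
        warming the online store with the same values as a side effect."""
        out = []
        for row in raw_rows:
            feats = {n: fn(row) for n, fn in self._derivations.items()}
            self._online[row["entity_id"]] = feats
            out.append(feats)
        return out

    def get_online(self, entity_id):
        """Serving path: low-latency lookup of the identical values."""
        return self._online[entity_id]


store = TinyFeatureStore()


@store.register("amount_log_bucket")
def amount_log_bucket(row):
    # Bucketing logic lives in exactly one place.
    return int(math.log10(max(row["amount"], 1)))
```

Training reads the output of `materialize_batch`; serving calls `get_online` and, by construction, sees values produced by the same registered derivation.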

Strategy C — shared library / SDK

Both training and serving import the same preprocessing library (e.g. a features_v3.py module). Less hermetic than a shared image (different language runtimes or dependency versions can still diverge), but cheap for a single-language shop.
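One way to shore up the "less hermetic" weakness of a shared library, sketched below under assumed names (the `features_v3.py` module name comes from the text; `schema_fingerprint` is a hypothetical guard, not a standard API): the module exposes a fingerprint of its feature schema, training stores it with the model artifact, and serving refuses to load a model whose fingerprint doesn't match the module version it imported.

```python
# features_v3.py -- the one preprocessing module both paths import
import hashlib
import json

FEATURE_ORDER = ["amount", "country_is_de", "basket_size"]


def extract(raw: dict) -> list:
    """The single preprocessing implementation, in FEATURE_ORDER."""
    return [
        float(raw.get("amount", 0.0)),
        1.0 if raw.get("country") == "DE" else 0.0,
        float(len(raw.get("basket", []))),
    ]


def schema_fingerprint() -> str:
    """Saved alongside the trained model; serving compares at startup.
    Turns the silent failure mode (same module name, diverged version)
    into a loud refusal to load."""
    return hashlib.sha256(json.dumps(FEATURE_ORDER).encode()).hexdigest()[:12]
```

This doesn't make the two runtimes identical, but it converts undetected drift into a startup error, which is the cheap approximation of Strategy A's by-construction guarantee.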

Failure modes the unification prevents

  • Different library versions — training uses scikit-learn==1.0, serving uses scikit-learn==0.24, one-hot-encoding order changes.
  • Different language — training in Python / pandas, serving in Go / Java reimplements bucketing; float boundary off-by-one.
  • Different default handling — training fills NaN with 0; serving crashes on the NaN, or imputes a different default (e.g. a column mean) and silently shifts the feature.
  • Different feature order — training schema reordered; serving still ships old order; model interprets features in wrong positions.
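The last failure mode is easy to demonstrate: a reordered feature vector produces no error, just a score from the wrong distribution. A minimal worked example with an illustrative two-weight linear model (weights and feature names invented for the demo):

```python
def dot(weights, features):
    """Plain dot product standing in for a linear model's score."""
    return sum(w * x for w, x in zip(weights, features))


# Hypothetical model trained on features in the order [amount, n_items]:
# amount contributes little, n_items a lot.
weights = [0.01, 2.0]
record = {"amount": 100.0, "n_items": 3.0}

train_order = ["amount", "n_items"]
correct = dot(weights, [record[f] for f in train_order])  # 0.01*100 + 2*3 = 7.0

# Serving still ships the old schema with the columns swapped.
stale_order = ["n_items", "amount"]
wrong = dot(weights, [record[f] for f in stale_order])    # 0.01*3 + 2*100 = 200.03
```

No exception, no log line: the model simply reads `n_items` where it expects `amount`, which is exactly the silent degradation described above.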

Seen in

  • sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference — Strategy A canonical instance. scikit-learn container used simultaneously as a SageMaker batch-transform preprocessing step during training and as stage 1 of the serving-time inference pipeline model. Zalando names the requirement verbatim and names the SageMaker Inference Pipeline Model as the mechanism that satisfies it.