
Data pre-processing vs data transformation split

Definition

Data pre-processing and data transformation are two distinct stages of feature-engineering pipelines, with different objectives, tooling, and scaling characteristics. Zalando's 2025-06-29 ZEOS post establishes this split as load-bearing architectural vocabulary for the whole ML platform.

Primary objective
  • Pre-processing: model upstream data products into a human-understandable structure that represents the business problem, enabling easier validation and analysis.
  • Transformation: engineer features from the pre-processed data to maximise predictive signal for model training.

Example operations
  • Pre-processing: joins, filters, aggregations.
  • Transformation: encoding, normalisation.

Typical libraries
  • Pre-processing: PySpark, Spark SQL.
  • Transformation: Pandas, scikit-learn, NumPy, Numba.

Architectural advantage
  • Pre-processing: distributed processing handles very large input volumes efficiently.
  • Transformation: markedly more efficient because it operates on already-clean pre-processed data rather than raw events.

Scaling strategy
  • Pre-processing: horizontal; add worker nodes as data volume grows.
  • Transformation: vertical; the dependent libraries lack native distribution, so scale up the instance.
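The pre-processing tier can be sketched as follows. Zalando runs these joins, filters, and aggregations in PySpark over raw events; this is a small pandas approximation with hypothetical table and column names, chosen only to show the shape of the operations and of the human-inspectable output.

```python
import pandas as pd

# Hypothetical raw inputs; the real tier reads large upstream data products.
sales = pd.DataFrame({
    "article_id": ["A1", "A1", "A2", "A2"],
    "date": pd.to_datetime(["2025-06-01", "2025-06-02",
                            "2025-06-01", "2025-06-02"]),
    "units": [3, 0, 5, 2],
})
availability = pd.DataFrame({
    "article_id": ["A1", "A1", "A2", "A2"],
    "date": pd.to_datetime(["2025-06-01", "2025-06-02",
                            "2025-06-01", "2025-06-02"]),
    "in_stock": [True, False, True, True],
})

# Join, filter, aggregate: the output is a cleaned per-article time-series
# data product, readable by a human analyst, not yet a model feature.
timeline = (
    sales.merge(availability, on=["article_id", "date"], how="left")
         .query("in_stock")
         .groupby(["article_id", "date"], as_index=False)["units"].sum()
)
print(timeline)
```

In the PySpark original, the same join/filter/aggregate chain scales horizontally across worker nodes; the logic is unchanged.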

Why the split matters

  • Different scaling levers. You cannot just "add more workers" to solve slow Pandas / scikit-learn / NumPy code — those libraries don't parallelise natively. Conversely, you don't need PySpark's overhead for a filter-and-join on already-cleaned data. Pick the right tier for the right problem.
  • Separate output contracts. Pre-processing produces "time-series representations of all articles' sales and availability over a configurable timeline" — ready for analysis and validation. Transformation produces "feature vectors" — ready for model training.
  • Independent review / debugging paths. The pre-processing output is human-inspectable (cleaned time-series); the transformation output is model-optimised (encoded, normalised, lagged).
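The transformation tier's contract can be sketched the same way: it takes the cleaned time-series and emits feature vectors. Production code would typically use scikit-learn's encoders and scalers; this is an equivalent pandas/NumPy sketch with hypothetical columns, to show the encode-and-normalise step the table names.

```python
import numpy as np
import pandas as pd

# Pre-processed, human-readable input (hypothetical columns).
timeline = pd.DataFrame({
    "article_id": ["A1", "A2", "A1", "A2"],
    "units": [3.0, 5.0, 0.0, 2.0],
})

# Encode categoricals and normalise numerics into model-ready vectors.
encoded = pd.get_dummies(timeline, columns=["article_id"])   # one-hot encoding
mu, sigma = encoded["units"].mean(), encoded["units"].std(ddof=0)
encoded["units"] = (encoded["units"] - mu) / sigma           # z-score normalisation

features = encoded.to_numpy(dtype=np.float32)                # feature vectors
print(features.shape)  # one row per observation, one column per feature
```

Note the vertical-scaling implication: nothing here distributes across nodes, so a slow run is fixed by a bigger instance, not more workers.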

Where time-series-specific feature generation lives

Zalando explicitly notes that target lags / transformations, exogenous feature lags / transformations, and temporal features are handled in neither tier — they are handed off to Nixtla's MLForecast (which uses Numba under the hood). The split is a platform-level vocabulary, not a prescription for every feature.
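What that hand-off produces can be approximated in plain pandas: target lags and calendar features derived per series. The column names below are illustrative; the real pipeline configures lags and date features on Nixtla's MLForecast object rather than computing them by hand.

```python
import pandas as pd

# Hypothetical per-article time-series, as the pre-processing tier would emit it.
ts = pd.DataFrame({
    "article_id": ["A1"] * 5,
    "date": pd.date_range("2025-06-01", periods=5, freq="D"),
    "units": [3, 0, 4, 2, 5],
})

# Target lags, computed within each series (grouped shift).
ts["lag_1"] = ts.groupby("article_id")["units"].shift(1)
ts["lag_2"] = ts.groupby("article_id")["units"].shift(2)

# Temporal feature derived from the timestamp.
ts["dayofweek"] = ts["date"].dt.dayofweek

print(ts.dropna())  # rows with complete lag history
```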

Canonical instance (Zalando ZEOS)

Both the demand forecaster and replenishment recommender pipelines use this split verbatim.
