CONCEPT
Data pre-processing vs data transformation split¶
Definition¶
Data pre-processing and data transformation are two distinct stages of feature-engineering pipelines, with different objectives, tooling, and scaling characteristics. Zalando's 2025-06-29 ZEOS post names this split as load-bearing architectural vocabulary for the whole ML platform.
| Criterion | Data pre-processing | Data transformation |
|---|---|---|
| Primary objective | Model upstream data products to represent the business problem in a human-understandable structure, enabling easier validation and analysis | Engineer features from pre-processed data to maximise predictive signals for model training |
| Example operations | Joins, filters, aggregations | Encoding, normalisation |
| Typical libraries | PySpark, Spark SQL | Pandas, scikit-learn, NumPy, Numba |
| Architectural advantage | Distributed processing enables efficient handling of very large input volumes | Significantly improved efficiency by operating on already-clean pre-processed data rather than raw events |
| Scaling strategy | Horizontal — add worker nodes as data volume grows | Vertical — dependent libraries lack native distribution; scale up the instance |
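The contrast in the table can be sketched end-to-end. This is an illustrative toy, not Zalando's code: the DataFrames, column names, and values are hypothetical, and pandas stands in for PySpark in the pre-processing half purely for brevity.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# --- Pre-processing tier: joins, filters, aggregations ---
# Hypothetical raw inputs: sales events and availability flags per article.
sales = pd.DataFrame({
    "article_id": ["a1", "a1", "a2", "a2"],
    "date": pd.to_datetime(["2025-01-01", "2025-01-02",
                            "2025-01-01", "2025-01-02"]),
    "units_sold": [3, 5, 0, 7],
})
availability = pd.DataFrame({
    "article_id": ["a1", "a2"],
    "in_stock": [True, True],
})

# Join, filter, aggregate into a human-readable per-article time series.
timeseries = (
    sales.merge(availability, on="article_id")   # join
         .query("in_stock")                      # filter
         .groupby(["article_id", "date"], as_index=False)["units_sold"]
         .sum()                                  # aggregate
)

# --- Transformation tier: encoding, normalisation ---
# One-hot encode the article id; normalise sales for model training.
features = pd.get_dummies(timeseries, columns=["article_id"])
features["units_sold"] = StandardScaler().fit_transform(
    features[["units_sold"]]
)
```

The intermediate `timeseries` frame is the human-inspectable artifact of the first tier; `features` is the model-facing artifact of the second.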
Why the split matters¶
- Different scaling levers. You cannot just "add more workers" to solve slow Pandas / scikit-learn / NumPy code — those libraries don't parallelise natively. Conversely, you don't need PySpark's overhead for a filter-and-join on already-cleaned data. Pick the right tier for the right problem.
- Separate output contracts. Pre-processing produces "time-series representations of all articles' sales and availability over a configurable timeline" — ready for analysis and validation. Transformation produces "feature vectors" — ready for model training.
- Independent review / debugging paths. The pre-processing output is human-inspectable (cleaned time-series); the transformation output is model-optimised (encoded, normalised, lagged).
Where time-series-specific feature generation lives¶
Zalando explicitly notes that target lags / transformations, exogenous feature lags / transformations, and temporal features are handled in neither tier — they are handed off to Nixtla's MLForecast (which uses Numba under the hood). The split is a platform-level vocabulary, not a prescription for every feature.
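For intuition about what gets delegated, the three feature families MLForecast handles (target lags, target transformations, temporal features) can be hand-rolled in plain pandas. This is an illustration of the feature types only, not MLForecast's API; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical pre-processed time series for a single article.
ts = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=10, freq="D"),
    "units_sold": [3, 5, 2, 7, 4, 6, 8, 5, 9, 4],
})

# Target lags: yesterday's and last week's sales as features.
ts["lag_1"] = ts["units_sold"].shift(1)
ts["lag_7"] = ts["units_sold"].shift(7)

# Target transformation: rolling mean over the previous 3 days
# (shifted so each row only sees past values).
ts["rolling_mean_3"] = ts["units_sold"].shift(1).rolling(3).mean()

# Temporal features: derived from the timestamp itself.
ts["dayofweek"] = ts["date"].dt.dayofweek
ts["month"] = ts["date"].dt.month
```

MLForecast generates equivalents of these declaratively (and JIT-compiles custom transformations with Numba), which is why neither pipeline tier needs to own them.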
Canonical instance (Zalando ZEOS)¶
Both the demand forecaster and replenishment recommender pipelines implement this split exactly as described:
- Pre-processing tier → PySpark on Databricks transient job clusters writing to Delta Lake.
- Transformation tier → SageMaker Processing Job with Pandas / scikit-learn / NumPy / Numba.
Seen in¶
- sources/2025-06-29-zalando-building-a-dynamic-inventory-optimisation-system-a-deep-dive — canonical source; the post's own summary table codifies the two tiers.
Related¶
- concepts/horizontal-vs-vertical-scalability-for-feature-engineering — scaling strategy per tier.
- systems/apache-spark · systems/databricks — pre-processing substrate.
- systems/sagemaker-processing-job · systems/numba — transformation substrate.
- systems/zeos-demand-forecaster · systems/zeos-replenishment-recommender
- patterns/pyspark-preprocessing-to-python-transformation-split
- companies/zalando