CONCEPT
Data pre-processing vs data transformation split¶
Definition¶
Data pre-processing and data transformation are two distinct stages of feature-engineering pipelines, with different objectives, tooling, and scaling characteristics. Zalando's 2025-06-29 ZEOS post names this split as load-bearing architectural vocabulary for the whole ML platform.
| Criterion | Data pre-processing | Data transformation |
|---|---|---|
| Primary objective | Model upstream data products to represent the business problem in a human-understandable structure, enabling easier validation and analysis | Engineer features from pre-processed data to maximise predictive signals for model training |
| Example operations | Joins, filters, aggregations | Encoding, normalisation |
| Typical libraries | PySpark, Spark SQL | Pandas, scikit-learn, NumPy, Numba |
| Architectural advantage | Distributed processing enables efficient handling of very large input volumes | Significantly improved efficiency by operating on already-clean pre-processed data rather than raw events |
| Scaling strategy | Horizontal — add worker nodes as data volume grows | Vertical — dependent libraries lack native distribution; scale up the instance |
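The contrast in the table can be sketched end-to-end. This is an illustrative toy, not Zalando's code: the DataFrames, column names, and values are hypothetical, and pandas stands in for PySpark in the pre-processing half purely for brevity.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# --- Pre-processing tier: joins, filters, aggregations ---
# Hypothetical raw inputs: sales events and availability flags per article.
sales = pd.DataFrame({
    "article_id": ["a1", "a1", "a2", "a2"],
    "date": pd.to_datetime(["2025-01-01", "2025-01-02",
                            "2025-01-01", "2025-01-02"]),
    "units_sold": [3, 5, 0, 7],
})
availability = pd.DataFrame({
    "article_id": ["a1", "a2"],
    "in_stock": [True, True],
})

# Join, filter, aggregate into a human-readable per-article time series.
timeseries = (
    sales.merge(availability, on="article_id")   # join
         .query("in_stock")                      # filter
         .groupby(["article_id", "date"], as_index=False)["units_sold"]
         .sum()                                  # aggregate
)

# --- Transformation tier: encoding, normalisation ---
# One-hot encode the article id; normalise sales for model training.
features = pd.get_dummies(timeseries, columns=["article_id"])
features["units_sold"] = StandardScaler().fit_transform(
    features[["units_sold"]]
)
```

The intermediate `timeseries` frame is the human-inspectable artifact of the first tier; `features` is the model-facing artifact of the second.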
Why the split matters¶
- Different scaling levers. You cannot just "add more workers" to solve slow Pandas / scikit-learn / NumPy code — those libraries don't parallelise natively. Conversely, you don't need PySpark's overhead for a filter-and-join on already-cleaned data. Pick the right tier for the right problem.
- Separate output contracts. Pre-processing produces "time-series representations of all articles' sales and availability over a configurable timeline" — ready for analysis and validation. Transformation produces "feature vectors" — ready for model training.
- Independent review / debugging paths. The pre-processing output is human-inspectable (cleaned time-series); the transformation output is model-optimised (encoded, normalised, lagged).
Where time-series-specific feature generation lives¶
Zalando explicitly notes that target lags / transformations, exogenous feature lags / transformations, and temporal features are handled in neither tier — they are handed off to Nixtla's MLForecast (which uses Numba under the hood). The split is a platform-level vocabulary, not a prescription for every feature.
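For intuition about what gets delegated, the three feature families MLForecast handles (target lags, target transformations, temporal features) can be hand-rolled in plain pandas. This is an illustration of the feature types only, not MLForecast's API; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical pre-processed time series for a single article.
ts = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=10, freq="D"),
    "units_sold": [3, 5, 2, 7, 4, 6, 8, 5, 9, 4],
})

# Target lags: yesterday's and last week's sales as features.
ts["lag_1"] = ts["units_sold"].shift(1)
ts["lag_7"] = ts["units_sold"].shift(7)

# Target transformation: rolling mean over the previous 3 days
# (shifted so each row only sees past values).
ts["rolling_mean_3"] = ts["units_sold"].shift(1).rolling(3).mean()

# Temporal features: derived from the timestamp itself.
ts["dayofweek"] = ts["date"].dt.dayofweek
ts["month"] = ts["date"].dt.month
```

MLForecast generates equivalents of these declaratively (and JIT-compiles custom transformations with Numba), which is why neither pipeline tier needs to own them.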
Canonical instance (Zalando ZEOS)¶
Both the demand forecaster and replenishment recommender pipelines implement this split exactly as described:
- Pre-processing tier → PySpark on Databricks transient job clusters writing to Delta Lake.
- Transformation tier → SageMaker Processing Job with Pandas / scikit-learn / NumPy / Numba.
Seen in¶
- sources/2025-06-29-zalando-building-a-dynamic-inventory-optimisation-system-a-deep-dive — canonical source; the post's own summary table codifies the two tiers.
Related¶
- concepts/horizontal-vs-vertical-scalability-for-feature-engineering — scaling strategy per tier.
- systems/apache-spark · systems/databricks — pre-processing substrate.
- systems/sagemaker-processing-job · systems/numba — transformation substrate.
- systems/zeos-demand-forecaster · systems/zeos-replenishment-recommender
- patterns/pyspark-preprocessing-to-python-transformation-split
- companies/zalando