PATTERN
PySpark preprocessing to Python transformation split¶
Problem¶
Feature-engineering pipelines need to do two different things:
- Heavy-volume SQL-like reshaping — joins, filters, aggregations over raw event data across millions of entities.
- Library-ecosystem-heavy transformation — encoding, normalisation, statistical feature generation using Pandas / scikit-learn / NumPy / Numba idioms.
These two workloads have different scaling characteristics:
- The first scales horizontally — add PySpark workers.
- The second scales vertically — a bigger single instance with vectorisation / JIT.
Forcing both into one tier (all PySpark, or all single-machine Python) leaves throughput on the table and makes the pipeline harder to debug.
Pattern¶
Explicitly split feature engineering into two tiers with different compute substrates:
- Horizontal tier — PySpark on a Spark platform.
  - Databricks / EMR / self-managed Spark.
  - Writes pre-processed output to a columnar format on object storage (Delta Lake / Parquet / Iceberg).
  - Operations: joins, filters, aggregations, window functions.
  - Output: clean per-entity time-series or feature vectors.
- Vertical tier — single-process Python on a managed container.
  - SageMaker Processing Job / AWS Batch / ECS task / local process.
  - Libraries: Pandas, scikit-learn, NumPy, Numba, SciPy.
  - Operations: encoding, normalisation, probabilistic feature generation, model-specific transformations.
The hand-off between tiers is clean-schema output on S3 (or equivalent) — no shared cluster, no live data channel, no tight coupling.
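A minimal sketch of the split, with the horizontal tier shown as a commented PySpark fragment (it runs on the cluster, not here) and the hand-off simulated by an in-memory DataFrame. All names, paths, and the schema are illustrative, not from any specific system:

```python
import pandas as pd

# --- Horizontal tier (PySpark, sketched; executes on the Spark cluster) ---
# spark.read.table("raw_events") \
#     .filter(F.col("event_type") == "sale") \
#     .groupBy("entity_id", F.window("ts", "1 day")) \
#     .agg(F.sum("quantity").alias("daily_sales")) \
#     .write.format("delta").save("s3://bucket/preprocessed/")

# --- Hand-off: clean per-entity time-series with a fixed schema ---
# (in production this would be read back from Parquet/Delta on object storage)
preprocessed = pd.DataFrame({
    "entity_id": [1, 1, 2, 2],
    "daily_sales": [10.0, 12.0, 3.0, 5.0],
})

# --- Vertical tier (single-process Python; scales with instance size) ---
def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Per-entity z-score normalisation, a typical vertical-tier step."""
    out = df.copy()
    grp = out.groupby("entity_id")["daily_sales"]
    out["sales_z"] = (out["daily_sales"] - grp.transform("mean")) / grp.transform("std")
    return out

features = transform(preprocessed)
print(features["sales_z"].round(3).tolist())  # → [-0.707, 0.707, -0.707, 0.707]
```

The key property is that the vertical tier never talks to the cluster — it only needs the schema of the hand-off dataset, so either tier can be rerun, tested, or rescaled independently.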
Why this is a distinct pattern¶
Unlike the "always use Spark" and "always use Python" approaches, this pattern is a deliberate architectural acknowledgement that the two tiers have irreconcilable scaling profiles — scikit-learn / NumPy / Numba don't natively distribute, and forcing them to is a fool's errand. Pick the right tool at each stage.
Canonical instance (Zalando ZEOS)¶
Zalando's ZEOS inventory-optimisation system uses this exact split for both its demand forecaster and its inventory optimiser. Architectural justification verbatim:
"PySpark enables horizontal scalability in the number of worker nodes as data volume grows." — pre-processing tier.
"Dependent libraries lack native distribution support, so we rely on vertical scalability to handle increasing data volumes." — transformation tier.
See concepts/data-preprocessing-vs-data-transformation-split for Zalando's full architectural vocabulary around this split.
Seen in¶
Related¶
- concepts/data-preprocessing-vs-data-transformation-split — concept page; Zalando's table of the two tiers.
- concepts/horizontal-vs-vertical-scalability-for-feature-engineering
- systems/apache-spark · systems/databricks · systems/sagemaker-processing-job · systems/numba
- companies/zalando