
PATTERN

PySpark preprocessing to Python transformation split

Problem

Feature-engineering pipelines need to do two different things:

  1. Heavy-volume SQL-like reshaping — joins, filters, aggregations over raw event data across millions of entities.
  2. Library-ecosystem-heavy transformation — encoding, normalisation, statistical feature generation using Pandas / scikit-learn / NumPy / Numba idioms.

These two workloads have different scaling characteristics:

  • (1) scales horizontally — add PySpark workers.
  • (2) scales vertically — a bigger single instance with vectorisation / JIT.
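
The vertical-scaling claim in (2) can be seen in miniature: the same reduction written as an interpreted Python loop versus a single vectorised NumPy pass (the array is a made-up stand-in for real feature data):

```python
import numpy as np

# Stand-in data: per-entity quantities.
quantities = np.arange(1_000_000, dtype=np.int64)

# Interpreted Python loop: one bytecode round-trip per element (shown on a
# slice; the full-array loop is what makes naive single-machine Python slow).
python_total = 0
for q in quantities[:1000]:
    python_total += int(q)

# Vectorised NumPy: one C-level pass over the buffer — exactly the kind of
# work a bigger single instance (vectorisation / JIT) speeds up directly.
numpy_total = int(quantities[:1000].sum())
```

Neither form distributes across a cluster on its own; both get faster by giving one process more memory and faster cores.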

Forcing both into one tier (all PySpark, or all single-machine Python) leaves throughput on the table and makes the pipeline harder to debug.

Pattern

Explicitly split feature engineering into two tiers with different compute substrates:

  1. Horizontal tier — PySpark on a Spark platform.
     • Runs on Databricks, EMR, or self-managed Spark.
     • Operations: joins, filters, aggregations, window functions.
     • Writes pre-processed output to a columnar format on object storage (Delta Lake / Parquet / Iceberg).
     • Output: clean per-entity time-series or feature vectors.

  2. Vertical tier — single-process Python on a managed container.
     • Runs as a SageMaker Processing Job, AWS Batch job, ECS task, or local process.
     • Libraries: Pandas, scikit-learn, NumPy, Numba, SciPy.
     • Operations: encoding, normalisation, probabilistic feature generation, model-specific transformations.

The hand-off between tiers is clean-schema output on S3 (or equivalent) — no shared cluster, no live data channel, no tight coupling.
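
A sketch of the vertical tier consuming that hand-off — here the in-memory frame stands in for the columnar files, and the path in the comment is hypothetical; a real job would call `pd.read_parquet` on the object-storage location:

```python
# Vertical tier sketch: single-process transformation with Pandas + scikit-learn.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for: features = pd.read_parquet("s3://my-bucket/features/")
features = pd.DataFrame({
    "entity_id": ["e1", "e1", "e2"],
    "daily_quantity": [3.0, 5.0, 7.0],
})

# Library-ecosystem-heavy work: normalisation and statistical features,
# vectorised on a single (vertically scaled) instance.
scaler = StandardScaler()
features["quantity_z"] = scaler.fit_transform(features[["daily_quantity"]])
features["log_quantity"] = np.log1p(features["daily_quantity"])
```

Because the tiers share nothing but files, either side can be re-run, re-scaled, or debugged locally without touching the other.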

Why this is a distinct pattern

Unlike the "always use Spark" or "always use Python" camps, this pattern is a deliberate architectural acknowledgement that the two tiers have irreconcilable scaling profiles — scikit-learn / NumPy / Numba don't natively distribute, and forcing them to is a fool's errand. Pick the right tool at each stage.

Canonical instance (Zalando ZEOS)

Zalando's ZEOS inventory-optimisation system uses this exact split for both its demand forecaster and its inventory optimiser. Architectural justification verbatim:

"PySpark enables horizontal scalability in the number of worker nodes as data volume grows." — pre-processing tier.

"Dependent libraries lack native distribution support, so we rely on vertical scalability to handle increasing data volumes." — transformation tier.

See concepts/data-preprocessing-vs-data-transformation-split for Zalando's full architectural vocabulary around this split.
