
PATTERN

PySpark preprocessing to Python transformation split

Problem

Feature-engineering pipelines need to do two different things:

  1. Heavy-volume SQL-like reshaping — joins, filters, aggregations over raw event data across millions of entities.
  2. Library-ecosystem-heavy transformation — encoding, normalisation, statistical feature generation using Pandas / scikit-learn / NumPy / Numba idioms.

These two workloads have different scaling characteristics:

  • (1) scales horizontally — add PySpark workers.
  • (2) scales vertically — a bigger single instance with vectorisation / JIT.
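
The vertical-scaling claim in (2) can be seen in miniature: the same reduction written as an interpreted Python loop versus a single vectorised NumPy pass (the array is a made-up stand-in for real feature data):

```python
import numpy as np

# Stand-in data: per-entity quantities.
quantities = np.arange(1_000_000, dtype=np.int64)

# Interpreted Python loop: one bytecode round-trip per element (shown on a
# slice; the full-array loop is what makes naive single-machine Python slow).
python_total = 0
for q in quantities[:1000]:
    python_total += int(q)

# Vectorised NumPy: one C-level pass over the buffer — exactly the kind of
# work a bigger single instance (vectorisation / JIT) speeds up directly.
numpy_total = int(quantities[:1000].sum())
```

Neither form distributes across a cluster on its own; both get faster by giving one process more memory and faster cores.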

Forcing both into one tier (all PySpark, or all single-machine Python) leaves throughput on the table and makes the pipeline harder to debug.

Pattern

Explicitly split feature engineering into two tiers with different compute substrates:

  1. Horizontal tier — PySpark on a Spark platform.
     • Runs on Databricks, EMR, or self-managed Spark.
     • Operations: joins, filters, aggregations, window functions.
     • Writes pre-processed output to a columnar format on object storage (Delta Lake / Parquet / Iceberg).
     • Output: clean per-entity time-series or feature vectors.

  2. Vertical tier — single-process Python on a managed container.
     • Runs as a SageMaker Processing Job, AWS Batch job, ECS task, or local process.
     • Libraries: Pandas, scikit-learn, NumPy, Numba, SciPy.
     • Operations: encoding, normalisation, probabilistic feature generation, model-specific transformations.

The hand-off between tiers is clean-schema output on S3 (or equivalent) — no shared cluster, no live data channel, no tight coupling.
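
A sketch of the vertical tier consuming that hand-off — here the in-memory frame stands in for the columnar files, and the path in the comment is hypothetical; a real job would call `pd.read_parquet` on the object-storage location:

```python
# Vertical tier sketch: single-process transformation with Pandas + scikit-learn.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for: features = pd.read_parquet("s3://my-bucket/features/")
features = pd.DataFrame({
    "entity_id": ["e1", "e1", "e2"],
    "daily_quantity": [3.0, 5.0, 7.0],
})

# Library-ecosystem-heavy work: normalisation and statistical features,
# vectorised on a single (vertically scaled) instance.
scaler = StandardScaler()
features["quantity_z"] = scaler.fit_transform(features[["daily_quantity"]])
features["log_quantity"] = np.log1p(features["daily_quantity"])
```

Because the tiers share nothing but files, either side can be re-run, re-scaled, or debugged locally without touching the other.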

Why this is a distinct pattern

Unlike the "always use Spark" or "always use Python" camps, this pattern is a deliberate architectural acknowledgement that the two tiers have irreconcilable scaling profiles — scikit-learn / NumPy / Numba don't natively distribute, and forcing them to is a fool's errand. Pick the right tool at each stage.

Canonical instance (Zalando ZEOS)

Zalando's ZEOS inventory-optimisation system uses this exact split for both its demand forecaster and its inventory optimiser. Architectural justification verbatim:

"PySpark enables horizontal scalability in the number of worker nodes as data volume grows." — pre-processing tier.

"Dependent libraries lack native distribution support, so we rely on vertical scalability to handle increasing data volumes." — transformation tier.

See concepts/data-preprocessing-vs-data-transformation-split for Zalando's full architectural vocabulary around this split.
