
Horizontal vs vertical scalability for feature engineering

Definition

Horizontal scalability adds more worker nodes to spread load; each worker handles a share of the data. Canonical technologies: Spark, distributed SQL engines, MapReduce.

Vertical scalability uses a bigger single machine (more CPU, more RAM, faster disk); all processing happens in-process. Canonical technologies: Pandas, scikit-learn, NumPy, Numba-JITed Python.

In feature-engineering pipelines the two are not interchangeable — the choice is forced by the library ecosystem of the code you need to run.

When horizontal wins

  • Input volume dominates. Raw event data, multi-year sales history across millions of entities, full availability logs — horizontal processing (PySpark / Spark SQL) lets you add workers to keep wall-clock under control as volume grows.
  • Operations are SQL-expressible. Joins, filters, aggregations over cleanly-typed tabular data are exactly what distributed query engines are designed for.
  • Intermediate output is small and clean. Pre-processed time-series per entity is a much smaller artefact than raw event logs — the tail stages of the pipeline can then use a different scaling strategy.
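To make "SQL-expressible" concrete, here is a minimal sketch of the horizontal tier's reduction step. It uses Python's built-in sqlite3 as a stand-in for a distributed SQL engine, since the query shape (group-by aggregation over typed tabular data) is exactly what runs unchanged, but in parallel, on Spark SQL; the table and column names are illustrative, not from the source.

```python
import sqlite3

# In-memory SQLite stands in for a distributed SQL engine; the same
# query shape runs across many workers on Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (entity_id TEXT, day TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("sku-1", "2024-01-01", 3),
        ("sku-1", "2024-01-01", 2),
        ("sku-1", "2024-01-02", 5),
        ("sku-2", "2024-01-01", 7),
    ],
)

# The horizontal tier reduces raw events to one clean row per entity/day --
# the small intermediate artefact the tail stages consume.
rows = conn.execute(
    """
    SELECT entity_id, day, SUM(units) AS units
    FROM sales
    GROUP BY entity_id, day
    ORDER BY entity_id, day
    """
).fetchall()

print(rows)
# [('sku-1', '2024-01-01', 5), ('sku-1', '2024-01-02', 5), ('sku-2', '2024-01-01', 7)]
```

Because the aggregation is associative, a distributed engine can compute partial sums per worker and merge them, which is why this class of operation scales out so cleanly.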

When vertical wins

  • Libraries lack native distribution support. Pandas, scikit-learn, NumPy, SciPy, Numba — the bread and butter of feature engineering — are designed for single-process execution. Ad-hoc distribution via Joblib, Ray, or Dask adds complexity and often delivers lower throughput per dollar than simply renting a bigger EC2 instance.
  • Input volume is already pre-processed / small. Once pre-processing has reduced raw events to per-entity time-series, the transformation workload fits vertically.
  • Vectorisation wins. JIT compilers like Numba can produce 2–3× speedups over NumPy on a single box, offsetting much of the cost premium of the bigger machine.
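As a small illustration of single-box vectorisation, the sketch below computes a trailing-window mean feature over a pre-aggregated per-entity series in plain NumPy (the data values are made up). An explicit loop over the same computation is what Numba's @njit would JIT for a further speedup, as the bullet above notes.

```python
import numpy as np

# Pre-aggregated daily demand for one entity -- the small, clean
# artefact the horizontal tier emits. Values are illustrative.
demand = np.array([3.0, 5.0, 4.0, 6.0, 8.0, 7.0])

# Vectorised trailing 3-day mean via a cumulative sum: one pass,
# no Python loop.
window = 3
csum = np.concatenate(([0.0], np.cumsum(demand)))
rolling_mean = (csum[window:] - csum[:-window]) / window

print(rolling_mean)  # [4. 5. 6. 7.]
```

The cumulative-sum trick turns an O(n·w) windowed loop into two O(n) array passes, which is the kind of win that keeps this workload comfortably inside one process.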

Architectural implication

Feature-engineering pipelines are typically two-tiered:

  1. Horizontal tier — deal with the raw volume, emit clean per-entity time-series.
  2. Vertical tier — run the library-ecosystem-heavy transformations on pre-aggregated data.
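The hand-off between the two tiers can be sketched as follows, assuming Pandas on the vertical side; in practice the per-entity time-series would arrive from the horizontal tier as Parquet on object storage rather than be inlined, and the lag feature is just one example transformation.

```python
import pandas as pd

# Hypothetical hand-off: the horizontal tier has already emitted small,
# clean per-entity time-series (inlined here for the sketch).
ts = pd.DataFrame(
    {
        "entity_id": ["sku-1"] * 3 + ["sku-2"] * 3,
        "units": [3, 5, 4, 7, 6, 8],
    }
)

# Vertical tier: library-ecosystem-heavy transformations run in-process
# on the pre-aggregated data, e.g. a per-entity lag feature.
ts["units_lag1"] = ts.groupby("entity_id")["units"].shift(1)
print(ts)
```

Because the vertical tier only ever sees the reduced artefact, its memory footprint is bounded by entities × time steps rather than by raw event volume, which is what makes the single-machine strategy viable.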

See concepts/data-preprocessing-vs-data-transformation-split for the architectural vocabulary.

Canonical instance (Zalando ZEOS)

Zalando's ZEOS inventory-optimisation system uses:

  • Horizontal tier — PySpark on Databricks transient job clusters (scales out as SKU count grows).
  • Vertical tier — SageMaker Processing Job with Pandas / scikit-learn / NumPy / Numba (scales up as transformation complexity grows).

Verbatim Zalando:

"PySpark enables horizontal scalability in the number of worker nodes as data volume grows."

"Dependent libraries lack native distribution support, so we rely on vertical scalability to handle increasing data volumes."
