Zalando HPC Cluster¶
Definition¶
The Zalando HPC cluster is an internal high-performance computing cluster equipped with powerful GPU nodes, available to Zalando applied scientists for experiments that exceed the capacity of Datalab notebooks or Spark batch jobs on systems/databricks, specifically computer vision and the training of large models. Access is via plain SSH, the only interface named in the source.
Canonical disclosure¶
From the 2022-04-18 ML Platform overview (sources/2022-04-18-zalando-zalandos-machine-learning-platform):
"Some experiments require extra processing power, e.g. when they involve computer vision or training of large models. For these purposes, our applied scientists have access to a high-performance computing cluster (HPC) equipped with powerful GPU nodes. Using the HPC is as easy as connecting to it via SSH."
Role in the ML platform¶
- Third experimentation substrate. Alongside Datalab (prototyping, quick feedback) and systems/databricks (big-data Spark), the HPC cluster is Zalando's GPU-bound experimentation substrate, covering the workload class that neither notebooks nor Spark handles well.
- Named workload classes: computer vision; training of "large models." The 2022 post does not specify the scale ceiling.
- Interface: SSH. The post frames SSH accessibility as a feature: deliberately low-friction, in contrast with more complex cluster submission systems.
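Since SSH is the only disclosed interface, day-to-day access presumably looks like any SSH-based cluster. The sketch below is illustrative only: the hostname, username, and key path are placeholders, none of which Zalando has published.

```
# Hypothetical ~/.ssh/config entry; "hpc-login.zalando.internal" and "jdoe"
# are invented placeholders, not disclosed values.
Host zalando-hpc
    HostName hpc-login.zalando.internal
    User jdoe
    IdentityFile ~/.ssh/id_ed25519

# With the entry above, connecting reduces to:
#   ssh zalando-hpc
```

This matches the post's "as easy as connecting to it via SSH" framing: no scheduler client or submission wrapper is mentioned as a prerequisite.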
Wiki positioning¶
- First named wiki instance of a Tier-2-retailer internal HPC cluster used as a pre-managed-services experimentation substrate for CV / large-model training.
- Operator of the cluster: per the 2022 post, "a dedicated team provides support and improvements to our JupyterHub installation and the HPC cluster," i.e. a single combined central team within Zalando's ML Platform.
- Scheduler, queue model, quota policy, and the actual GPU generation are not disclosed. Stub page; expand when Zalando publishes more detail.
Comparison to cloud-managed alternatives¶
The 2022 post notably places the in-house HPC cluster alongside (not instead of) AWS-managed options like SageMaker training jobs. Zalando runs both: the HPC cluster for interactive / research-heavy experimentation, and SageMaker for production pipeline training jobs (via systems/zflow). This is a pragmatic two-track setup rather than a cloud-only or on-prem-only stance.
Seen in¶
- sources/2022-04-18-zalando-zalandos-machine-learning-platform — canonical disclosure. SSH-accessed GPU cluster for CV and large-model training. Operated jointly with JupyterHub by a single dedicated ML Platform central team.
Related¶
- systems/datalab-zalando — complementary interactive notebook substrate, co-operated by the same central team.
- systems/databricks — complementary Spark substrate for big-data (non-GPU) workloads.
- systems/aws-sagemaker-ai — the managed-cloud training substrate used by production pipelines via zflow; the HPC cluster is its counterpart at the experimentation stage.
- systems/zflow — authors do not use the HPC cluster from zflow; zflow targets SageMaker / Databricks / Lambda for production.
- companies/zalando