Zalando HPC Cluster¶
Definition¶
The Zalando HPC cluster is an internal high-performance computing cluster equipped with powerful GPU nodes, available to Zalando applied scientists for experiments that exceed the capacity of Datalab notebooks or Spark batch jobs on systems/databricks, specifically computer vision and the training of large models. Access is via plain SSH, the only interface named in the source.
Canonical disclosure¶
From the 2022-04-18 ML Platform overview (sources/2022-04-18-zalando-zalandos-machine-learning-platform):
"Some experiments require extra processing power, e.g. when they involve computer vision or training of large models. For these purposes, our applied scientists have access to a high-performance computing cluster (HPC) equipped with powerful GPU nodes. Using the HPC is as easy as connecting to it via SSH."
Role in the ML platform¶
- Third experimentation substrate. Alongside Datalab (prototyping, quick feedback) and systems/databricks (big-data Spark), the HPC cluster is Zalando's GPU-bound experimentation substrate, covering the workload class that neither notebooks nor Spark handles well.
- Named workload classes: computer vision; training of "large models." The 2022 post does not specify the scale ceiling.
- Interface: SSH. The post frames SSH accessibility as a feature: deliberately low-friction, in contrast with more complex cluster submission systems.
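Since SSH is the only disclosed interface, day-to-day access presumably looks like any SSH-based cluster. The sketch below is illustrative only: the hostname, username, and key path are placeholders, none of which Zalando has published.

```
# Hypothetical ~/.ssh/config entry; "hpc-login.zalando.internal" and "jdoe"
# are invented placeholders, not disclosed values.
Host zalando-hpc
    HostName hpc-login.zalando.internal
    User jdoe
    IdentityFile ~/.ssh/id_ed25519

# With the entry above, connecting reduces to:
#   ssh zalando-hpc
```

This matches the post's "as easy as connecting to it via SSH" framing: no scheduler client or submission wrapper is mentioned as a prerequisite.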
Wiki positioning¶
- First named wiki instance of a Tier-2-retailer internal HPC cluster used as a pre-managed-services experimentation substrate for CV / large-model training.
- Operator of the cluster: per the 2022 post, "a dedicated team provides support and improvements to our JupyterHub installation and the HPC cluster," i.e. a single combined central team within Zalando's ML Platform.
- Scheduler, queue model, quota policy, and the actual GPU generation are not disclosed. Stub page; expand when Zalando publishes more detail.
Comparison to cloud-managed alternatives¶
The 2022 post notably places the in-house HPC cluster alongside (not instead of) AWS-managed options like SageMaker training jobs. Zalando runs both: the HPC cluster for interactive / research-heavy experimentation, and SageMaker for production pipeline training jobs (via systems/zflow). This is a pragmatic two-track setup rather than a cloud-only or on-prem-only stance.
Seen in¶
- sources/2022-04-18-zalando-zalandos-machine-learning-platform — canonical disclosure. SSH-accessed GPU cluster for CV and large-model training. Operated jointly with JupyterHub by a single dedicated ML Platform central team.
Related¶
- systems/datalab-zalando — complementary interactive notebook substrate, co-operated by the same central team.
- systems/databricks — complementary Spark substrate for big-data (non-GPU) workloads.
- systems/aws-sagemaker-ai — the managed-cloud training substrate used by production pipelines via zflow; the HPC cluster is its counterpart at the experimentation stage.
- systems/zflow — authors do not use the HPC cluster from zflow; zflow targets SageMaker / Databricks / Lambda for production.
- companies/zalando