Zalando — Zalando's Machine Learning Platform
Summary
Zalando's ML Platform team publishes an overview of the full ML practitioner stack serving recommender-system, size-recommendation, and demand-forecast use cases for 46 million customers. The post walks from experimentation (Datalab — a hosted JupyterHub + R Studio environment with pre-wired access to S3, BigQuery, and MicroStrategy; systems/databricks for Spark; a GPU HPC cluster for computer-vision and large-model training) through to production pipelines authored in systems/zflow (Zalando's Python DSL that compiles to CloudFormation via AWS CDK, then deploys a Step Functions state machine invoking SageMaker training / batch-transform / endpoints, systems/databricks jobs, and Lambdas). Operational visibility comes from a Backstage-based ML portal that overlays pipeline execution state, per-run metrics, and model cards on top of the Step Functions state machine.
The second half describes the distributed organisational structure: over a hundred product teams own their own ML work while a handful of central teams build and operate the shared tools (Datalab + HPC; zflow + monitoring; ML consultants who pair with product teams; a research team). This is the canonical public disclosure of Zalando's central-platform + per-team ownership model and of the six-step end-to-end flow from pipeline Python script to running SageMaker endpoint.
Key takeaways
- Datalab is a hosted multi-tool notebook environment, not just JupyterHub. Zalando exposes JupyterHub, R Studio, and "other tools they may need" behind one web-browser surface, pre-wired to S3, BigQuery, MicroStrategy and other internal data sources, with web-based shell access. Explicit value prop: "its users don't have to worry about setting up the necessary tools and clients on their own laptops. Instead, they're ready to start experimenting in less than a minute." (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)
- Three experimentation substrates, each for a different workload shape. Datalab for "prototyping and quick feedback"; systems/databricks for big-data Spark workloads ("Apache Spark is much better suited for that purpose"); a GPU HPC cluster for computer-vision and large-model training, accessed via SSH. The post's framing: "Some experiments require extra processing power, e.g. when they involve computer vision or training of large models." (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)
- ML pipelines are infrastructure-as-code that gets deployed through the same CDP as other services. The six-step flow, verbatim condensed: (i) write pipeline in Python with systems/zflow DSL; (ii) zflow invokes AWS CDK to synthesise a CloudFormation template; (iii) commit + push template to git; (iv) Zalando Continuous Delivery Platform (CDP) deploys the template to AWS; (v) CloudFormation materialises a Step Functions state machine + supporting Lambdas + IAM policies; (vi) pipeline executes via scheduler, manual Console click, or API call. (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)
- zflow exists because CloudFormation templates are "verbose and tedious to edit manually." First explicit motivation on the wiki for a Python DSL wrapping CloudFormation. zflow wraps AWS CDK (so Zalando neither hand-writes YAML templates nor imports CDK directly) and adds ML-specific primitives (training, batch-transform, hyperparameter tuning, flow control for conditional + parallel stages) on top. Type hints in zflow give warnings that "go beyond simple syntax checks available for JSON and YAML templates." Canonical quote: "Unfortunately, CF files can become verbose and are tedious to edit manually. We addressed this problem by creating zflow, a Python tool for building machine learning pipelines." zflow has been used to create hundreds of pipelines at Zalando since its 2019 introduction. (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)
- Step Functions has been Zalando's ML orchestration choice since early 2019. "In early 2019 we at Zalando decided to use AWS Step Functions for orchestrating machine learning pipelines." The 2022 post names the fit reason explicitly: Zalando already used AWS as its main cloud provider, and Step Functions' integrations with Lambda, S3, SageMaker, and the Databricks API (the latter via Lambda) covered the full ML workflow surface. This is the architectural decision behind the 2021-02-15 Payments retrospective (sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference). (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)
- A systems/backstage-based ML portal is the pipeline-observability surface. "Pipeline tracking is a part of the internal Zalando developer portal running on top of Backstage, an open-source platform for building such portals." Named capabilities: real-time pipeline-execution view, per-run metric evolution across multiple training-pipeline executions (graphed), and model cards for models produced by the pipelines. This is the first named wiki instance of the developer-portal-as-ML-control-plane pattern. (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)
- Over one hundred product teams own their own ML; a handful of central teams own the tools. Zalando explicitly describes a distributed setup with central support: "most expertise is spread across over a hundred product teams working in their specific business domains" with "dedicated software engineers and applied scientists." Central-team decomposition: Datalab/HPC team operates JupyterHub + HPC cluster; two teams develop zflow and pipeline-monitoring tools; ML consultants pair-program + train + advise product teams; a research team explores state-of-the-art. Plus a data science community for cross-team workshops, reading groups, and an annual internal conference. Canonical wiki reference for the central platform + internal consulting shape. (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)
- Pipeline entry-points at Zalando are always code first. The zflow code listing shows the idiom: declare stages (`databricks_job`, `training_job`, `batch_transform_job`), build a `PipelineBuilder`, chain `.add_stage(...)` calls in order, wrap in a `StackBuilder`, call `stack.generate(output_location=...)`. Zalando does not use the Step Functions visual editor in anger — the canonical authoring surface is a Python script committed to git, not a JSON/YAML template and not the AWS Console. (Source: sources/2022-04-18-zalando-zalandos-machine-learning-platform)
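To make step (v) of the six-step flow concrete, here is a hedged sketch of the kind of Amazon States Language definition such a pipeline materialises. The `.sync` service-integration ARNs are the documented Step Functions patterns for SageMaker; every job name, bucket, account ID, and parameter value is an illustrative placeholder, not Zalando's.

```json
{
  "Comment": "Illustrative only: train a model, then run batch inference",
  "StartAt": "TrainModel",
  "States": {
    "TrainModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
      "Parameters": {
        "TrainingJobName.$": "$.jobName",
        "RoleArn": "arn:aws:iam::123456789012:role/example-pipeline-role",
        "AlgorithmSpecification": {
          "TrainingImage": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/example:latest",
          "TrainingInputMode": "File"
        },
        "OutputDataConfig": { "S3OutputPath": "s3://example-bucket/models" },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m5.xlarge",
          "VolumeSizeInGB": 30
        },
        "StoppingCondition": { "MaxRuntimeInSeconds": 3600 }
      },
      "Next": "BatchTransform"
    },
    "BatchTransform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
      "Parameters": {
        "TransformJobName.$": "$.jobName",
        "ModelName.$": "$.modelName",
        "TransformInput": {
          "DataSource": {
            "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://example-bucket/input" }
          }
        },
        "TransformOutput": { "S3OutputPath": "s3://example-bucket/output" },
        "TransformResources": { "InstanceCount": 1, "InstanceType": "ml.m5.xlarge" }
      },
      "End": true
    }
  }
}
```

The `.sync` suffix makes Step Functions wait for the SageMaker job to finish before moving to the next state, which is what gives a linear pipeline its sequencing.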
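zflow itself is not public. As a rough illustration of the authoring idiom the post's listing shows, the mock below chains stages through a builder and emits a state-machine skeleton. Only the names `PipelineBuilder`, `StackBuilder`, `.add_stage(...)`, `.generate(output_location=...)` and the stage variables come from the post; everything inside the classes is invented for illustration (real zflow synthesises CloudFormation via AWS CDK).

```python
import json

# Mock of the zflow authoring idiom. Class/method names follow the
# post's code listing; the internals are illustrative, not zflow's.

class PipelineBuilder:
    def __init__(self, name):
        self.name = name
        self.stages = []

    def add_stage(self, stage):
        self.stages.append(stage)
        return self  # enables chained .add_stage(...) calls

class StackBuilder:
    def __init__(self, pipeline):
        self.pipeline = pipeline

    def generate(self, output_location):
        # Real zflow emits a CloudFormation template via CDK; here we
        # just write a Step Functions-like skeleton of the stage chain.
        names = [s["name"] for s in self.pipeline.stages]
        states = {}
        for i, stage in enumerate(self.pipeline.stages):
            state = {"Type": "Task", "Resource": stage["resource"]}
            if i + 1 < len(names):
                state["Next"] = names[i + 1]
            else:
                state["End"] = True
            states[stage["name"]] = state
        template = {"StartAt": names[0], "States": states}
        with open(output_location, "w") as f:
            json.dump(template, f, indent=2)
        return template

# Hypothetical stage declarations mirroring the post's listing.
databricks_job = {"name": "PrepareData", "resource": "lambda:invoke-databricks"}
training_job = {"name": "Train", "resource": "sagemaker:createTrainingJob.sync"}
batch_transform_job = {"name": "Transform", "resource": "sagemaker:createTransformJob.sync"}

pipeline = (PipelineBuilder("demand-forecast")
            .add_stage(databricks_job)
            .add_stage(training_job)
            .add_stage(batch_transform_job))
stack = StackBuilder(pipeline)
template = stack.generate(output_location="pipeline.json")
```

The point of the idiom is that stage ordering lives in ordinary Python, so conditionals, loops, and type hints all apply before anything touches CloudFormation.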
Systems named
- systems/jupyterhub — multi-user notebook hub.
- systems/datalab-zalando — Zalando's hosted JupyterHub + R Studio + pre-wired data-source access environment.
- systems/databricks — Spark substrate for big-data ML work.
- systems/zalando-hpc-cluster — GPU high-performance computing cluster for CV + large-model training; SSH access.
- systems/aws-step-functions — Zalando's ML pipeline orchestration primitive since early 2019.
- systems/aws-lambda — glue compute between pipeline steps; also the bridge for invoking Databricks API from Step Functions.
- systems/aws-sagemaker-ai — training jobs, batch-transform jobs.
- systems/aws-sagemaker-endpoint — real-time inference endpoints.
- systems/cloudformation — deployment artefact for the Step Functions state machine + supporting Lambdas + IAM policies.
- systems/aws-cdk — invoked internally by zflow to synthesise the CloudFormation template.
- systems/zflow — Zalando's Python pipeline DSL.
- systems/zalando-ml-portal-backstage — the ML observability web UI built on Backstage.
- systems/backstage — Spotify's open-source developer-portal platform; Zalando uses it as the substrate for the internal ML portal.
- Zalando Continuous Delivery Platform (CDP) — the internal deploy bus named but not expanded; ingests the committed CloudFormation template and deploys it to AWS.
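The post notes that Step Functions reaches the Databricks API through Lambda. A minimal sketch of such a bridge, assuming a Databricks host and token in the Lambda environment and a `job_id` in the Step Functions input — the `run-now` endpoint is the documented Databricks Jobs API 2.1 call, but the handler shape here is an assumption, not Zalando's code:

```python
import json
import os
import urllib.request

def build_run_now_request(host, token, job_id, params=None):
    """Build the POST request for the Databricks Jobs 2.1 run-now endpoint."""
    body = {"job_id": job_id}
    if params:
        body["notebook_params"] = params
    return urllib.request.Request(
        url=f"https://{host}/api/2.1/jobs/run-now",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def handler(event, context):
    # Illustrative Lambda entry point invoked by a Step Functions Task state.
    req = build_run_now_request(
        host=os.environ["DATABRICKS_HOST"],
        token=os.environ["DATABRICKS_TOKEN"],
        job_id=event["job_id"],
        params=event.get("notebook_params"),
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # Databricks returns e.g. {"run_id": ...}
```

A production bridge would also need to poll the run state (or use a second Lambda) so the state machine can wait for job completion, since the API call itself returns immediately.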
Concepts introduced
- concepts/notebook-experimentation-platform — a hosted, pre-wired multi-notebook environment as the ML practitioner's first-30-seconds entry point.
- concepts/cloudformation-verbosity-problem — the operationally-motivated reason teams build higher-level IaC wrappers (CDK, SAM, zflow, etc.) instead of hand-editing CloudFormation YAML.
Patterns introduced
- patterns/python-dsl-wrapping-cloudformation — Python DSL whose `.generate()` call emits a CloudFormation template (via CDK or directly), committed to git and deployed through the normal CD pipeline. zflow is the canonical ML-shaped instance.
- patterns/ml-platform-internal-consulting-team — central ML Platform team ships shared tools + runs an internal consulting arm (pair programming, training, architectural advice) while product teams own their own ML work.
- patterns/web-portal-for-ml-pipeline-observability — a developer-portal-as-control-plane for ML pipelines: execution state, metrics, model cards on top of the underlying workflow engine's primitives.
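The post shows model cards in the portal but not their schema (see Caveats). Purely as a hedged illustration of the kind of record such a portal might render — every field name and value below is an assumption:

```python
from dataclasses import dataclass, field, asdict

# Illustrative model-card record for a pipeline-observability portal.
# Field names and values are invented; the post does not disclose a schema.

@dataclass
class ModelCard:
    model_name: str
    version: str
    pipeline_run_id: str
    training_data: str
    metrics: dict = field(default_factory=dict)
    intended_use: str = ""
    limitations: str = ""

card = ModelCard(
    model_name="size-recommendation",
    version="2022-04-18",
    pipeline_run_id="run-0001",
    training_data="s3://example-bucket/datasets/sizes/2022-04",
    metrics={"auc": 0.91, "coverage": 0.87},
    intended_use="Recommend apparel sizes at checkout",
    limitations="Illustrative placeholder text",
)
record = asdict(card)  # JSON-serialisable form a portal could store and display
```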
Operational numbers
- 46 million customers served by ML use cases (recommenders, size recommendation, demand forecasting).
- Over 100 product teams with their own software engineers and applied scientists using the platform.
- Hundreds of pipelines created with zflow since its 2019 introduction.
- < 1 minute to begin experimenting in Datalab: "its users … are ready to start experimenting in less than a minute."
- Annual internal conference + reading groups + expert talks + workshops across the Zalando data science community.
Caveats
- The post is a platform-team overview, not an architecture deep dive. Internals of zflow (IR, caching semantics, retry policies, model registry) are not disclosed — stubs exist where named but the post does not crack them open.
- Datalab's concrete hosting substrate, authentication, and multi-tenancy model are not described; only the user-facing capability and the pre-wired data-source access are named.
- The "other tools" available in Datalab beyond JupyterHub and R Studio are not enumerated.
- The HPC cluster's job scheduler (Slurm? LSF? custom?) is not named — SSH access is the only named interface.
- CDP internals are not described; it is named as the pipe that moves the committed CloudFormation template to AWS and left at that.
- Concrete model cards content / schema and the metric graphs implementation in the Backstage portal are not shown.
Source
- Original: https://engineering.zalando.com/posts/2022/04/zalando-machine-learning-platform.html
- Raw markdown: raw/zalando/2022-04-18-zalandos-machine-learning-platform-05d8bea3.md
Related
- companies/zalando — company anchor; axis 11 (ML Platform) opens with this post.
- sources/2021-02-15-zalando-a-machine-learning-pipeline-with-real-time-inference — Zalando Payments' real-time-inference retrospective; a specific-workload instance of the platform described here.
- systems/zflow · systems/datalab-zalando · systems/zalando-hpc-cluster · systems/zalando-ml-portal-backstage · systems/backstage
- patterns/python-dsl-wrapping-cloudformation · patterns/ml-platform-internal-consulting-team · patterns/web-portal-for-ml-pipeline-observability · patterns/managed-services-over-custom-ml-platform
- concepts/notebook-experimentation-platform · concepts/cloudformation-verbosity-problem