
SYSTEM Cited by 2 sources

zflow

zflow is Zalando's internal Python workflow library for machine-learning pipelines. Built by the Zalando Machine Learning Platform team, it is a thin orchestration layer on top of AWS Step Functions, AWS Lambda, Amazon SageMaker, and Databricks Spark. Data scientists and engineers declare ML workflows in Python, and zflow translates them into Step Functions state machines that invoke SageMaker training jobs, SageMaker batch transforms, SageMaker endpoints, Databricks jobs, and Lambdas as workflow steps.

On this wiki, its first publicly named role is as the authoring layer for the 2020–2021 Zalando Payments migration of a risk-scoring pipeline away from a legacy Scala + Spark monolith.

Role in the ML platform

  • Workflow authoring — Python-native DSL; declares steps and dependencies; compiles down to an AWS Step Functions state machine.
  • Step heterogeneity — one zflow workflow can mix SageMaker training jobs, SageMaker batch-transform jobs, SageMaker endpoint deployments, and Databricks Spark jobs in a single orchestration.
  • Scheduling — users "easily orchestrate and schedule ML workflows", per the Zalando Payments post.
  • Abstraction goal — "we steer away from implementing the whole system from scratch"; consumers use zflow instead of wiring Step Functions, SageMaker SDK, and Databricks APIs themselves.
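
The step heterogeneity described above can be pictured as a lookup from step kind to a Step Functions task resource. The sketch below is illustrative only: zflow's real mapping is not public, the `STEP_RESOURCES` table and `task_state` helper are hypothetical names, and routing Databricks jobs through a Lambda is an assumption. The SageMaker ARNs are the standard Step Functions service-integration identifiers.

```python
# Illustrative only: zflow's real step-to-resource mapping is not public.
STEP_RESOURCES = {
    "training_job": "arn:aws:states:::sagemaker:createTrainingJob.sync",
    "batch_transform_job": "arn:aws:states:::sagemaker:createTransformJob.sync",
    "endpoint_deployment": "arn:aws:states:::sagemaker:createEndpoint",
    # Assumption: a Lambda submits the Databricks job on the workflow's behalf.
    "databricks_job": "arn:aws:states:::lambda:invoke",
}


def task_state(kind: str, parameters: dict) -> dict:
    """Build one Amazon States Language Task state for a given step kind."""
    if kind not in STEP_RESOURCES:
        raise ValueError(f"unsupported step kind: {kind}")
    return {
        "Type": "Task",
        "Resource": STEP_RESOURCES[kind],
        "Parameters": parameters,
    }
```

A table like this is one way a single workflow can mix SageMaker and Databricks steps: every step becomes the same kind of Task state and only the resource differs.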

How it compiles (from the 2022 ML Platform post)

zflow is not itself a runtime; it is a Python DSL that generates a CloudFormation template via AWS CDK and hands it to Zalando's standard continuous delivery pipeline:

  1. Author — user writes a Python script using zflow primitives (databricks_job, training_job, batch_transform_job, hyperparameter tuning, flow-control for conditional + parallel stages). Type hints catch errors the YAML editor never could.
  2. Compile — running the script invokes zflow's .generate() call, which internally calls AWS CDK to synthesise a CloudFormation template file.
  3. Commit — the generated template is committed + pushed to a git repository alongside the Python source.
  4. Deploy — Zalando Continuous Delivery Platform (CDP) detects the change and applies the template to AWS CloudFormation.
  5. Materialise — CloudFormation provisions the Step Functions state machine, supporting Lambdas, IAM roles, and resource policies.
  6. Run — pipeline executes on a schedule, via manual Console click, or via API call.
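
The Compile and Materialise steps above can be sketched in plain Python. This is a minimal illustration, not zflow's implementation: zflow uses AWS CDK to synthesise the template, whereas this sketch builds the dictionaries by hand, and the function names `compile_state_machine` and `generate_template` are hypothetical. The Amazon States Language keys and the `AWS::StepFunctions::StateMachine` properties are the standard ones.

```python
import json


def compile_state_machine(stages: list[tuple[str, str]]) -> dict:
    """Chain (name, resource_arn) stages into an Amazon States Language definition."""
    if not stages:
        raise ValueError("a pipeline needs at least one stage")
    states = {}
    for index, (name, resource) in enumerate(stages):
        state = {"Type": "Task", "Resource": resource}
        if index + 1 < len(stages):
            state["Next"] = stages[index + 1][0]  # hand over to the following stage
        else:
            state["End"] = True  # the last stage terminates the execution
        states[name] = state
    return {"StartAt": stages[0][0], "States": states}


def generate_template(pipeline_name: str, stages: list[tuple[str, str]], role_arn: str) -> str:
    """Wrap the definition in a CloudFormation template (JSON flavour)."""
    definition = compile_state_machine(stages)
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "StateMachine": {
                "Type": "AWS::StepFunctions::StateMachine",
                "Properties": {
                    "StateMachineName": pipeline_name,
                    "RoleArn": role_arn,
                    "DefinitionString": json.dumps(definition),
                },
            }
        },
    }
    return json.dumps(template, indent=2)
```

The returned string is the kind of artefact that would be committed to git in step 3. A real template would also carry the supporting Lambdas, IAM roles, and resource policies, which this sketch omits.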

Canonical quote from the 2022 post (sources/2022-04-18-zalando-zalandos-machine-learning-platform):

"When a pipeline script is executed, zflow uses AWS CDK to generate a CloudFormation template file. The file contains all the information needed to create the necessary AWS resources. All that is needed now is to commit and push the generated template to the git repository and let Zalando Continuous Delivery Platform (CDP) deploy it to AWS."

Worked example (verbatim from the post)

data_processing = databricks_job("data_processing_job")
training = training_job("training_job")
batch_inference = batch_transform_job("batch_transform_job")

pipeline = PipelineBuilder("example-pipeline")
pipeline \
    .add_stage(data_processing) \
    .add_stage(training) \
    .add_stage(batch_inference)

stack = StackBuilder("example-stack")
stack.add_pipeline(pipeline)

stack.generate(output_location="zflow_pipeline.yaml")
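
The chained `.add_stage(...)` calls in the example follow the fluent-builder idiom. The stand-in below shows how such chaining can work. It is a hypothetical class that borrows the `PipelineBuilder` name for readability and is not zflow's real implementation, whose internals are not disclosed.

```python
class PipelineBuilder:
    """Hypothetical stand-in; not zflow's real class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stages: list[str] = []

    def add_stage(self, stage: str) -> "PipelineBuilder":
        self.stages.append(stage)  # mutate in place, preserving insertion order
        return self  # returning self is what allows call chaining


pipeline = PipelineBuilder("example-pipeline")
pipeline \
    .add_stage("data_processing_job") \
    .add_stage("training_job") \
    .add_stage("batch_transform_job")
```

The post's example discards the value returned by the chain, which suggests that the real builder also mutates itself in place. The stand-in does both, so either calling style works.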

Scale (2022 disclosure)

  • Since its early-2019 introduction, zflow has been used to create hundreds of pipelines at Zalando.
  • Operated by two central teams within ML Platform ("Two teams actively develop zflow and monitoring tools for pipelines").
  • Consumed by 100+ product teams org-wide.

Canonical disclosure

From Zalando Payments' 2021 retrospective:

"At Zalando, we use a tool provided by Zalando's ML Platform team called zflow. It is essentially a Python library built on top of AWS Step Functions, AWS Lambdas, Amazon SageMaker, and Databricks Spark, that allows users to easily orchestrate and schedule ML workflows."

The concrete zflow-orchestrated workflow disclosed in that post:

  1. Training data preprocessing — Databricks cluster + scikit-learn batch-transform job on SageMaker.
  2. Training — SageMaker training job.
  3. Batch predictions — SageMaker batch-transform job.
  4. Performance report — Databricks job producing a PDF.
  5. Endpoint deployment — SageMaker real-time endpoint backed by an inference pipeline model (scikit-learn preprocessing container + main-model container).
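
Stage 5 relies on a SageMaker inference pipeline model, which is a single model made of containers invoked in order. A minimal sketch of the request shape for SageMaker's CreateModel API follows. The helper name `inference_pipeline_model_request` and all example values are hypothetical, and how zflow builds this request is not disclosed.

```python
def inference_pipeline_model_request(
    model_name: str,
    role_arn: str,
    preprocessing: dict,
    main_model: dict,
) -> dict:
    """Build keyword arguments for CreateModel with two chained containers."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        # Containers run in list order: the preprocessing container's
        # response becomes the main-model container's request.
        "Containers": [preprocessing, main_model],
    }


request = inference_pipeline_model_request(
    model_name="example-risk-model",  # placeholder values throughout
    role_arn="arn:aws:iam::123456789012:role/example-sagemaker-role",
    preprocessing={"Image": "example-sklearn-image", "ModelDataUrl": "s3://example-bucket/preprocessing.tar.gz"},
    main_model={"Image": "example-model-image", "ModelDataUrl": "s3://example-bucket/model.tar.gz"},
)
```

With boto3, a dictionary like this could be passed as `sagemaker_client.create_model(**request)`. Keeping preprocessing in its own container lets the same scikit-learn transformation serve both the batch stage and the real-time endpoint.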

Wiki positioning

  • Open stub because the post discloses that zflow exists and what it wraps, but not its internals (IR, caching semantics, retry policies, model registry).
  • zflow is the Zalando-specific instance of the broader pattern patterns/managed-services-over-custom-ml-platform — instead of each team hand-wiring Step Functions + SageMaker, the ML Platform team productises the glue as a Python library.
  • Positions Zalando's ML Platform as an internal consulting organisation (parallel to many large-enterprise patterns); collaborations run via Statements of Work (see the Payments team's 9-month engagement).

Seen in

  • Zalando Payments' 2021 retrospective — canonical first public disclosure. Authored by Zalando Payments + ML Platform teams. zflow orchestrates the five-stage workflow replacing the legacy Scala + Spark fraud-detection monolith. External reference: an ML Platform team member's LinkedIn post "Building ML workflows at Zalando: zflow" is cited by the engineering post but is not a Zalando-blog artefact.
  • sources/2022-04-18-zalando-zalandos-machine-learning-platform — platform-overview canonical disclosure. Names the architectural decision point ("In early 2019 we at Zalando decided to use AWS Step Functions for orchestrating machine learning pipelines"); discloses zflow's compilation target — a Python script that invokes AWS CDK to synthesise a CloudFormation template, which then deploys through Zalando CDP to AWS; names hundreds of pipelines created with zflow since 2019; shows the canonical Python idiom (databricks_job / training_job / batch_transform_job stages chained via PipelineBuilder, wrapped in StackBuilder, committed YAML from .generate()); explicitly motivates the DSL's existence from the CloudFormation verbosity problem. Pairs with the Backstage-based ML portal for pipeline observability — the portal reads the runtime state of the Step Functions state machine that zflow compiled to.
  • sources/2025-06-29-zalando-building-a-dynamic-inventory-optimisation-system-a-deep-dive — third publicly-named zflow workload: ZEOS inventory-optimisation system. Both constituent pipelines (Demand Forecaster + Replenishment Recommender) are "implemented using zFlow, an internal machine learning ecosystem that offers seamless integration and abstractions for AWS and Databricks infrastructure. This enables us to focus on the machine learning application code without the overhead of building and maintaining complex infrastructure code." New 2025-era framing of zFlow adds two architectural commitments beyond the 2021 + 2022 disclosures: (a) in-transit and at-rest encryption for all artifacts is provided by zFlow out of the box; (b) zFlow workloads can combine batch + real-time endpoints in one pipeline (the replenishment recommender runs a daily SageMaker Batch Transform plus an online SQS/Lambda path on the same feature store). First wiki instance of a zFlow pipeline using SageMaker Feature Store in both online and offline modes.