SYSTEM Cited by 1 source

Zalando Marketing ROI Pipeline

What

The marketing ROI (return on investment) pipeline of Zalando's Performance Marketing department: a batch data and machine-learning pipeline that measures the return on paid advertising campaigns. Compute runs on Databricks Spark, orchestration on Apache Airflow, and the data layer consists of Spark databases backed by S3.

The pipeline is composed of sub-pipelines ("components") owned by different cross-functional teams (applied science, engineering, product) within Performance Marketing. Named examples in the source post:

  • Input data preparation
  • Marketing attribution model
  • Incremental profit forecast for campaigns

Some components are built using Zalando's in-house Python SDK zFlow.
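
The component structure above can be sketched as a small dependency graph that is always executed end-to-end. This is a minimal illustration in plain Python, not Zalando's implementation: the component identifiers, the dependency edges between them, and the `run_order` / `run_pipeline` helpers are all assumptions made for the example, since the source names the components but does not state how they depend on each other.

```python
from graphlib import TopologicalSorter

# Hypothetical component names and dependency edges (component -> upstream
# components). The source lists these components but not their ordering.
COMPONENTS = {
    "input_data_preparation": set(),
    "marketing_attribution_model": {"input_data_preparation"},
    "incremental_profit_forecast": {"marketing_attribution_model"},
}


def run_order(components):
    """Return component names in an order that respects the dependencies."""
    return list(TopologicalSorter(components).static_order())


def run_pipeline(components, runners):
    """Run every component in dependency order and collect the outputs.

    `runners` maps a component name to a callable that receives a dict of
    its upstream outputs and returns that component's own output.
    """
    results = {}
    for name in run_order(components):
        upstream = {dep: results[dep] for dep in components[name]}
        results[name] = runners[name](upstream)
    return results


if __name__ == "__main__":
    # Stand-in runners that only record which upstream outputs they saw.
    runners = {
        name: (lambda upstream, n=name: {"component": n, "inputs": sorted(upstream)})
        for name in COMPONENTS
    }
    for name, output in run_pipeline(COMPONENTS, runners).items():
        print(name, output)
```

Under these assumed edges a change to any upstream component affects everything downstream of it, so a change can only be judged by running the whole graph.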

Why it's interesting on the wiki

The ROI output has no ground truth: there is no oracle to compare a new pipeline version against. To validate any change to an input or a component, the whole pipeline has to be run end-to-end and its output inspected. That is the forcing function for Zalando's per-PR Airflow environment work. When multiple teams edit different components of the same pipeline in parallel, they cannot share a single test environment without conflicts, and provisioning a new MWAA-style Airflow server for every PR would take ~30 min and incur real cost per PR.
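
One way to picture how parallel PRs could avoid colliding in a shared environment is to give every PR its own suffixed DAG id and output database. The sketch below is an assumption made for illustration only: the `__pr_<number>` naming scheme, the `versioned_name` and `versioned_targets` helpers, and the example names are hypothetical and are not taken from the source, which describes the actual mechanism in the linked post.

```python
from typing import Dict, Optional


def versioned_name(base: str, pr_number: Optional[int] = None) -> str:
    """Append a per-PR suffix so several versions of a name can coexist.

    Hypothetical naming scheme: the main branch keeps the plain name and
    each PR gets `<base>__pr_<number>`.
    """
    if pr_number is None:
        return base
    if pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    return f"{base}__pr_{pr_number}"


def versioned_targets(
    dag_id: str, database: str, pr_number: Optional[int] = None
) -> Dict[str, str]:
    """Return the DAG id and output database a given PR would use."""
    return {
        "dag_id": versioned_name(dag_id, pr_number),
        "database": versioned_name(database, pr_number),
    }


if __name__ == "__main__":
    # Two PRs editing different components get disjoint DAG ids and databases.
    print(versioned_targets("marketing_roi", "roi_output", pr_number=101))
    print(versioned_targets("marketing_roi", "roi_output", pr_number=102))
    print(versioned_targets("marketing_roi", "roi_output"))
```

With names separated like this, each PR can run the full pipeline and have its output inspected without overwriting another team's test run, and no new server has to be provisioned per PR.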

See sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning for the full architecture.
