ZALANDO 2022-06-09

Zalando — Accelerate testing in Apache Airflow through DAG versioning

Summary

Zalando's Performance Marketing org runs a marketing ROI (return-on-investment) pipeline — a batch data + ML pipeline powered by Databricks Spark and orchestrated by Apache Airflow, composed of sub-pipelines owned by different cross-functional teams (applied science, engineering, product) and built in part on Zalando's zFlow Python SDK. Because the ROI output has no ground truth, any change to an input or a component must be validated by running the full pipeline end-to-end — which collides with the standard "one test server, one staging env" setup whenever multiple teams ship in parallel. Their fix is pipeline-environment versioning: each pull request gets its own named Airflow "environment" on the shared test server (created in <1 min, vs ~30 min to provision a fresh MWAA server), and a matching data environment of Spark databases with a per-PR suffix on S3. The mechanism is a fork of Airflow's DAG.__init__ that injects the feature-branch name into dag_id, combined with deploying the DAG code as a zip (Airflow's Packaging DAGs feature) so every PR is a separate, isolated, dependency-pinned deployable. A cron job watches PR status and deletes the environment (zip + unpackaged dir + all metastore DAG rows) when the PR closes. The data side uses views instead of copies — each new data environment is populated with CREATE VIEW db_attribution_feature1.m_events AS SELECT * FROM db_attribution_test.m_events for input tables, with a partition-ranged copy only for output tables that need seed data.

Key takeaways

  1. Per-PR Airflow environments replace the shared-test-env bottleneck. The baseline was two servers (prod + test) and one active DAG per environment; multiple in-flight features had to either share the test server (conflicts) or test sequentially (delays). A new environment is created on the existing test server in under one minute when a PR is opened; provisioning an MWAA server per PR would instead have taken ~30 minutes each, plus the extra compute cost. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
  2. Airflow has no native notion of "environment" — Zalando added one by forking dag.py. Their DAG class override reads the path of the Python file that initialised the DAG (e.g. /usr/local/airflow/dags/feature1.zip/qu/main/file.py), extracts the zip name (feature1), and rewrites the DAG id from qu.test_dag to qu.feature1.test_dag. It also appends the feature name as a tag so the UI can filter by environment. Same deployed file, different DAG id per environment; a sketch of the rewrite follows this list. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
  3. Zip-packaged DAG deploys give per-PR dependency isolation — at the cost of breaking Jinja. Airflow's Packaging DAGs feature ships a zip per branch (feature1.zip), and only the dependencies inside that zip are used — no cross-feature dependency conflicts. The downside is that Jinja-templated files can't be read from inside the zip; Zalando works around this by also deploying an unpackaged copy to a sibling directory and adding that path to template_searchpath (illustrated after this list). (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
  4. Environment cleanup is a separate cron, not an Airflow concern. A dedicated cron polls GitHub PR statuses; when a PR closes, it deletes the zip and the unpackaged directory, then uses the Airflow CLI against the metastore to drop the now-orphaned per-environment DAG rows (sketched after this list). This keeps environment-lifecycle logic out of the DAG library fork. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
  5. Data environments use views over copies to make creation cheap. The data layer is Spark databases stored on S3; each environment has a suffix (_live, _test, _feature1). Read tables in a new environment aren't copied — a script generates e.g. CREATE VIEW db_attribution_feature1.m_events AS SELECT * FROM db_attribution_test.m_events. Only tables that are written by tasks in this pipeline get a real table (optionally seeded with a partition range pulled from test); see the view-generation sketch after this list. This collapses what would otherwise be a multi-hour data copy into view DDL. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
  6. Pipeline-env ↔ data-env is a 1-to-1 binding. A pipeline environment (set of Airflow DAGs on one server) reads/writes exactly one data environment. The abstraction is two layers (compute / data) precisely because compute-side isolation (multiple DAGs per server) without data-side isolation (shared _test tables) would still produce cross-PR conflicts on the data layer. Both halves have to be versioned. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
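
A minimal sketch of the dag_id rewrite from takeaway 2, approximated as a DAG subclass rather than Zalando's actual fork of Airflow's dag.py; the class name VersionedDAG and the path regex are assumptions, not from the post:

```python
# Hypothetical approximation of Zalando's dag.py fork as a DAG subclass.
# Assumption: packaged DAGs live under /usr/local/airflow/dags/<env>.zip/.
import inspect
import re

from airflow.models.dag import DAG

ZIP_RE = re.compile(r"/dags/(?P<env>[^/]+)\.zip/")


class VersionedDAG(DAG):
    def __init__(self, dag_id: str, **kwargs):
        # Path of the Python file constructing this DAG, e.g.
        # /usr/local/airflow/dags/feature1.zip/qu/main/file.py
        caller_file = inspect.stack()[1].filename
        match = ZIP_RE.search(caller_file)
        if match:  # deployed as a packaged zip -> a feature environment
            env = match.group("env")                 # "feature1"
            team, _, suffix = dag_id.partition(".")  # expects "{team_name}.{dag_id_suffix}"
            dag_id = f"{team}.{env}.{suffix}"        # "qu.feature1.test_dag"
            tags = list(kwargs.get("tags") or [])
            tags.append(env)                         # lets the UI filter by environment
            kwargs["tags"] = tags
        super().__init__(dag_id=dag_id, **kwargs)
```

The same file deployed inside feature1.zip registers as qu.feature1.test_dag; deployed unzipped on the production server, the regex doesn't match and the DAG keeps its plain qu.test_dag id.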
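
For takeaway 3, a sketch of the Jinja workaround, assuming an Airflow 2.x DAG of the era; the _unpackaged directory name and paths are illustrative assumptions:

```python
# Illustrative DAG showing the template_searchpath workaround; the unpackaged
# sibling directory is deployed by CI alongside feature1.zip (path assumed).
from datetime import datetime

from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="qu.test_dag",
    start_date=datetime(2022, 6, 9),
    schedule_interval=None,
    # Jinja cannot open files inside feature1.zip, so point it at the
    # unzipped copy deployed next to the zip.
    template_searchpath=["/usr/local/airflow/dags/feature1_unpackaged/"],
) as dag:
    # "script.sh" has a templated extension, so Airflow loads it as a Jinja
    # template file, resolved against template_searchpath rather than the zip.
    run = BashOperator(task_id="run", bash_command="script.sh")
```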
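
For takeaway 4, a sketch of the cleanup cron under stated assumptions: the GitHub REST API for PR state, the stock airflow dags delete CLI command, and illustrative paths and dag-id conventions; none of this is Zalando's actual script:

```python
# Hypothetical cleanup job; paths, repo access, and the qu.<env>.<suffix>
# dag-id convention are assumptions for illustration.
import shutil
import subprocess
from pathlib import Path

import requests

DAGS_DIR = Path("/usr/local/airflow/dags")


def cleanup_environment(repo: str, pr_number: int, env: str, dag_suffixes: list[str]) -> None:
    # PR state via the GitHub REST API (a token would be needed for private repos).
    pr = requests.get(f"https://api.github.com/repos/{repo}/pulls/{pr_number}").json()
    if pr["state"] != "closed":
        return
    # 1. the zip deployment
    (DAGS_DIR / f"{env}.zip").unlink(missing_ok=True)
    # 2. the unpackaged sibling copy (the Jinja workaround)
    shutil.rmtree(DAGS_DIR / f"{env}_unpackaged", ignore_errors=True)
    # 3. the now-orphaned per-environment DAG rows in the metastore
    for suffix in dag_suffixes:
        subprocess.run(
            ["airflow", "dags", "delete", f"qu.{env}.{suffix}", "--yes"],
            check=True,
        )
```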
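
And for takeaway 5, a sketch of data-environment creation with views on Spark; the helper, its signature, and the date filter are assumptions — only the CREATE VIEW statement and the db_attribution naming come from the post:

```python
# Hypothetical generator for a per-PR data environment on Spark/Databricks.
def create_data_environment(spark, env, source_env, read_tables, write_tables):
    spark.sql(f"CREATE DATABASE IF NOT EXISTS db_attribution_{env}")
    # Input tables: views over the source environment -- no data is copied.
    for table in read_tables:
        spark.sql(
            f"CREATE VIEW db_attribution_{env}.{table} "
            f"AS SELECT * FROM db_attribution_{source_env}.{table}"
        )
    # Output tables: real tables the pipeline writes into, optionally seeded
    # with a partition range pulled from the source environment.
    for table in write_tables:
        spark.sql(
            f"CREATE TABLE db_attribution_{env}.{table} "
            f"AS SELECT * FROM db_attribution_{source_env}.{table} "
            f"WHERE date >= '2022-06-01'"  # illustrative seed range
        )

# e.g. create_data_environment(spark, "feature1", "test", ["m_events"], [...])
```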

Systems extracted

  • systems/zalando-marketing-roi-pipeline (new) — the ROI pipeline itself: Databricks Spark + Airflow, composed of sub-components (input data prep, attribution model, incremental profit forecast) owned by separate Performance Marketing teams, built partly on zFlow.
  • systems/apache-airflow — substrate; extended with per-branch DAG versioning via fork.
  • systems/zflow — Zalando's ML workflow SDK; some ROI sub-pipelines are built on it.
  • systems/aws-s3 — backing storage for Spark tables.
  • Databricks Spark — compute for the batch pipeline.
  • systems/mwaa (new) — named as the "trivial way" alternative (a fresh server per PR, ~30 min each), which this article's approach avoids.

Concepts extracted

Patterns extracted

Operational numbers

  • New environment creation time: under 1 minute on the existing test server, vs up to 30 minutes for a fresh MWAA server.
  • Environment lifetime: one open PR (minutes to days), not the hours of a typical ephemeral integration test but not the weeks of a long-lived branch env either.
  • Per-environment cost: the marginal cost of an extra zip + unpackaged dir + a handful of DAG rows in the Airflow metastore + the per-environment Spark databases (empty or view-based). No new servers.
  • Scope: the post describes two example DAGs (qu.test_dag, qu.test_dag_2) across three concurrent feature environments (feature1, feature2, feature3) — illustrative, not production scale.

Caveats

  • The DAG class fork is a maintenance tax: every Airflow upgrade requires re-applying (or re-validating) the dag.py patch. Zalando accepts this because no upstream mechanism exists.
  • Zip packaging breaks Jinja templated file reads out of the box. The workaround (deploy an unpackaged copy to template_searchpath) means two deployments per env — simple, but a gotcha worth knowing.
  • The mechanism lives on a single Airflow server — scale is bounded by how many per-branch DAGs the scheduler can parse and the metastore can hold. It's not a story about scale-out; it's a story about cheap isolation on shared infra.
  • The pattern assumes all DAGs are team-prefixed (qu.test_dag): the fork expects {team_name}.{dag_id_suffix} and injects feature_name in between. A flat DAG id namespace would need a different rewrite rule.
  • Post is from 2022; Airflow has since added features (dataset-aware scheduling, more ergonomic packaging) — but native per-PR environments remain out of scope for Airflow, so the architectural shape still matters.

Source

  • sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning