ZALANDO 2022-06-09

Zalando — Accelerate testing in Apache Airflow through DAG versioning

Summary

Zalando's Performance Marketing org runs a marketing ROI (return-on-investment) pipeline — a batch data + ML pipeline powered by Databricks Spark and orchestrated by Apache Airflow, composed of sub-pipelines owned by different cross-functional teams (applied science, engineering, product) and built in part on Zalando's zFlow Python SDK. Because the ROI output has no ground truth, any change to an input or a component must be validated by running the full pipeline end-to-end — which collides with the standard "one test server, one staging env" setup whenever multiple teams ship in parallel. Their fix is pipeline-environment versioning: each pull request gets its own named Airflow "environment" on the shared test server (created in <1 min, vs ~30 min to provision a fresh MWAA server), and a matching data environment of Spark databases with a per-PR suffix on S3. The mechanism is a fork of Airflow's DAG.__init__ that injects the feature-branch name into dag_id, combined with deploying the DAG code as a zip (Airflow's Packaging DAGs feature) so every PR is a separate, isolated, dependency-pinned deployable. A cron job watches PR status and deletes the environment (zip + unpackaged dir + all metastore DAG rows) when the PR closes. The data side uses views instead of copies — each new data environment is populated with CREATE VIEW db_attribution_feature1.m_events AS SELECT * FROM db_attribution_test.m_events for input tables, with a partition-ranged copy only for output tables that need seed data.

Key takeaways

  1. Per-PR Airflow environments replace the shared-test-env bottleneck. The baseline was two servers (prod + test) and one active DAG per environment; multiple in-flight features had to either share the test server (conflicts) or test sequentially (delays). A new environment is created on the existing test server in under one minute when a PR is opened; provisioning an MWAA server per PR would instead have taken ~30 minutes each, plus the extra compute cost. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
  2. Airflow has no native notion of "environment" — Zalando added one by forking dag.py. Their DAG class override reads the path of the Python file that initialised the DAG (e.g. /usr/local/airflow/dags/feature1.zip/qu/main/file.py), extracts the zip name (feature1), and rewrites the DAG id from qu.test_dag to qu.feature1.test_dag. It also appends the feature name as a tag so the UI can filter by environment. Same deployed file, different DAG id per environment; a sketch of the rewrite follows this list. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
  3. Zip-packaged DAG deploys give per-PR dependency isolation — at the cost of breaking Jinja. Airflow's Packaging DAGs feature ships a zip per branch (feature1.zip), and only the dependencies inside that zip are used — no cross-feature dependency conflicts. The downside is that Jinja-templated files can't be read from inside the zip; Zalando works around this by also deploying an unpackaged copy to a sibling directory and adding that path to template_searchpath (illustrated after this list). (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
  4. Environment cleanup is a separate cron, not an Airflow concern. A dedicated cron polls GitHub PR statuses; when a PR closes, it deletes the zip and the unpackaged directory, then uses the Airflow CLI against the metastore to drop the now-orphaned per-environment DAG rows (sketched after this list). This keeps environment-lifecycle logic out of the DAG library fork. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
  5. Data environments use views over copies to make creation cheap. The data layer is Spark databases stored on S3; each environment has a suffix (_live, _test, _feature1). Read tables in a new environment aren't copied — a script generates e.g. CREATE VIEW db_attribution_feature1.m_events AS SELECT * FROM db_attribution_test.m_events. Only tables that are written by tasks in this pipeline get a real table (optionally seeded with a partition range pulled from test); see the view-generation sketch after this list. This collapses what would otherwise be a multi-hour data copy into view DDL. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
  6. Pipeline-env ↔ data-env is a 1-to-1 binding. A pipeline environment (set of Airflow DAGs on one server) reads/writes exactly one data environment. The abstraction is two layers (compute / data) precisely because compute-side isolation (multiple DAGs per server) without data-side isolation (shared _test tables) would still produce cross-PR conflicts on the data layer. Both halves have to be versioned. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
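
A minimal sketch of the dag_id rewrite from takeaway 2, approximated as a DAG subclass rather than Zalando's actual fork of Airflow's dag.py; the class name VersionedDAG and the path regex are assumptions, not from the post:

```python
# Hypothetical approximation of Zalando's dag.py fork as a DAG subclass.
# Assumption: packaged DAGs live under /usr/local/airflow/dags/<env>.zip/.
import inspect
import re

from airflow.models.dag import DAG

ZIP_RE = re.compile(r"/dags/(?P<env>[^/]+)\.zip/")


class VersionedDAG(DAG):
    def __init__(self, dag_id: str, **kwargs):
        # Path of the Python file constructing this DAG, e.g.
        # /usr/local/airflow/dags/feature1.zip/qu/main/file.py
        caller_file = inspect.stack()[1].filename
        match = ZIP_RE.search(caller_file)
        if match:  # deployed as a packaged zip -> a feature environment
            env = match.group("env")                 # "feature1"
            team, _, suffix = dag_id.partition(".")  # expects "{team_name}.{dag_id_suffix}"
            dag_id = f"{team}.{env}.{suffix}"        # "qu.feature1.test_dag"
            tags = list(kwargs.get("tags") or [])
            tags.append(env)                         # lets the UI filter by environment
            kwargs["tags"] = tags
        super().__init__(dag_id=dag_id, **kwargs)
```

The same file deployed inside feature1.zip registers as qu.feature1.test_dag; deployed unzipped on the production server, the regex doesn't match and the DAG keeps its plain qu.test_dag id.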
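
For takeaway 3, a sketch of the Jinja workaround, assuming an Airflow 2.x DAG of the era; the _unpackaged directory name and paths are illustrative assumptions:

```python
# Illustrative DAG showing the template_searchpath workaround; the unpackaged
# sibling directory is deployed by CI alongside feature1.zip (path assumed).
from datetime import datetime

from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="qu.test_dag",
    start_date=datetime(2022, 6, 9),
    schedule_interval=None,
    # Jinja cannot open files inside feature1.zip, so point it at the
    # unzipped copy deployed next to the zip.
    template_searchpath=["/usr/local/airflow/dags/feature1_unpackaged/"],
) as dag:
    # "script.sh" has a templated extension, so Airflow loads it as a Jinja
    # template file, resolved against template_searchpath rather than the zip.
    run = BashOperator(task_id="run", bash_command="script.sh")
```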
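
For takeaway 4, a sketch of the cleanup cron under stated assumptions: the GitHub REST API for PR state, the stock airflow dags delete CLI command, and illustrative paths and dag-id conventions; none of this is Zalando's actual script:

```python
# Hypothetical cleanup job; paths, repo access, and the qu.<env>.<suffix>
# dag-id convention are assumptions for illustration.
import shutil
import subprocess
from pathlib import Path

import requests

DAGS_DIR = Path("/usr/local/airflow/dags")


def cleanup_environment(repo: str, pr_number: int, env: str, dag_suffixes: list[str]) -> None:
    # PR state via the GitHub REST API (a token would be needed for private repos).
    pr = requests.get(f"https://api.github.com/repos/{repo}/pulls/{pr_number}").json()
    if pr["state"] != "closed":
        return
    # 1. the zip deployment
    (DAGS_DIR / f"{env}.zip").unlink(missing_ok=True)
    # 2. the unpackaged sibling copy (the Jinja workaround)
    shutil.rmtree(DAGS_DIR / f"{env}_unpackaged", ignore_errors=True)
    # 3. the now-orphaned per-environment DAG rows in the metastore
    for suffix in dag_suffixes:
        subprocess.run(
            ["airflow", "dags", "delete", f"qu.{env}.{suffix}", "--yes"],
            check=True,
        )
```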
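
And for takeaway 5, a sketch of data-environment creation with views on Spark; the helper, its signature, and the date filter are assumptions — only the CREATE VIEW statement and the db_attribution naming come from the post:

```python
# Hypothetical generator for a per-PR data environment on Spark/Databricks.
def create_data_environment(spark, env, source_env, read_tables, write_tables):
    spark.sql(f"CREATE DATABASE IF NOT EXISTS db_attribution_{env}")
    # Input tables: views over the source environment -- no data is copied.
    for table in read_tables:
        spark.sql(
            f"CREATE VIEW db_attribution_{env}.{table} "
            f"AS SELECT * FROM db_attribution_{source_env}.{table}"
        )
    # Output tables: real tables the pipeline writes into, optionally seeded
    # with a partition range pulled from the source environment.
    for table in write_tables:
        spark.sql(
            f"CREATE TABLE db_attribution_{env}.{table} "
            f"AS SELECT * FROM db_attribution_{source_env}.{table} "
            f"WHERE date >= '2022-06-01'"  # illustrative seed range
        )

# e.g. create_data_environment(spark, "feature1", "test", ["m_events"], [...])
```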

Systems extracted

  • systems/zalando-marketing-roi-pipeline (new) — the ROI pipeline itself: Databricks Spark + Airflow, composed of sub-components (input data prep, attribution model, incremental profit forecast) owned by separate Performance Marketing teams, built partly on zFlow.
  • systems/apache-airflow — substrate; extended with per-branch DAG versioning via fork.
  • systems/zflow — Zalando's ML workflow SDK; some ROI sub-pipelines are built on it.
  • systems/aws-s3 — backing storage for Spark tables.
  • Databricks Spark — compute for the batch pipeline.
  • systems/mwaa (new) — named as the "trivial way" alternative (a fresh server per PR, ~30 min each), which this article's approach avoids.

Concepts extracted

Patterns extracted

Operational numbers

  • New environment creation time: under 1 minute on the existing test server, vs up to 30 minutes for a fresh MWAA server.
  • Environment lifetime: one open PR (minutes to days), not the hours of a typical ephemeral integration test but not the weeks of a long-lived branch env either.
  • Per-environment cost: the marginal cost of an extra zip + unpackaged dir + a handful of DAG rows in the Airflow metastore + the per-environment Spark databases (empty or view-based). No new servers.
  • Scope: the post describes two example DAGs (qu.test_dag, qu.test_dag_2) across three concurrent feature environments (feature1, feature2, feature3) — illustrative, not production scale.

Caveats

  • The DAG class fork is a maintenance tax: every Airflow upgrade requires re-applying (or re-validating) the dag.py patch. Zalando accepts this because no upstream mechanism exists.
  • Zip packaging breaks Jinja templated file reads out of the box. The workaround (deploy an unpackaged copy to template_searchpath) means two deployments per env — simple, but a gotcha worth knowing.
  • The mechanism lives on a single Airflow server — scale is bounded by how many per-branch DAGs the scheduler can parse and the metastore can hold. It's not a story about scale-out; it's a story about cheap isolation on shared infra.
  • The pattern assumes all DAGs are team-prefixed (qu.test_dag): the fork expects {team_name}.{dag_id_suffix} and injects feature_name in between. A flat DAG id namespace would need a different rewrite rule.
  • Post is from 2022; Airflow has since added features (dataset-aware scheduling, more ergonomic packaging) — but native per-PR environments remain out of scope for Airflow, so the architectural shape still matters.

Source

  • sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning