Zalando — Accelerate testing in Apache Airflow through DAG versioning
Summary
Zalando's Performance Marketing org runs a marketing ROI (return-on-investment) pipeline — a batch data + ML pipeline powered by Databricks Spark and orchestrated by Apache Airflow, composed of sub-pipelines owned by different cross-functional teams (applied science, engineering, product) and built in part on Zalando's zFlow Python SDK. Because the ROI output has no ground truth, any change to an input or a component must be validated by running the full pipeline end-to-end — which collides with the standard "one test server, one staging env" setup whenever multiple teams ship in parallel. Their fix is pipeline-environment versioning: each pull request gets its own named Airflow "environment" on the shared test server (created in <1 min, vs ~30 min to provision a fresh MWAA server), and a matching data environment of Spark databases with a per-PR suffix on S3. The mechanism is a fork of Airflow's `DAG.__init__` that injects the feature-branch name into `dag_id`, combined with deploying the DAG code as a zip (Airflow's Packaging DAGs feature) so every PR is a separate, isolated, dependency-pinned deployable. A cron job watches PR status and deletes the environment (zip + unpackaged dir + all metastore DAG rows) when the PR closes. The data side uses views instead of copies — each new data environment is populated with `CREATE VIEW db_attribution_feature1.m_events AS SELECT * FROM db_attribution_test.m_events` for input tables, with a partition-ranged copy only for output tables that need seed data.
Key takeaways
- Per-PR Airflow environments replace the shared-test-env bottleneck. The baseline was two servers (prod + test) with one active version of each DAG per environment; multiple in-flight features had to either share the test server (conflicts) or test sequentially (delays). A new environment is created on the existing test server in under one minute when a PR is opened; provisioning a fresh MWAA server per PR instead would have taken ~30 minutes each, plus compute cost. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
- Airflow has no native notion of "environment" — Zalando added one by forking `dag.py`. Their `DAG` class override reads the file path of the Python file that initialised the DAG (e.g. `/usr/local/airflow/dags/feature1.zip/qu/main/file.py`), extracts the zip name (`feature1`), and rewrites the DAG id from `qu.test_dag` to `qu.feature1.test_dag`. It also appends the feature name as a tag so the UI can filter by environment. Same deployed file, different DAG id per env; see the sketch after this list. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
- Zip-packaged DAG deploys give per-PR dependency isolation — at the cost of breaking Jinja. Airflow's Packaging DAGs feature ships a zip per branch (`feature1.zip`), and only the dependencies inside the zip are used — no cross-feature dependency conflict. The downside is that Jinja templated files can't be read from inside the zip; Zalando works around it by also deploying an unpackaged copy to a sibling directory and adding that path to `template_searchpath`. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
- Environment cleanup is a separate cron, not an Airflow concern. A dedicated cron polls GitHub PR statuses; when a PR closes, it deletes the zip, the unpackaged directory, and then uses the Airflow CLI against the metastore to drop the now-orphaned per-environment DAG rows. This keeps environment-lifecycle logic out of the DAG library fork. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
- Data environments use views over copies to make creation cheap. The data layer is Spark databases stored on S3; each env has a suffix (`_live`, `_test`, `_feature1`). Read tables in a new environment aren't copied — a script generates e.g. `CREATE VIEW db_attribution_feature1.m_events AS SELECT * FROM db_attribution_test.m_events`. Only tables that are written by tasks in this pipeline get a real table (optionally seeded with a partition range pulled from test). This collapses what would otherwise be a multi-hour data copy into view DDL. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
- Pipeline-env ↔ data-env is a 1-to-1 binding. A pipeline environment (set of Airflow DAGs on one server) reads/writes exactly one data environment. The abstraction is two layers (compute / data) precisely because compute-side isolation (multiple DAGs per server) without data-side isolation (shared `_test` tables) would still produce cross-PR conflicts on the data layer. Both halves have to be versioned. (Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning)
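A minimal sketch of the DAG-id rewrite from the second takeaway, written as a `DAG` subclass rather than Zalando's actual `dag.py` fork; the regex, the `inspect`-based caller lookup, and all helper names are assumptions for illustration, not the post's code:

```python
import inspect
import re

from airflow.models.dag import DAG as BaseDAG

# Matches ".../dags/<env>.zip/..." in the path of the file that built the DAG
# (hypothetical pattern; the post only shows the example path).
ENV_RE = re.compile(r"/dags/(?P<env>[^/]+)\.zip/")


class DAG(BaseDAG):
    """Drop-in DAG class that injects the feature-branch name into dag_id."""

    def __init__(self, dag_id: str, **kwargs):
        # Path of the Python file instantiating this DAG, e.g.
        # /usr/local/airflow/dags/feature1.zip/qu/main/file.py
        caller_file = inspect.stack()[1].filename
        match = ENV_RE.search(caller_file)
        if match:
            env = match.group("env")  # e.g. "feature1"
            # qu.test_dag -> qu.feature1.test_dag
            team, _, suffix = dag_id.partition(".")
            dag_id = f"{team}.{env}.{suffix}"
            # Tag the DAG so the UI can filter by environment.
            kwargs["tags"] = list(kwargs.get("tags") or []) + [env]
        super().__init__(dag_id=dag_id, **kwargs)
```

The same file deployed inside `feature1.zip` and `feature2.zip` then registers as `qu.feature1.test_dag` and `qu.feature2.test_dag`: two coexisting environments on one server.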
Systems extracted
- systems/zalando-marketing-roi-pipeline (new) — the ROI pipeline itself: Databricks Spark + Airflow, composed of sub-components (input data prep, attribution model, incremental profit forecast) owned by separate Performance Marketing teams, built partly on zFlow.
- systems/apache-airflow — substrate; extended with per-branch DAG versioning via fork.
- systems/zflow — Zalando's ML workflow SDK; some ROI sub-pipelines are built on it.
- systems/aws-s3 — backing storage for Spark tables.
- Databricks Spark — compute for the batch pipeline.
- systems/mwaa (new) — named as the "trivial way" alternative (~30 min / new server / per-PR), which this article's approach avoids.
Concepts extracted
- concepts/pipeline-environment (new) — a named version of a pipeline (set of Airflow DAGs) deployed to an Airflow server such that multiple versions can coexist.
- concepts/data-environment (new) — a set of Spark/Hive databases / tables / views (with a naming suffix) that a pipeline environment reads/writes.
- concepts/dag-id-rewriting (new) — injecting the environment name into every DAG's id at init so multiple versions of the "same" DAG coexist on one server.
- concepts/airflow-dag-zip-packaging (new) — Airflow's Packaging DAGs feature: each deployment is a zip containing its own dependencies, isolating cross-deployment dependency conflicts.
- concepts/per-pr-ephemeral-environment (new) — per-pull-request isolated environment on shared infrastructure, created on PR-open and destroyed on PR-close.
- concepts/view-based-data-environment (new) — populating a new data environment's read tables with `CREATE VIEW … AS SELECT *` over the source environment rather than copying rows; see the sketch after this list.
- concepts/shared-test-environment-contention (new) — the "one test env, many in-flight features" failure mode this design exists to fix.
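A sketch of what view-based data-environment creation could look like on the Spark side, assuming PySpark; the table lists, seed range, and function name are illustrative, and only the `CREATE VIEW` statement itself is taken from the post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def create_data_environment(
    env: str,                      # e.g. "feature1"
    source_env: str = "test",      # environment to read through / seed from
    db: str = "db_attribution",
    read_tables=("m_events",),     # illustrative: tables this pipeline only reads
    written_tables=("m_output",),  # illustrative: tables this pipeline writes
):
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {db}_{env}")
    # Read tables become views over the source env: no rows are copied.
    for table in read_tables:
        spark.sql(
            f"CREATE VIEW {db}_{env}.{table} "
            f"AS SELECT * FROM {db}_{source_env}.{table}"
        )
    # Written tables become real tables, optionally seeded with a partition
    # range pulled from test (the date filter below is illustrative).
    for table in written_tables:
        spark.sql(
            f"CREATE TABLE {db}_{env}.{table} "
            f"AS SELECT * FROM {db}_{source_env}.{table} "
            f"WHERE event_date >= '2022-06-01'"
        )
```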
Patterns extracted
- patterns/per-pr-airflow-environment-via-dag-versioning (new) — the headline pattern: share one Airflow server across many PRs by versioning DAG ids per branch.
- patterns/view-over-copy-for-test-data-environment (new) — create a new Spark data environment by emitting views over the source env's tables; only copy when a table is written by the pipeline under test.
- patterns/library-fork-for-dag-id-rewrite (new) — forking / monkey-patching the orchestrator library (here `airflow.DAG`) to inject environment awareness that upstream doesn't provide.
- patterns/cron-driven-pr-closed-cleanup (new) — decoupling "environment teardown on PR close" into a separate cron that polls VCS state rather than coupling it to orchestrator hooks; see the sketch after this list.
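The cleanup cron from the last pattern could look roughly like this; the repo name, paths, and the environment-to-DAG mapping are hypothetical, and only the three teardown steps (zip, unpackaged dir, metastore rows via the Airflow CLI) come from the post:

```python
import shutil
import subprocess
from pathlib import Path

import requests

DAGS_DIR = Path("/usr/local/airflow/dags")
REPO = "example-org/roi-pipeline"  # hypothetical repository


def pr_is_closed(pr_number: int) -> bool:
    """Poll GitHub for the PR state (assumes a public repo; real use needs auth)."""
    resp = requests.get(f"https://api.github.com/repos/{REPO}/pulls/{pr_number}")
    resp.raise_for_status()
    return resp.json()["state"] == "closed"


def teardown_environment(env: str, dag_ids: list[str]) -> None:
    # 1. Delete the zip-packaged deployment.
    (DAGS_DIR / f"{env}.zip").unlink(missing_ok=True)
    # 2. Delete the unpackaged copy kept for Jinja template lookups.
    shutil.rmtree(DAGS_DIR / env, ignore_errors=True)
    # 3. Drop the now-orphaned per-environment DAG rows from the metastore,
    #    e.g. dag_ids = ["qu.feature1.test_dag", "qu.feature1.test_dag_2"].
    for dag_id in dag_ids:
        subprocess.run(["airflow", "dags", "delete", dag_id, "--yes"], check=True)
```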
Operational numbers
- New environment creation time: under 1 minute on the existing test server, vs up to 30 minutes for a fresh MWAA server.
- Environment lifetime: one open PR (minutes to days), not the hours of a typical ephemeral integration test but not the weeks of a long-lived branch env either.
- Per-environment cost: the marginal cost of an extra zip + unpackaged dir + a handful of DAG rows in the Airflow metastore + the per-environment Spark databases (empty or view-based). No new servers.
- Scope: the post describes two example DAGs (`qu.test_dag`, `qu.test_dag_2`) across three concurrent feature environments (`feature1`, `feature2`, `feature3`) — illustrative, not production scale.
Caveats
- The DAG class fork is a maintenance tax: every Airflow upgrade requires re-applying (or re-validating) the `dag.py` patch. Zalando accepts this because no upstream mechanism exists.
- Zip packaging breaks Jinja templated file reads out of the box. The workaround (deploy an unpackaged copy and add its path to `template_searchpath`) means two deployments per env — simple, but a gotcha worth knowing; see the snippet after this list.
- The mechanism is single-Airflow-server only — scale is bounded by how many per-branch DAGs the scheduler can parse and the metastore can hold. It's not a story about scale-out; it's a story about cheap isolation on shared infra.
- The pattern assumes all DAGs are team-prefixed (`qu.test_dag`): the fork expects `{team_name}.{dag_id_suffix}` and injects `feature_name` in between. A flat DAG id namespace would need a different rewrite rule.
- The post is from 2022; Airflow has since added features (dataset-aware scheduling, more ergonomic packaging), but native per-PR environments remain out of scope for Airflow, so the architectural shape still matters.
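For the Jinja caveat, the workaround amounts to one extra DAG argument; a minimal sketch, assuming Airflow 2.x and illustrative paths:

```python
from datetime import datetime

from airflow.models.dag import DAG

with DAG(
    dag_id="qu.test_dag",
    start_date=datetime(2022, 6, 9),
    schedule_interval=None,
    # Templated files (e.g. SQL) can't be read from inside feature1.zip,
    # so point Jinja at the unpackaged sibling copy instead.
    template_searchpath=["/usr/local/airflow/dags/feature1"],
) as dag:
    ...
```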
Source
- Original: https://engineering.zalando.com/posts/2022/06/accelerate-apache-airflow-testing-through-dag-versioning.html
- Raw markdown: `raw/zalando/2022-06-09-accelerate-testing-in-apache-airflow-through-dag-versioning-ea234779.md`
Related
- systems/apache-airflow
- systems/zalando-marketing-roi-pipeline
- systems/zflow
- systems/mwaa
- concepts/pipeline-environment
- concepts/data-environment
- concepts/dag-id-rewriting
- concepts/airflow-dag-zip-packaging
- concepts/per-pr-ephemeral-environment
- concepts/view-based-data-environment
- concepts/shared-test-environment-contention
- patterns/per-pr-airflow-environment-via-dag-versioning
- patterns/view-over-copy-for-test-data-environment
- patterns/library-fork-for-dag-id-rewrite
- patterns/cron-driven-pr-closed-cleanup
- companies/zalando