Per-PR Airflow environment via DAG versioning

Pattern

Give each pull request its own isolated pipeline environment on a shared Airflow server (not a per-PR server) by:

  1. Packaging the branch's DAG code as a single zip named for the branch (feature1.zip) — DAG zip packaging.
  2. Forking airflow.models.DAG so that every DAG's id is rewritten at init to inject the branch name (qu.test_dag → qu.feature1.test_dag) — DAG id rewriting via library fork.
  3. Pairing the pipeline env with a matching data environment using views over copies so creation is cheap.
  4. Cleaning up on PR close via a separate cron that deletes the zip, the unpackaged directory, and all metastore DAG rows for the environment.
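
A rough sketch of step 1, plus deploying the unpackaged copy that the compute-side section below relies on; the paths, the `_files` directory suffix, and the `deploy_branch` helper are illustrative assumptions rather than the article's actual tooling.

```python
import shutil
from pathlib import Path

def deploy_branch(dag_src: Path, dags_folder: Path, branch: str) -> None:
    """Package the branch's DAG code as <branch>.zip in the shared dags/ folder,
    plus an unpackaged copy beside it so Jinja file templates still resolve."""
    # shutil.make_archive appends ".zip" to the base name itself
    shutil.make_archive(str(dags_folder / branch), "zip", root_dir=dag_src)
    shutil.copytree(dag_src, dags_folder / f"{branch}_files", dirs_exist_ok=True)

# e.g. in CI, with the branch name taken from the pull request:
# deploy_branch(Path("./dags"), Path("/opt/airflow/dags"), "feature1")
```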

Creation time: <1 minute per PR. Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning.

Problem it solves

Batch pipelines with no ground truth (Zalando's marketing ROI pipeline being the canonical instance) require full end-to-end runs to validate any component change. Multiple teams shipping in parallel on one shared test server means shared-test-environment contention. Two practical alternatives and their costs:

Approach | Cost | Verdict
Serialise PRs on one shared test env | Delivery velocity collapses at 2+ in-flight PRs | Unacceptable
Per-PR MWAA server | ~30 min provisioning + $ per PR | Too slow / too expensive
Per-PR env on shared Airflow via DAG versioning | <1 min, no new compute | This pattern

The two halves

Compute-side isolation (pipeline env):

  • One zip per branch, deployed to the shared scheduler's dags/ folder.
  • DAG id rewritten at init from {team}.{rest} to {team}.{branch}.{rest}.
  • Branch name also appended as a tag so the UI can filter by env.
  • An unpackaged copy deployed beside the zip and added to template_searchpath so Jinja-templated file reads still work.
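
A minimal sketch of the rewrite the forked class performs, shown here as a subclass for readability; how the fork discovers the branch name is not spelled out in the source, so inferring it from the zip path of the defining file, and the /opt/airflow/dags/<branch>_files location for the unpackaged copy, are assumptions.

```python
import inspect
import re
from airflow.models.dag import DAG as BaseDAG

# A DAG file parsed from /opt/airflow/dags/feature1.zip/test_dag.py belongs to
# branch "feature1"; a file outside any zip belongs to the main environment.
_ZIP_RE = re.compile(r"([^/]+)\.zip")

def _branch_of_caller():
    caller_file = inspect.stack()[2].filename  # the file defining the DAG
    match = _ZIP_RE.search(caller_file)
    return match.group(1) if match else None

class DAG(BaseDAG):
    """Stand-in for the forked airflow.models.DAG: rewrites the id at init."""

    def __init__(self, dag_id: str, **kwargs):
        branch = _branch_of_caller()
        if branch:
            # assumes team-prefixed ids, e.g. "qu.test_dag" (see Tradeoffs)
            team, rest = dag_id.split(".", 1)
            dag_id = f"{team}.{branch}.{rest}"   # -> "qu.feature1.test_dag"
            # tag with the branch so the UI can filter by environment
            kwargs["tags"] = list(kwargs.get("tags") or []) + [branch]
            # point Jinja at the unpackaged copy deployed beside the zip
            search = kwargs.get("template_searchpath") or []
            if isinstance(search, str):
                search = [search]
            kwargs["template_searchpath"] = search + [f"/opt/airflow/dags/{branch}_files"]
        super().__init__(dag_id=dag_id, **kwargs)
```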

Data-side isolation (data env):

  • Per-branch suffixed Spark/Hive databases (db_attribution_feature1).
  • Read tables populated via CREATE VIEW … AS SELECT * FROM _test — no data motion.
  • Output tables get real tables, optionally seeded with a partition range.

1-to-1 binding: pipeline env feature1 reads/writes data env feature1. Isolating only one half re-creates the contention.
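
A sketch of creating the matching data env with Spark SQL from Python; the db_attribution_test source database, the table lists, and the seed_filter predicate are illustrative, only the views-for-reads / real-tables-for-outputs split comes from the pattern.

```python
from pyspark.sql import SparkSession

def create_data_env(spark: SparkSession, branch: str,
                    read_tables: list, output_tables: list,
                    seed_filter: str = "") -> None:
    env_db = f"db_attribution_{branch}"          # e.g. db_attribution_feature1
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {env_db}")

    # Read side: views over the shared test copies, so no data is moved.
    for t in read_tables:
        spark.sql(
            f"CREATE OR REPLACE VIEW {env_db}.{t} "
            f"AS SELECT * FROM db_attribution_test.{t}"
        )

    # Write side: real (initially empty) tables the branch pipeline can overwrite,
    # optionally seeded with a bounded partition range.
    for t in output_tables:
        spark.sql(
            f"CREATE TABLE {env_db}.{t} "
            f"AS SELECT * FROM db_attribution_test.{t} WHERE 1 = 0"
        )
        if seed_filter:
            spark.sql(
                f"INSERT INTO {env_db}.{t} "
                f"SELECT * FROM db_attribution_test.{t} WHERE {seed_filter}"
            )
```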

Cleanup

A separate cron polls GitHub PR status. When a PR closes, the cron:

  1. Deletes feature1.zip from the dags/ folder.
  2. Deletes the unpackaged directory.
  3. Queries the Airflow metastore for all DAGs tagged feature1 and deletes them via the Airflow CLI.

Cleanup is intentionally out-of-band from the DAG class fork — lifecycle logic doesn't belong in a library monkey-patch.
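
A sketch of what the cleanup job could run once it sees a closed PR (the GitHub polling itself is omitted); the dags/ path, metastore URL, and the dag_tag lookup are assumptions, while the final step shells out to the standard airflow dags delete CLI command.

```python
import shutil
import subprocess
from pathlib import Path
from sqlalchemy import create_engine, text

DAGS_FOLDER = Path("/opt/airflow/dags")
METASTORE_URL = "postgresql://airflow:***@metastore/airflow"  # placeholder

def cleanup_branch_env(branch: str) -> None:
    # 1. Delete the branch zip from the shared dags/ folder.
    (DAGS_FOLDER / f"{branch}.zip").unlink(missing_ok=True)

    # 2. Delete the unpackaged copy deployed beside it.
    shutil.rmtree(DAGS_FOLDER / f"{branch}_files", ignore_errors=True)

    # 3. Find every DAG tagged with the branch and remove it via the Airflow CLI,
    #    which clears its rows (runs, task instances, tags, ...) from the metastore.
    engine = create_engine(METASTORE_URL)
    with engine.connect() as conn:
        rows = conn.execute(
            text("SELECT dag_id FROM dag_tag WHERE name = :tag"), {"tag": branch}
        )
        for (dag_id,) in rows:
            subprocess.run(["airflow", "dags", "delete", dag_id, "--yes"], check=True)
```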

Tradeoffs

  • Cheap isolation — no new servers, <1 min creation.
  • Per-branch dependency isolation — zips don't share Python dependencies with each other or the scheduler.
  • Arbitrary parallel PRs — bounded only by scheduler parse load, not by server count.
  • Airflow library fork is a maintenance tax — every Airflow upgrade requires re-applying the dag.py patch.
  • Jinja file-template gotcha — requires the unpackaged-copy workaround.
  • Requires team-prefixed DAG id schema — flat namespaces need a different rewrite rule.
  • Scheduler parse load grows with PR count — bounded by the single-server throughput.
  • Metastore bloat — thousands of per-PR DAG rows if cleanup cron falls behind.

Seen in

sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning