Per-PR Airflow environment via DAG versioning¶
Pattern¶
Give each pull request its own isolated pipeline environment on a shared Airflow server (not a per-PR server) by:
- Packaging the branch's DAG code as a single zip named for the branch (`feature1.zip`) — DAG zip packaging.
- Forking `airflow.models.DAG` so that every DAG's id is rewritten at init to inject the branch name (`qu.test_dag` → `qu.feature1.test_dag`) — DAG id rewriting via library fork.
- Pairing the pipeline env with a matching data environment that uses views over copies, so creation is cheap.
- Cleaning up on PR close via a separate cron that deletes the zip, the unpackaged directory, and all metastore DAG rows for the environment.
Creation time: <1 minute per PR. Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning.
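The id rewrite at the heart of the pattern can be sketched as a pure function (a hypothetical helper; in the actual fork this logic lives inside `airflow.models.DAG.__init__`):

```python
def rewrite_dag_id(dag_id: str, branch: str) -> str:
    """Inject the branch name after the team prefix.

    Mirrors the rewrite the forked airflow.models.DAG applies at init:
    "qu.test_dag" with branch "feature1" becomes "qu.feature1.test_dag".
    Assumes team-prefixed DAG ids ({team}.{rest}); the helper name and
    signature are illustrative, not from the source.
    """
    team, rest = dag_id.split(".", 1)
    return f"{team}.{branch}.{rest}"
```

Because every DAG on the shared server passes through the same constructor, no per-DAG code changes are needed for a branch to get its own namespace.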
Problem it solves¶
Batch pipelines with no ground truth (Zalando's marketing ROI pipeline being the canonical instance) require full end-to-end runs to validate any component change. Multiple teams shipping in parallel on one shared test server → shared test environment contention. The approaches and their costs:
| Approach | Cost | Verdict |
|---|---|---|
| Serialise PRs on one shared test env | Delivery velocity collapses at 2+ in-flight PRs | Unacceptable |
| Per-PR MWAA server | ~30 min provisioning + $ per PR | Too slow / too expensive |
| Per-PR env on shared Airflow via DAG versioning | <1 min, no new compute | This pattern |
The two halves¶
Compute-side isolation (pipeline env):
- One zip per branch, deployed to the shared scheduler's `dags/` folder.
- DAG id rewritten at init from `{team}.{rest}` to `{team}.{branch}.{rest}`.
- Branch name also appended as a tag so the UI can filter by env.
- An unpackaged copy deployed beside the zip and added to `template_searchpath` so Jinja templated file reads still work.
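One way the forked DAG class could discover which branch env it is running in is from the zip it was loaded from — when Airflow parses `dags/feature1.zip`, module `__file__` paths contain the zip name. A minimal sketch, assuming this path-based detection (the source does not specify the mechanism):

```python
import os
from typing import Optional


def branch_from_dag_file(dag_file_path: str) -> Optional[str]:
    """Infer the per-PR env name from the DAG file's path.

    A DAG loaded from dags/feature1.zip has the zip name as a path
    component; a plain .py deploy does not, which we treat here as
    the default (main) environment. Illustrative helper only.
    """
    for part in dag_file_path.split(os.sep):
        if part.endswith(".zip"):
            return part[: -len(".zip")]
    return None
```

With this, the same DAG source file produces `qu.test_dag` on main and `qu.feature1.test_dag` when packaged as `feature1.zip`, with no branch-specific code.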
Data-side isolation (data env):
- Per-branch suffixed Spark/Hive databases (`db_attribution_feature1`).
- Read tables populated via `CREATE VIEW … AS SELECT * FROM _test` — no data motion.
- Output tables get real tables, optionally seeded with a partition range.
1-to-1 binding: pipeline env `feature1` reads/writes data env `feature1`. Isolating only one half re-creates the contention.
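The data-env creation can be sketched as a DDL generator. A minimal sketch, assuming the shared test data lives in a `{base_db}_test` database and that write tables can be cloned with `CREATE TABLE … LIKE` (table names, the `_test` suffix convention, and the function itself are assumptions, not from the source):

```python
def data_env_ddl(base_db, branch, read_tables, write_tables):
    """Generate Hive/Spark SQL for a per-branch data environment.

    Read tables become views over the shared test data (cheap, no
    data motion); write tables become real, initially empty tables
    the branch's pipeline can populate. Illustrative sketch only.
    """
    db = f"{base_db}_{branch}"
    stmts = [f"CREATE DATABASE IF NOT EXISTS {db}"]
    for t in read_tables:
        stmts.append(f"CREATE VIEW {db}.{t} AS SELECT * FROM {base_db}_test.{t}")
    for t in write_tables:
        stmts.append(f"CREATE TABLE {db}.{t} LIKE {base_db}_test.{t}")
    return stmts
```

Because reads are views, creating the whole data env is metadata-only, which is what keeps per-PR env creation under a minute.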
Cleanup¶
A separate cron polls GitHub PR status. When a PR closes, the cron:
- Deletes `feature1.zip` from the `dags/` folder.
- Deletes the unpackaged directory.
- Queries the Airflow metastore for all DAGs tagged `feature1` and deletes them via the Airflow CLI.
Cleanup is intentionally out-of-band from the DAG class fork — lifecycle logic doesn't belong in a library monkey-patch.
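The cleanup steps above can be sketched as a command builder the cron would execute per closed PR. A sketch under assumptions: the `dags/` path is illustrative, the tagged DAG ids come from a prior metastore query, and removal uses the Airflow 2 `airflow dags delete` CLI command:

```python
def cleanup_commands(dags_folder, branch, tagged_dag_ids):
    """Build the shell commands run when the PR for `branch` closes.

    tagged_dag_ids is the result of querying the metastore for DAGs
    tagged with the branch name. Illustrative sketch; paths and CLI
    flags are assumptions, not from the source.
    """
    cmds = [
        f"rm {dags_folder}/{branch}.zip",        # the branch's DAG zip
        f"rm -r {dags_folder}/{branch}",         # the unpackaged copy
    ]
    for dag_id in tagged_dag_ids:
        cmds.append(f"airflow dags delete {dag_id} --yes")  # metastore rows
    return cmds
```

Keeping this in a standalone cron, rather than in the forked DAG class, is what the "out-of-band" note above means in practice: the library fork only rewrites ids, and never owns lifecycle.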
Tradeoffs¶
- ✅ Cheap isolation — no new servers, <1 min creation.
- ✅ Per-branch dependency isolation — zips don't share Python dependencies with each other or the scheduler.
- ✅ Arbitrary parallel PRs — bounded only by scheduler parse load, not by server count.
- ❌ Airflow library fork is a maintenance tax — every Airflow upgrade requires re-applying the fork's changes to `dag.py`.
- ❌ Jinja file-template gotcha — templated file reads require the unpackaged-copy workaround.
- ❌ Requires team-prefixed DAG id schema — flat namespaces need a different rewrite rule.
- ❌ Scheduler parse load grows with PR count — bounded by the single-server throughput.
- ❌ Metastore bloat — thousands of per-PR DAG rows if cleanup cron falls behind.
Related patterns¶
- patterns/library-fork-for-dag-id-rewrite — the mechanism that makes the DAG id rewrite possible.
- patterns/view-over-copy-for-test-data-environment — the data-side half.
- patterns/cron-driven-pr-closed-cleanup — the lifecycle half.
- patterns/ephemeral-preview-environments — related shape for application/service stacks (IaC-defined preview stack per PR), distinct substrate.
Seen in¶
- sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning — Zalando Performance Marketing runs its marketing ROI pipeline under this exact shape.