Per-PR Airflow environment via DAG versioning

Pattern

Give each pull request its own isolated pipeline environment on a shared Airflow server (not a per-PR server) by:

  1. Packaging the branch's DAG code as a single zip named for the branch (feature1.zip) — DAG zip packaging.
  2. Forking airflow.models.DAG so that every DAG's id is rewritten at init to inject the branch name (qu.test_dag → qu.feature1.test_dag) — DAG id rewriting via library fork.
  3. Pairing the pipeline env with a matching data environment using views over copies so creation is cheap.
  4. Cleaning up on PR close via a separate cron that deletes the zip, the unpackaged directory, and all metastore DAG rows for the environment.
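
A rough sketch of step 1, plus deploying the unpackaged copy that the compute-side section below relies on; the paths, the `_files` directory suffix, and the `deploy_branch` helper are illustrative assumptions rather than the article's actual tooling.

```python
import shutil
from pathlib import Path

def deploy_branch(dag_src: Path, dags_folder: Path, branch: str) -> None:
    """Package the branch's DAG code as <branch>.zip in the shared dags/ folder,
    plus an unpackaged copy beside it so Jinja file templates still resolve."""
    # shutil.make_archive appends ".zip" to the base name itself
    shutil.make_archive(str(dags_folder / branch), "zip", root_dir=dag_src)
    shutil.copytree(dag_src, dags_folder / f"{branch}_files", dirs_exist_ok=True)

# e.g. in CI, with the branch name taken from the pull request:
# deploy_branch(Path("./dags"), Path("/opt/airflow/dags"), "feature1")
```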

Creation time: <1 minute per PR. Source: sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning.

Problem it solves

Batch pipelines with no ground truth (Zalando's marketing ROI pipeline being the canonical instance) require full end-to-end runs to validate any component change. Multiple teams shipping in parallel on one shared test server means shared-test-environment contention. Two practical alternatives and their costs:

Approach | Cost | Verdict
Serialise PRs on one shared test env | Delivery velocity collapses at 2+ in-flight PRs | Unacceptable
Per-PR MWAA server | ~30 min provisioning + $ per PR | Too slow / too expensive
Per-PR env on shared Airflow via DAG versioning | <1 min, no new compute | This pattern

The two halves

Compute-side isolation (pipeline env):

  • One zip per branch, deployed to the shared scheduler's dags/ folder.
  • DAG id rewritten at init from {team}.{rest} to {team}.{branch}.{rest}.
  • Branch name also appended as a tag so the UI can filter by env.
  • An unpackaged copy deployed beside the zip and added to template_searchpath so Jinja-templated file reads still work.
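
A minimal sketch of the rewrite the forked class performs, shown here as a subclass for readability; how the fork discovers the branch name is not spelled out in the source, so inferring it from the zip path of the defining file, and the /opt/airflow/dags/<branch>_files location for the unpackaged copy, are assumptions.

```python
import inspect
import re
from airflow.models.dag import DAG as BaseDAG

# A DAG file parsed from /opt/airflow/dags/feature1.zip/test_dag.py belongs to
# branch "feature1"; a file outside any zip belongs to the main environment.
_ZIP_RE = re.compile(r"([^/]+)\.zip")

def _branch_of_caller():
    caller_file = inspect.stack()[2].filename  # the file defining the DAG
    match = _ZIP_RE.search(caller_file)
    return match.group(1) if match else None

class DAG(BaseDAG):
    """Stand-in for the forked airflow.models.DAG: rewrites the id at init."""

    def __init__(self, dag_id: str, **kwargs):
        branch = _branch_of_caller()
        if branch:
            # assumes team-prefixed ids, e.g. "qu.test_dag" (see Tradeoffs)
            team, rest = dag_id.split(".", 1)
            dag_id = f"{team}.{branch}.{rest}"   # -> "qu.feature1.test_dag"
            # tag with the branch so the UI can filter by environment
            kwargs["tags"] = list(kwargs.get("tags") or []) + [branch]
            # point Jinja at the unpackaged copy deployed beside the zip
            search = kwargs.get("template_searchpath") or []
            if isinstance(search, str):
                search = [search]
            kwargs["template_searchpath"] = search + [f"/opt/airflow/dags/{branch}_files"]
        super().__init__(dag_id=dag_id, **kwargs)
```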

Data-side isolation (data env):

  • Per-branch suffixed Spark/Hive databases (db_attribution_feature1).
  • Read tables populated via CREATE VIEW … AS SELECT * FROM _test — no data motion.
  • Output tables get real tables, optionally seeded with a partition range.

1-to-1 binding: pipeline env feature1 reads/writes data env feature1. Isolating only one half re-creates the contention.
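
A sketch of creating the matching data env with Spark SQL from Python; the db_attribution_test source database, the table lists, and the seed_filter predicate are illustrative, only the views-for-reads / real-tables-for-outputs split comes from the pattern.

```python
from pyspark.sql import SparkSession

def create_data_env(spark: SparkSession, branch: str,
                    read_tables: list, output_tables: list,
                    seed_filter: str = "") -> None:
    env_db = f"db_attribution_{branch}"          # e.g. db_attribution_feature1
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {env_db}")

    # Read side: views over the shared test copies, so no data is moved.
    for t in read_tables:
        spark.sql(
            f"CREATE OR REPLACE VIEW {env_db}.{t} "
            f"AS SELECT * FROM db_attribution_test.{t}"
        )

    # Write side: real (initially empty) tables the branch pipeline can overwrite,
    # optionally seeded with a bounded partition range.
    for t in output_tables:
        spark.sql(
            f"CREATE TABLE {env_db}.{t} "
            f"AS SELECT * FROM db_attribution_test.{t} WHERE 1 = 0"
        )
        if seed_filter:
            spark.sql(
                f"INSERT INTO {env_db}.{t} "
                f"SELECT * FROM db_attribution_test.{t} WHERE {seed_filter}"
            )
```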

Cleanup

A separate cron polls GitHub PR status. When a PR closes, the cron:

  1. Deletes feature1.zip from the dags/ folder.
  2. Deletes the unpackaged directory.
  3. Queries the Airflow metastore for all DAGs tagged feature1 and deletes them via the Airflow CLI.

Cleanup is intentionally out-of-band from the DAG class fork — lifecycle logic doesn't belong in a library monkey-patch.
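
A sketch of what the cleanup job could run once it sees a closed PR (the GitHub polling itself is omitted); the dags/ path, metastore URL, and the dag_tag lookup are assumptions, while the final step shells out to the standard airflow dags delete CLI command.

```python
import shutil
import subprocess
from pathlib import Path
from sqlalchemy import create_engine, text

DAGS_FOLDER = Path("/opt/airflow/dags")
METASTORE_URL = "postgresql://airflow:***@metastore/airflow"  # placeholder

def cleanup_branch_env(branch: str) -> None:
    # 1. Delete the branch zip from the shared dags/ folder.
    (DAGS_FOLDER / f"{branch}.zip").unlink(missing_ok=True)

    # 2. Delete the unpackaged copy deployed beside it.
    shutil.rmtree(DAGS_FOLDER / f"{branch}_files", ignore_errors=True)

    # 3. Find every DAG tagged with the branch and remove it via the Airflow CLI,
    #    which clears its rows (runs, task instances, tags, ...) from the metastore.
    engine = create_engine(METASTORE_URL)
    with engine.connect() as conn:
        rows = conn.execute(
            text("SELECT dag_id FROM dag_tag WHERE name = :tag"), {"tag": branch}
        )
        for (dag_id,) in rows:
            subprocess.run(["airflow", "dags", "delete", dag_id, "--yes"], check=True)
```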

Tradeoffs

  • Cheap isolation — no new servers, <1 min creation.
  • Per-branch dependency isolation — zips don't share Python dependencies with each other or the scheduler.
  • Arbitrary parallel PRs — bounded only by scheduler parse load, not by server count.
  • Airflow library fork is a maintenance tax — every Airflow upgrade requires re-applying the dag.py patch.
  • Jinja file-template gotcha — requires the unpackaged-copy workaround.
  • Requires team-prefixed DAG id schema — flat namespaces need a different rewrite rule.
  • Scheduler parse load grows with PR count — bounded by the single-server throughput.
  • Metastore bloat — thousands of per-PR DAG rows if cleanup cron falls behind.

Seen in

sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning