PATTERN
Library fork for DAG id rewrite¶
Pattern¶
When an orchestrator library has no hook point for environment awareness but you need it, fork the library's workflow-definition class (at Zalando: airflow.models.DAG) and rewrite its identifier during __init__, injecting environment metadata derived from the deployment filesystem.
The mechanism at Zalando (from sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning):
class DAG():
    def __init__(..., dag_id, ...):
        # e.g. /usr/local/airflow/dags/feature1.zip/qu/main/file.py
        file_path = get_path_of_file_which_initialized_dag()
        # Extract the environment name from the zip filename
        feature_name = get_zip_file_name(file_path)  # "feature1"
        # Rewrite: qu.test_dag -> qu.feature1.test_dag
        dag_id = f"{team_name}.{feature_name}.{dag_id.split(f'{team_name}.')[1]}"
        tags.append(feature_name)
        # Add the unpacked sibling dir to template_searchpath so Jinja lookups still work
        feature_dir_path = get_feature_dir_path(file_path)
        template_searchpath.add(feature_dir_path)
The filesystem path is the metadata channel: the zip filename carries the environment name, and the DAG.__init__ override reads it.
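The source doesn't show the helpers' implementations; a minimal sketch of what the two path helpers could look like for the layout in the comment above (get_path_of_file_which_initialized_dag would presumably come from inspecting the call stack; both bodies below are assumptions, not Zalando's code):

```python
from pathlib import Path

# Hypothetical implementations of the path helpers the snippet leaves undefined.
def get_zip_file_name(file_path: str) -> str:
    """'/usr/local/airflow/dags/feature1.zip/qu/main/file.py' -> 'feature1'."""
    for part in Path(file_path).parts:
        if part.endswith(".zip"):
            return part[: -len(".zip")]
    raise ValueError(f"not a packaged DAG path: {file_path}")

def get_feature_dir_path(file_path: str) -> str:
    """Unpacked sibling dir next to the zip, e.g. '/usr/local/airflow/dags/feature1'."""
    for parent in Path(file_path).parents:
        if parent.name.endswith(".zip"):
            return str(parent.with_name(parent.name[: -len(".zip")]))
    raise ValueError(f"not a packaged DAG path: {file_path}")
```

Net effect: a DAG declared as DAG(dag_id="qu.test_dag") inside feature1.zip registers as qu.feature1.test_dag.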
Why a fork and not a wrapper¶
Teams can't be required to import a ZalandoDAG class: DAGs are written by the owning teams, so the fork has to be transparent. Any DAG that does from airflow import DAG gets the rewritten behaviour for free.
A wrapper class would need every team's cooperation and would break as soon as someone imported the upstream DAG directly.
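For contrast, a sketch of the wrapper approach being rejected (ZalandoDAG and rewrite_with_feature_name are hypothetical names, not from the source): every team would have to change its import, and any DAG file that kept the upstream import would silently skip the rewrite.

```python
from airflow.models.dag import DAG as UpstreamDAG

def rewrite_with_feature_name(dag_id: str) -> str:
    ...  # same path-derived rewrite as in the fork

class ZalandoDAG(UpstreamDAG):
    """Opt-in wrapper: only DAGs that import this class get the id rewrite."""
    def __init__(self, dag_id, **kwargs):
        super().__init__(dag_id=rewrite_with_feature_name(dag_id), **kwargs)

# Requires every team to write:
#   from zalando_airflow import ZalandoDAG
# whereas `from airflow import DAG` bypasses the rewrite entirely.
```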
Cost¶
- Maintenance tax — every Airflow upgrade requires re-applying or re-validating the dag.py patch.
- Brittle — the __init__ signature and path-resolution helpers may shift between Airflow versions.
- No upstream fix — unlike the usual upstream-the-fix playbook, the notion of "environment" probably isn't something Airflow core wants to adopt, so this is a permanent local fork.
When it's the right call¶
- You control the deployment (so you can guarantee the zip/path layout the fork depends on).
- The abstraction is a deployment-level concept upstream is unlikely to take (per-PR envs, multi-tenant per-customer isolation on a shared scheduler, etc.).
- You accept the maintenance tax as the cost of solving an upstream-missing abstraction.
If any of those three doesn't hold, try a wrapper or an import-time monkey-patch first.
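A minimal sketch of that monkey-patch alternative, assuming the deployment can guarantee this module is imported before Airflow parses any DAG files (module name and rewrite_with_feature_name helper are hypothetical):

```python
# dag_id_rewrite.py: imported once, before any DAG files are parsed.
from airflow.models.dag import DAG

from dag_rewrite_helpers import rewrite_with_feature_name  # hypothetical, same rewrite as the fork

_original_init = DAG.__init__

def _patched_init(self, dag_id, *args, **kwargs):
    # Wrap the upstream class in place: `from airflow import DAG` still picks this up,
    # but there is no dag.py copy to re-apply on every Airflow upgrade.
    _original_init(self, rewrite_with_feature_name(dag_id), *args, **kwargs)

DAG.__init__ = _patched_init
```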
Related¶
- patterns/per-pr-airflow-environment-via-dag-versioning — the enclosing pattern.
- patterns/upstream-the-fix — the contrast: when a fork should be upstreamed. DAG id rewriting for per-deployment envs probably shouldn't, so it stays a local fork.
- concepts/dag-id-rewriting
- concepts/pipeline-environment
- systems/apache-airflow
Seen in¶
- sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning — Zalando's Performance Marketing org maintains a dag.py fork to enable per-PR pipeline environments.