Skip to content

PATTERN Cited by 1 source

Library fork for DAG id rewrite

Pattern

When an orchestrator library has no hook point for environment awareness but you need it, fork the library's workflow-definition class (at Zalando: airflow.models.DAG) and rewrite its identifier during __init__, injecting environment metadata derived from the deployment filesystem.

The mechanism at Zalando (from sources/2022-06-09-zalando-accelerate-testing-in-apache-airflow-through-dag-versioning):

class DAG():
    def __init__(..., dag_id, ...):
        # e.g. /usr/local/airflow/dags/feature1.zip/qu/main/file.py
        file_path = get_path_of_file_which_initialized_dag()

        # Extract env name from zip filename
        feature_name = get_zip_file_name(file_path)        # "feature1"

        # Rewrite:  qu.test_dag  ->  qu.feature1.test_dag
        dag_id = "{team_name}.{feature_name}.{dag_id.split('{team_name}.')[1]}"
        tags.append(feature_name)

        # Add unpackaged sibling dir to template_searchpath so Jinja works
        feature_dir_path = get_feature_dir_path(file_path)
        template_searchpath.add(feature_dir_path)

The filesystem path is the metadata channel: the zip filename carries the environment name, and the DAG.__init__ override reads it.

Why a fork and not a wrapper

Teams can't be required to import a ZalandoDAG class — DAGs are written by the owning teams, the fork has to be transparent. Any DAG that does from airflow import DAG gets the rewritten behaviour for free.

A wrapper class would need every team's cooperation and would break as soon as someone imported the upstream DAG directly.

Cost

  • Maintenance tax — every Airflow upgrade requires re-applying or re-validating the dag.py patch.
  • Brittle — the __init__ signature / path-resolution helpers may shift between Airflow versions.
  • No upstream fix — unlike the usual upstream-the-fix playbook, the notion of "environment" probably isn't something Airflow core wants to adopt, so this is a permanent local fork.

When it's the right call

  • You control the deployment (so you can guarantee the zip/path layout the fork depends on).
  • The abstraction is a deployment-level concept upstream is unlikely to take (per-PR envs, multi-tenant per-customer isolation on a shared scheduler, etc.).
  • You accept the maintenance tax as the cost of solving an upstream-missing abstraction.

If any of those three don't hold, try a wrapper or a runtime import-time monkey-patch first.

Seen in

Last updated · 550 distilled / 1,221 read