
PATTERN

Config-driven DAG generation

Definition

Config-driven DAG generation is the platform pattern in which customers declare a pipeline with a high-level configuration file (JSON/YAML/SQL + metadata) and a platform-owned codegen step compiles the configuration into an orchestrator DAG (most often Airflow) that runs as a production pipeline. Customers never write DAG code themselves; the platform owns the DAG boilerplate — monitoring, data-quality gates, alerting, metadata tagging, retry policies — and the customer owns only feature- or dataset-specific logic (the query, the schedule, the ownership metadata).
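A minimal config of the kind described might look like the following. The field names are illustrative, not any specific platform's schema; the point is that everything absent from this file (retries, alerting, quality gates, catalog tagging) is supplied by the codegen, not the customer.

```json
{
  "feature_name": "rider_trip_count_7d",
  "owner": "growth-team",
  "schedule": "@daily",
  "urgency_tier": "tier2",
  "version": 3,
  "value_type": "int64",
  "sql_path": "features/rider_trip_count_7d.sql",
  "lineage": {"upstream_tables": ["core.trips"]}
}
```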

Why the pattern exists

Directly hand-writing Airflow (or equivalent) DAGs for every pipeline in a large org fails on three axes:

  1. Quality drift — different authors catch different error cases; some pipelines have data-quality gates, some don't; some tag the catalog, some don't.
  2. Upgrade cost — changing DAG-level defaults (a new SLO metric, a new lineage hook, a new secret-fetch pattern) requires updating N hand-written DAGs, not updating one codegen template.
  3. User-interface bloat — the average feature author doesn't know Airflow and doesn't want to; asking them to write Python operators slows iteration.

Codegen inverts this: the platform's Airflow expertise becomes a compile target, not a tax on every pipeline author.

Typical shape

  1. The customer writes two files:
       • a query / transformation (SQL, Spark SQL, Python function);
       • a JSON config declaring feature name, owner, schedule, urgency tier, versioning, data types, and lineage hints.
  2. A platform-owned cron / reconciler reads the configs from a known location (git repo, S3 prefix, config service).
  3. For each config, it generates a fully formed DAG that is production-ready out of the box: it executes the query, runs integrated data-quality checks, writes to the required sinks, publishes metadata to a discovery catalog, and emits metrics.
  4. Generated DAGs are deployed to the orchestrator (Airflow scheduler / Astronomer / managed service).
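The steps above can be sketched as a toy codegen: a platform-owned template plus a per-config substitution. All names here (the config fields, the `feature__` DAG-id prefix, the `spark-submit` wrapper) are assumptions for illustration, and a real template would also wire in quality gates, alerting, and catalog publishing.

```python
import json
from string import Template

# Platform-owned template: every generated DAG inherits the same
# boilerplate (retries, owner tagging, scheduling) for free.
DAG_TEMPLATE = Template('''\
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="feature__$name",
    schedule="$schedule",
    start_date=datetime(2024, 1, 1),
    default_args={"owner": "$owner", "retries": 3},
    tags=["generated", "$tier"],
) as dag:
    run_query = BashOperator(
        task_id="run_query",
        bash_command="spark-submit run_sql.py --sql $sql_path",
    )
''')

def generate_dag(config: dict) -> str:
    """Compile one customer config into Airflow DAG source code."""
    return DAG_TEMPLATE.substitute(
        name=config["feature_name"],
        schedule=config["schedule"],
        owner=config["owner"],
        tier=config["urgency_tier"],
        sql_path=config["sql_path"],
    )

# One customer config in, one production-shaped DAG file out.
config = json.loads('''{
    "feature_name": "rider_trip_count_7d",
    "schedule": "@daily",
    "owner": "growth-team",
    "urgency_tier": "tier2",
    "sql_path": "features/rider_trip_count_7d.sql"
}''')
dag_source = generate_dag(config)
print(dag_source)
```

The reconciler would loop this over every config it finds and write each result into the orchestrator's DAGs folder; upgrading the template upgrades every pipeline at the next generation pass.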

Lyft Feature Store instance

Named example from the 2026-01-06 Lyft Feature Store post:

  • Input: a SparkSQL query + JSON metadata config.
  • Codegen: "A Python cron service reads these configurations and automatically generates an Astronomer-hosted Airflow Directed Acyclic Graph (DAG). Crucially, these generated DAGs are production-ready out-of-the-box."
  • Generated-DAG responsibilities:
       • Executing the SparkSQL query to compute feature data.
       • Storing feature data to both the offline (Hive) and online (dsfeatures) data paths.
       • Running integrated data-quality checks.
       • Tagging Amundsen metadata so features are discoverable.
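The responsibilities above imply a fixed task ordering inside every generated DAG. A toy sketch of that ordering follows; the task names are hypothetical (not Lyft's actual operator names), and placing the quality gates after the writes is an assumption based on the order the post lists them in.

```python
# Hypothetical task graph mirroring the generated-DAG responsibilities:
# each key is a task, each value lists the tasks it depends on.
TASKS = {
    "run_sparksql":     [],                                  # compute feature data
    "write_hive":       ["run_sparksql"],                    # offline path
    "write_dsfeatures": ["run_sparksql"],                    # online path
    "quality_checks":   ["write_hive", "write_dsfeatures"],  # integrated DQ gates
    "tag_amundsen":     ["quality_checks"],                  # discovery metadata
}

def topo_order(tasks: dict) -> list:
    """Return one valid execution order (Kahn-style, small acyclic graphs)."""
    order, done = [], set()
    while len(done) < len(tasks):
        for task, deps in tasks.items():
            if task not in done and all(d in done for d in deps):
                order.append(task)
                done.add(task)
    return order

order = topo_order(TASKS)
print(order)
# → ['run_sparksql', 'write_hive', 'write_dsfeatures', 'quality_checks', 'tag_amundsen']
```

Because the graph is generated, every feature pipeline gets this identical shape; the only per-pipeline variation is the SparkSQL query and the metadata.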

Complementary development UX: Lyft's homegrown Kyte environment (an Airflow-local CLI) lets developers validate configs, test SQL runs, execute DAGs locally, and backfill historical dates before committing — so the codegen's quality tier is matched by a local iteration loop.

When to use it

  • Large org with many similar pipelines (hundreds of features, hundreds of ELT jobs, hundreds of model training runs).
  • The pipelines share a strong shape — same sinks, same quality gates, same metadata schema, same scheduling granularity.
  • The platform team can afford the codegen investment (the codegen is itself production-grade code that must keep up with Airflow-version changes, store-API changes, data-quality-gate evolution).

When not to use it

  • Pipelines have heterogeneous shape — each has its own operators, its own retries, its own alerts. Codegen becomes a straitjacket.
  • Only a handful of pipelines; the codegen investment is not amortised.
  • The DAG-author population is small and expert enough to hand-write DAGs productively.

Seen in
