PATTERN
Config-driven DAG generation¶
Definition¶
Config-driven DAG generation is the platform pattern in which customers declare a pipeline with a high-level configuration file (JSON/YAML/SQL + metadata) and a platform-owned codegen step compiles the configuration into an orchestrator DAG (most often Airflow) that runs as a production pipeline. Customers never write DAG code themselves; the platform owns the DAG boilerplate — monitoring, data-quality gates, alerting, metadata tagging, retry policies — and the customer owns only feature- or dataset-specific logic (the query, the schedule, the ownership metadata).
Why the pattern exists¶
Directly hand-writing Airflow (or equivalent) DAGs for every pipeline in a large org fails on three axes:
- Quality drift — different authors catch different error cases; some pipelines have data-quality gates, some don't; some tag the catalog, some don't.
- Upgrade cost — changing DAG-level defaults (a new SLO metric, a new lineage hook, a new secret-fetch pattern) requires updating N hand-written DAGs, not updating one codegen template.
- Authoring friction — the average feature author doesn't know Airflow and doesn't want to; asking them to hand-write Python operators slows iteration.
Codegen inverts this: the platform's Airflow expertise becomes a compile target, not a tax on every pipeline author.
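The upgrade-cost argument is easiest to see in code: when every DAG is rendered from one platform-owned template, a default such as the retry policy changes in exactly one place and propagates to all pipelines on the next codegen run. A minimal sketch (the `DAG_TEMPLATE` string, config field names, and `feature__` naming are illustrative assumptions, not any specific platform's schema):

```python
# Platform-owned template: all DAG boilerplate lives here once.
# Changing a default (e.g. retries) updates every generated DAG
# at the next codegen run, rather than N hand-edited files.
DAG_TEMPLATE = """\
from datetime import timedelta
from airflow import DAG

dag = DAG(
    dag_id={dag_id!r},
    schedule={schedule!r},
    default_args={{
        "owner": {owner!r},
        "retries": 3,                       # platform default
        "retry_delay": timedelta(minutes=5),
    }},
)
"""

def render_dag(config: dict) -> str:
    """Compile one customer config into Airflow DAG source code."""
    return DAG_TEMPLATE.format(
        dag_id=f"feature__{config['feature_name']}",
        schedule=config["schedule"],
        owner=config["owner"],
    )

print(render_dag({
    "feature_name": "rider_session_count",
    "owner": "growth-team",
    "schedule": "@daily",
}))
```

The customer-facing surface here is only the three config fields; everything else is the platform's compile target.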
Typical shape¶
- Customer writes two files:
- A query / transformation (SQL, Spark SQL, Python fn).
- A JSON config declaring: feature name, owner, schedule, urgency tier, versioning, data types, lineage hints.
- A platform-owned cron / reconciler reads the configs from a known location (git repo, S3 prefix, config service).
- For each config, it generates a fully-formed DAG — the DAG is production-ready out of the box: executes the query, integrates data-quality checks, writes to required sinks, publishes metadata to a discovery catalog, emits metrics.
- Generated DAGs are deployed to the orchestrator (Airflow scheduler / Astronomer / managed service).
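The reconciler step above can be sketched as a small loop: read every JSON config from a known location, reject configs missing required fields, and write a generated DAG file per config. The directory layout, required-key set, and the stub `render_dag` helper are illustrative assumptions; a real codegen would also emit monitoring, quality gates, and metadata hooks.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"feature_name", "owner", "schedule", "query"}

def render_dag(config: dict) -> str:
    # Stand-in for the platform's real template (see sketch above
    # for a fuller version with retries and default_args).
    return (
        "from airflow import DAG\n"
        f"dag = DAG(dag_id={config['feature_name']!r}, "
        f"schedule={config['schedule']!r})\n"
    )

def reconcile(config_dir: Path, dag_dir: Path) -> list[Path]:
    """Compile every *.json config in config_dir into a DAG file."""
    written = []
    for cfg_path in sorted(config_dir.glob("*.json")):
        config = json.loads(cfg_path.read_text())
        missing = REQUIRED_KEYS - config.keys()
        if missing:
            raise ValueError(f"{cfg_path.name}: missing {sorted(missing)}")
        out = dag_dir / f"{config['feature_name']}_dag.py"
        out.write_text(render_dag(config))
        written.append(out)
    return written
```

In production this runs on a cron or reconciler cadence; a deleted config should also trigger DAG teardown, which the sketch omits.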
Lyft Feature Store instance¶
Named example from the 2026-01-06 Lyft Feature Store post:
- Input: a SparkSQL query + JSON metadata config.
- Codegen: "A Python cron service reads these configurations and automatically generates an Astronomer-hosted Airflow Directed Acyclic Graph (DAG). Crucially, these generated DAGs are production-ready out-of-the-box."
- Generated-DAG responsibilities:
- Executing the SparkSQL query to compute feature data.
- Storing feature data to both offline (Hive) and online (dsfeatures) data paths.
- Running integrated data-quality checks.
- Enabling feature discovery by tagging metadata in Amundsen.
Complementary development UX: Lyft's homegrown Kyte environment (an Airflow-local CLI) lets developers validate configs, test SQL runs, execute DAGs locally, and backfill historical dates before committing — so the codegen's quality tier is matched by a local iteration loop.
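A local pre-commit check in the spirit of Kyte's config validation might look like the following. Kyte itself is Lyft-internal, so this validator, its field names, and the allowed-schedule rule are hypothetical stand-ins for whatever a real CLI would enforce:

```python
def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems (empty == valid)."""
    errors = []
    for key in ("feature_name", "owner", "schedule", "query"):
        if not config.get(key):
            errors.append(f"missing or empty field: {key}")
    name = config.get("feature_name", "")
    if name and not name.replace("_", "").isalnum():
        errors.append("feature_name must be alphanumeric/underscore")
    # Hypothetical rule: only a few schedule granularities allowed.
    if config.get("schedule") not in {"@hourly", "@daily", "@weekly"}:
        errors.append("schedule must be one of @hourly/@daily/@weekly")
    return errors

# A well-formed config passes with no errors.
assert validate_config({
    "feature_name": "rider_trips_7d",
    "owner": "ml-platform",
    "schedule": "@daily",
    "query": "SELECT ...",
}) == []
```

Catching these errors locally, before the reconciler ever sees the config, is what makes the codegen path feel fast rather than opaque to feature authors.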
When to use it¶
- Large org with many similar pipelines (hundreds of features, hundreds of ELT jobs, hundreds of model training runs).
- The pipelines share a strong shape — same sinks, same quality gates, same metadata schema, same scheduling granularity.
- The platform team can afford the codegen investment (the codegen is itself production-grade code that must keep up with Airflow-version changes, store-API changes, data-quality-gate evolution).
When not to use it¶
- Pipelines have heterogeneous shape — each has its own operators, its own retries, its own alerts. Codegen becomes a straitjacket.
- Only a handful of pipelines; the codegen investment is not amortised.
- The DAG-author population is small and expert enough to hand-write DAGs productively.
Related¶
- systems/apache-airflow
- systems/lyft-feature-store
- concepts/feature-store
- concepts/feature-discoverability
- patterns/hybrid-batch-streaming-ingestion — codegen is the delivery mechanism for the batch lane in a hybrid ingestion design.
Seen in¶
- sources/2026-01-06-lyft-feature-store-architecture-optimization-and-evolution — canonical named instance: SparkSQL + JSON → Python-cron-generated Astronomer Airflow DAGs with built-in data-quality checks + Amundsen tagging.