
PATTERN

Config-driven DAG generation

Definition

Config-driven DAG generation is the platform pattern in which customers declare a pipeline with a high-level configuration file (JSON/YAML/SQL + metadata) and a platform-owned codegen step compiles the configuration into an orchestrator DAG (most often Airflow) that runs as a production pipeline. Customers never write DAG code themselves; the platform owns the DAG boilerplate — monitoring, data-quality gates, alerting, metadata tagging, retry policies — and the customer owns only feature- or dataset-specific logic (the query, the schedule, the ownership metadata).
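A minimal config of the kind described might look like the following. The field names are illustrative, not any specific platform's schema; the point is that everything absent from this file (retries, alerting, quality gates, catalog tagging) is supplied by the codegen, not the customer.

```json
{
  "feature_name": "rider_trip_count_7d",
  "owner": "growth-team",
  "schedule": "@daily",
  "urgency_tier": "tier2",
  "version": 3,
  "value_type": "int64",
  "sql_path": "features/rider_trip_count_7d.sql",
  "lineage": {"upstream_tables": ["core.trips"]}
}
```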

Why the pattern exists

Directly hand-writing Airflow (or equivalent) DAGs for every pipeline in a large org fails on three axes:

  1. Quality drift — different authors catch different error cases; some pipelines have data-quality gates, some don't; some tag the catalog, some don't.
  2. Upgrade cost — changing DAG-level defaults (a new SLO metric, a new lineage hook, a new secret-fetch pattern) requires updating N hand-written DAGs, not updating one codegen template.
  3. User-interface bloat — the average feature author doesn't know Airflow and doesn't want to; asking them to write Python operators slows iteration.

Codegen inverts this: the platform's Airflow expertise becomes a compile target, not a tax on every pipeline author.

Typical shape

  1. The customer writes two files:
       • a query / transformation (SQL, Spark SQL, Python function);
       • a JSON config declaring feature name, owner, schedule, urgency tier, versioning, data types, and lineage hints.
  2. A platform-owned cron / reconciler reads the configs from a known location (git repo, S3 prefix, config service).
  3. For each config, it generates a fully formed DAG that is production-ready out of the box: it executes the query, runs integrated data-quality checks, writes to the required sinks, publishes metadata to a discovery catalog, and emits metrics.
  4. Generated DAGs are deployed to the orchestrator (Airflow scheduler / Astronomer / managed service).
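The steps above can be sketched as a toy codegen: a platform-owned template plus a per-config substitution. All names here (the config fields, the `feature__` DAG-id prefix, the `spark-submit` wrapper) are assumptions for illustration, and a real template would also wire in quality gates, alerting, and catalog publishing.

```python
import json
from string import Template

# Platform-owned template: every generated DAG inherits the same
# boilerplate (retries, owner tagging, scheduling) for free.
DAG_TEMPLATE = Template('''\
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="feature__$name",
    schedule="$schedule",
    start_date=datetime(2024, 1, 1),
    default_args={"owner": "$owner", "retries": 3},
    tags=["generated", "$tier"],
) as dag:
    run_query = BashOperator(
        task_id="run_query",
        bash_command="spark-submit run_sql.py --sql $sql_path",
    )
''')

def generate_dag(config: dict) -> str:
    """Compile one customer config into Airflow DAG source code."""
    return DAG_TEMPLATE.substitute(
        name=config["feature_name"],
        schedule=config["schedule"],
        owner=config["owner"],
        tier=config["urgency_tier"],
        sql_path=config["sql_path"],
    )

# One customer config in, one production-shaped DAG file out.
config = json.loads('''{
    "feature_name": "rider_trip_count_7d",
    "schedule": "@daily",
    "owner": "growth-team",
    "urgency_tier": "tier2",
    "sql_path": "features/rider_trip_count_7d.sql"
}''')
dag_source = generate_dag(config)
print(dag_source)
```

The reconciler would loop this over every config it finds and write each result into the orchestrator's DAGs folder; upgrading the template upgrades every pipeline at the next generation pass.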

Lyft Feature Store instance

Named example from the 2026-01-06 Lyft Feature Store post:

  • Input: a SparkSQL query + JSON metadata config.
  • Codegen: "A Python cron service reads these configurations and automatically generates an Astronomer-hosted Airflow Directed Acyclic Graph (DAG). Crucially, these generated DAGs are production-ready out-of-the-box."
  • Generated-DAG responsibilities:
       • Executing the SparkSQL query to compute feature data.
       • Storing feature data to both the offline (Hive) and online (dsfeatures) data paths.
       • Running integrated data-quality checks.
       • Tagging Amundsen metadata so features are discoverable.
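The responsibilities above imply a fixed task ordering inside every generated DAG. A toy sketch of that ordering follows; the task names are hypothetical (not Lyft's actual operator names), and placing the quality gates after the writes is an assumption based on the order the post lists them in.

```python
# Hypothetical task graph mirroring the generated-DAG responsibilities:
# each key is a task, each value lists the tasks it depends on.
TASKS = {
    "run_sparksql":     [],                                  # compute feature data
    "write_hive":       ["run_sparksql"],                    # offline path
    "write_dsfeatures": ["run_sparksql"],                    # online path
    "quality_checks":   ["write_hive", "write_dsfeatures"],  # integrated DQ gates
    "tag_amundsen":     ["quality_checks"],                  # discovery metadata
}

def topo_order(tasks: dict) -> list:
    """Return one valid execution order (Kahn-style, small acyclic graphs)."""
    order, done = [], set()
    while len(done) < len(tasks):
        for task, deps in tasks.items():
            if task not in done and all(d in done for d in deps):
                order.append(task)
                done.add(task)
    return order

order = topo_order(TASKS)
print(order)
# → ['run_sparksql', 'write_hive', 'write_dsfeatures', 'quality_checks', 'tag_amundsen']
```

Because the graph is generated, every feature pipeline gets this identical shape; the only per-pipeline variation is the SparkSQL query and the metadata.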

Complementary development UX: Lyft's homegrown Kyte environment (an Airflow-local CLI) lets developers validate configs, test SQL runs, execute DAGs locally, and backfill historical dates before committing — so the codegen's quality tier is matched by a local iteration loop.

When to use it

  • Large org with many similar pipelines (hundreds of features, hundreds of ELT jobs, hundreds of model training runs).
  • The pipelines share a strong shape — same sinks, same quality gates, same metadata schema, same scheduling granularity.
  • The platform team can afford the codegen investment (the codegen is itself production-grade code that must keep up with Airflow-version changes, store-API changes, data-quality-gate evolution).

When not to use it

  • Pipelines have heterogeneous shape — each has its own operators, its own retries, its own alerts. Codegen becomes a straitjacket.
  • Only a handful of pipelines; the codegen investment is not amortised.
  • The DAG-author population is small and expert enough to hand-write DAGs productively.

Seen in
