YAML-Declared Feature DAG with Topology Inferred¶
Problem¶
ETL pipelines need configuration that:
- Lists the units of work (features, tasks, nodes) in the pipeline.
- Declares the dependency structure (which unit depends on which).
- Is human-readable and reviewable.
Naive DAG configs require both nodes and edges to be declared:
```yaml
# Naive approach — edges explicit
features:
  feature_a:
    class: ...
    depends_on: []
  feature_b:
    class: ...
    depends_on: [feature_a]
  feature_c:
    class: ...
    depends_on: [feature_a, feature_b]
```
This creates redundancy:
- Dependency duplication — the feature's class already knows its dependencies (via imports, constructor calls, or source declarations). Writing them again in YAML means two places of truth that can drift.
- Error surface — mismatches between the YAML `depends_on` entries and the actual code dependencies fail silently.
- Maintenance cost — adding a new dependency requires editing two places: the feature's code and its YAML entry.
Pattern¶
Declare only the nodes in YAML. Let the runtime infer the edges from each feature's in-code declaration of its dependencies.
The YAML is a flat list of feature aliases + their class paths, plus a terminal publish declaration:
```yaml
features:
  feature1_alias:
    class: path.to.my.Feature1Class
  feature2_alias:
    class: path.to.my.Feature2Class
  feature3_alias:
    class: path.to.my.Feature3Class
  feature4_alias:
    class: path.to.my.Feature4Class

publish:
  s3:
    - feature4_alias:
        path: s3a://bucket/path/to/desired/location
        overwrite: True
```
The framework:
- Loads each feature class.
- Inspects the class's internal dependency declaration (e.g. Yelp's `SparkFeature.sources` dict).
- Topologically sorts all features by their declared dependencies.
- Executes in dependency order, passing each feature's output as a kwarg to downstream consumers.
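The four steps above can be sketched in a few lines. This is an illustrative assumption, not Yelp's actual spark-etl code: the `sources` dict, the `transform()` method, and the run loop are all modeled on the post's description, and the classes stand in for the ones the YAML would name by path.

```python
# Hypothetical sketch: nodes come from YAML, edges come from each class's
# own `sources` declaration. `sources` / `transform` are assumed names.
from graphlib import TopologicalSorter


class Feature1:
    sources = {}                       # no upstream features

    def transform(self, **inputs):
        return "f1-output"


class Feature2:
    sources = {"feature1": Feature1}   # edge declared in code, not YAML

    def transform(self, feature1, **inputs):
        return f"f2({feature1})"


def run_pipeline(features):
    """features: alias -> class, as parsed from the nodes-only YAML."""
    class_to_alias = {cls: alias for alias, cls in features.items()}
    # Build alias -> {predecessor aliases} by reading each class's sources.
    graph = {
        alias: {class_to_alias[dep] for dep in cls.sources.values()}
        for alias, cls in features.items()
    }
    outputs = {}
    # static_order() yields each node after all of its predecessors.
    for alias in TopologicalSorter(graph).static_order():
        cls = features[alias]
        # Pass each upstream feature's output as a kwarg named after
        # the key in the consumer's sources dict.
        kwargs = {name: outputs[class_to_alias[dep]]
                  for name, dep in cls.sources.items()}
        outputs[alias] = cls().transform(**kwargs)
    return outputs


results = run_pipeline({"feature1": Feature1, "feature2": Feature2})
print(results["feature2"])  # → f2(f1-output)
```

Note that the execution order was never written down anywhere: it falls out of the `sources` declarations at runtime, which is exactly the pattern's point.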
The Yelp canon¶
From the 2025-02-19 Revenue Data Pipeline post, verbatim framing:
"The dependency relationship is handled by a user defined yaml file which contains all the related Spark features. There is no need to draw a complex diagram of dependency relationships in the yaml file. At runtime, spark-etl figures out the execution sequence according to topology."
Yelp's post shows a DAG where feature4 depends on feature2 + feature3, feature3 depends on feature1 + feature2, feature2 depends on feature1 — but the YAML config only lists the four features. The edges are recovered from each feature's code.
Why this works¶
The key insight: dependencies are a code-level fact, not a config-level fact. A feature's `transform()` method literally cannot run without its inputs, and those inputs are already enumerated in the feature's class (via a `sources` dict or equivalent). Re-declaring them in YAML adds no information that isn't already available from the code.
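As a concrete illustration, a minimal feature class might declare its inputs like this. The `sources` and `transform` names are assumptions modeled on the post's description of `SparkFeature`, not Yelp's actual spark-etl API:

```python
# Hypothetical feature class: the edges live here, in code.
# The YAML only lists this class's alias and import path.

class Feature3:
    # kwarg name -> upstream feature (illustrated here as class paths);
    # adding a dependency means one line here and zero YAML edits
    sources = {
        "feature1": "path.to.my.Feature1Class",
        "feature2": "path.to.my.Feature2Class",
    }

    def transform(self, feature1, feature2):
        # inputs arrive as kwargs named after the `sources` keys
        return {"joined": (feature1, feature2)}
```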
The YAML's job is limited to:
- Selection — which features should run in this pipeline (vs other pipelines using the same feature library).
- Terminal publication — which feature outputs should be written to the sink.
- Runtime parameters — date ranges, checkpoint lists, output paths.
Edges aren't selection, publication, or parameters. They belong in the code.
Benefits¶
- DRY config — one source of truth (the class's dependency declaration), never two.
- One-line feature addition — a new feature requires one new YAML line. No graph edit, no `depends_on` enumeration.
- No drift — the YAML can never contradict the code, because the YAML makes no claims about edges.
- Self-validating — cycles and missing dependencies surface at topological-sort time, during pipeline startup, before any feature executes.
- Human-readable — reviewers see a flat feature list + publish declaration; they can scan it in seconds.
Caveats¶
- Requires a feature abstraction — the pattern requires each unit of work to be a class with a declarable dependency structure. Raw imperative scripts can't use it.
- Cycles are silent until runtime — a cycle in class dependencies is only caught at topological-sort time, when the pipeline starts, not statically during review. A linter could catch this earlier, but none is standard.
- Cross-cutting runtime config stays in YAML — e.g. "run feature_X but not feature_Y in this environment" still requires YAML-level selection, which can grow its own complexity.
- Opaque DAG shape — someone reading the YAML alone can't see the DAG shape; they need to read the code. This trades config-visibility for code-visibility. Usually the right trade for an engineering team, less so for non-engineering stakeholders reviewing the pipeline.
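The cycle caveat can be sketched with Python's standard-library `graphlib` (the post does not say how spark-etl implements its sort, so this is an assumed mechanism): the mistake sits quietly in the class declarations until the sort runs.

```python
# Hypothetical sketch: a cycle in code-declared dependencies is only
# caught when the framework attempts the topological sort.
from graphlib import CycleError, TopologicalSorter

# alias -> {upstream aliases}, as inferred from each class's
# (assumed) `sources` declaration
graph = {
    "feature_a": {"feature_b"},
    "feature_b": {"feature_a"},   # accidental cycle
}

try:
    list(TopologicalSorter(graph).static_order())
except CycleError as err:
    # the second element of args lists the nodes forming the cycle
    print("cycle detected:", err.args[1])
```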
Comparison to alternatives¶
| Approach | Nodes in | Edges in | Verdict |
|---|---|---|---|
| This pattern (YAML nodes, code edges) | YAML | Code | DRY; single source of truth for dependencies |
| Naive explicit-edges YAML | YAML | YAML + Code | Drift-prone; two sources of truth |
| Airflow `>>` operator in Python | Code | Code | Similar philosophy; same trade-off at a different layer |
| dbt `ref()` | SQL model files | SQL `ref()` calls | Same pattern in a SQL-first context |
| Dagster `@op` + `@graph` | Python | Python decorator args | Same pattern in Python-first orchestration |
The pattern generalises beyond Spark: any framework where units of work declare their own dependencies can skip explicit-edge config.
Seen in¶
- sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline — canonical wiki instance. Yelp's `spark-etl` package config example showing nodes-only YAML + runtime topological sort; verbatim framing about not needing "a complex diagram of dependency relationships in the yaml file."
Related¶
- systems/yelp-spark-etl — canonical implementation
- systems/apache-spark — underlying engine
- concepts/spark-etl-feature-dag — the broader feature-DAG model
- patterns/source-plus-transformation-feature-decomposition — the companion decomposition pattern
- companies/yelp