PATTERN

YAML-Declared Feature DAG with Topology Inferred

Problem

ETL pipelines need configuration that:

  • Lists the units of work (features, tasks, nodes) in the pipeline.
  • Declares the dependency structure (which unit depends on which).
  • Is human-readable and reviewable.

Naive DAG configs require both nodes and edges to be declared:

# Naive approach — edges explicit
features:
  feature_a:
    class: ...
    depends_on: []
  feature_b:
    class: ...
    depends_on: [feature_a]
  feature_c:
    class: ...
    depends_on: [feature_a, feature_b]

This creates redundancy:

  1. Dependency duplication — the feature's class already knows its dependencies (via imports, constructor calls, or source declarations). Writing them again in YAML creates two sources of truth that can drift.
  2. Error surface — mismatches between YAML depends_on and actual code dependencies are silent failures.
  3. Maintenance cost — adding a new dependency requires editing two places: the feature's code + its YAML entry.
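To make the duplication concrete, here is a minimal sketch of what an in-code dependency declaration might look like. The class and attribute names are hypothetical, modelled loosely on the SparkFeature.sources idea discussed below, not Yelp's actual API:

```python
# Hypothetical sketch: the class itself maps each transform() kwarg
# to the upstream feature alias that produces it.
class FeatureC:
    # kwarg name -> upstream feature alias
    sources = {"a_df": "feature_a", "b_df": "feature_b"}

    def transform(self, a_df, b_df):
        # transform() cannot run without its declared inputs, so the
        # mapping in `sources` is already authoritative.
        return a_df + b_df
```

With a declaration like this in the code, a YAML depends_on: [feature_a, feature_b] entry would merely restate sources — and could silently drift from it.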

Pattern

Declare only the nodes in YAML. Let the runtime infer the edges from each feature's in-code declaration of its dependencies.

The YAML is a flat list of feature aliases + their class paths, plus a terminal publish declaration:

features:
    feature1_alias:
        class: path.to.my.Feature1Class
    feature2_alias:
        class: path.to.my.Feature2Class
    feature3_alias:
        class: path.to.my.Feature3Class
    feature4_alias:
        class: path.to.my.Feature4Class

publish:
    s3:
        - feature4_alias:
            path: s3a://bucket/path/to/desired/location
            overwrite: True
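One way a framework might resolve each alias's class: path at load time is a dynamic import of the dotted path — a sketch only; the post does not show spark-etl's actual loader:

```python
import importlib


def load_feature_classes(features_config):
    """Resolve each alias's dotted `class:` path to a Python class.

    `features_config` is the parsed `features:` mapping from the YAML,
    e.g. {"feature1_alias": {"class": "path.to.my.Feature1Class"}}.
    """
    classes = {}
    for alias, spec in features_config.items():
        # Split "path.to.my.Feature1Class" into module path and class name.
        module_path, _, class_name = spec["class"].rpartition(".")
        module = importlib.import_module(module_path)
        classes[alias] = getattr(module, class_name)
    return classes


# Demonstrated with a stdlib class standing in for a real feature class:
loaded = load_feature_classes({"ordered": {"class": "collections.OrderedDict"}})
```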

The framework:

  1. Loads each feature class.
  2. Inspects the class's internal dependency declaration (e.g. Yelp's SparkFeature.sources dict).
  3. Topologically sorts all features by their declared dependencies.
  4. Executes in dependency order, passing each feature's output as a kwarg to downstream consumers.
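Steps 3 and 4 can be sketched with the standard library's graphlib; the sources dict and transform() method are illustrative names following the SparkFeature.sources idea in step 2, not spark-etl's actual API:

```python
from graphlib import TopologicalSorter


def run_pipeline(features):
    """Run features in dependency order, feeding each output downstream.

    `features` maps alias -> class; each class declares a `sources` dict
    of kwarg name -> upstream alias (illustrative, not spark-etl's API).
    """
    # Edges are recovered from the classes, never from the YAML.
    graph = {alias: set(cls.sources.values()) for alias, cls in features.items()}
    outputs = {}
    for alias in TopologicalSorter(graph).static_order():
        cls = features[alias]
        # Pass each upstream output as the kwarg named in `sources`.
        kwargs = {kw: outputs[dep] for kw, dep in cls.sources.items()}
        outputs[alias] = cls().transform(**kwargs)
    return outputs
```

Note that the YAML never needs to list the features in dependency order; static_order() recovers a valid sequence regardless of declaration order.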

The Yelp canon

From the 2025-02-19 Revenue Data Pipeline post, verbatim framing:

"The dependency relationship is handled by a user defined yaml file which contains all the related Spark features. There is no need to draw a complex diagram of dependency relationships in the yaml file. At runtime, spark-etl figures out the execution sequence according to topology."

Yelp's post shows a DAG where feature4 depends on feature2 + feature3, feature3 depends on feature1 + feature2, feature2 depends on feature1 — but the YAML config only lists the four features. The edges are recovered from each feature's code.
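Assuming each feature's code declares exactly those edges, the execution order the framework would recover for Yelp's example can be checked with the standard library:

```python
from graphlib import TopologicalSorter

# The edges from the post's example DAG, as the framework would recover
# them from each feature's code (node -> its dependencies):
deps = {
    "feature2": {"feature1"},
    "feature3": {"feature1", "feature2"},
    "feature4": {"feature2", "feature3"},
}
order = list(TopologicalSorter(deps).static_order())
# Exactly one node becomes ready at each step, so the order here is
# deterministic: feature1, feature2, feature3, feature4.
```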

Why this works

The key insight: dependencies are a code-level fact, not a config-level fact. A feature's transform() method literally cannot run without its inputs. The inputs are enumerated in the feature's class (via sources dict or equivalent). Re-declaring them in YAML adds no information that isn't already available from the code.

The YAML's job is limited to:

  • Selection — which features should run in this pipeline (vs other pipelines using the same feature library).
  • Terminal publication — which feature outputs should be written to the sink.
  • Runtime parameters — date ranges, checkpoint lists, output paths.

Edges aren't selection, publication, or parameters. They belong in the code.

Benefits

  • DRY config — one source of truth (the class's dependency declaration), never two.
  • One-line feature addition — new features require one new YAML line. No graph edit. No depends_on enumeration.
  • No drift — YAML can never contradict the code because YAML doesn't claim anything about edges.
  • Self-validating — cycles and missing dependencies surface at topological sort time, not at runtime.
  • Human-readable — reviewers see a flat feature list + publish declaration; they can scan it in seconds.

Caveats

  • Requires a feature abstraction — the pattern requires each unit of work to be a class with a declarable dependency structure. Raw imperative scripts can't use it.
  • Cycles surface late — a cycle in class dependencies is invisible in the YAML and only caught at topological-sort time, when the pipeline starts. A linter could catch it statically, but that isn't standard.
  • Cross-cutting runtime config stays in YAML — e.g. "run feature_X but not feature_Y in this environment" still requires YAML-level selection, which can grow its own complexity.
  • Opaque DAG shape — someone reading the YAML alone can't see the DAG shape; they need to read the code. This trades config-visibility for code-visibility. Usually the right trade for an engineering team, less so for non-engineering stakeholders reviewing the pipeline.
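The cycle caveat can be demonstrated directly. Here graphlib stands in for whichever topological sort the framework uses; the feature names are hypothetical:

```python
from graphlib import TopologicalSorter, CycleError

# Two hypothetical features that wrongly depend on each other:
deps = {"feature_a": {"feature_b"}, "feature_b": {"feature_a"}}
try:
    list(TopologicalSorter(deps).static_order())
    caught = False
except CycleError as err:
    # Raised at sort time, before any feature runs; err.args[1]
    # names the nodes participating in the cycle.
    caught = True
```

So the failure is loud and early — at pipeline startup rather than mid-run — but still later than a static check on the YAML alone could ever be.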

Comparison to alternatives

Approach                               | Nodes in        | Edges in              | Verdict
This pattern (YAML nodes, code edges)  | YAML            | Code                  | DRY; single source of truth for dependencies
Naive explicit-edges YAML              | YAML            | YAML + Code           | Drift-prone; two sources of truth
Airflow >> operator in Python          | Code            | Code                  | Similar philosophy; same trade-off at a different layer
dbt ref()                              | SQL model files | SQL ref() calls       | Same pattern in a SQL-first context
Dagster @op + @graph                   | Python          | Python decorator args | Same pattern in Python-first orchestration

The pattern generalises beyond Spark: any framework where units of work declare their own dependencies can skip explicit-edge config.

Seen in

  • Yelp Engineering Blog, Revenue Data Pipeline post (2025-02-19): the spark-etl framework.