
PATTERN

Source Plus Transformation Feature Decomposition

Problem

Spark ETL pipelines tend to sprawl into monolithic scripts: a single pipeline.py file with hundreds of lines of DataFrame reads, successive transformations, and final writes. When the pipeline grows to dozens of stages, this becomes:

  • Hard to reuse — the same source table gets read multiple times by different transformation sequences, each written inline.
  • Hard to test — transformations aren't isolated; you can't test one aggregation without running the whole script.
  • Hard to debug — failures surface with no clear boundary between units of work.
  • Hard to parallelise development — two engineers modifying the same monolithic file step on each other's changes.

Pattern

Decompose the Spark ETL pipeline into features, where each feature is a self-contained class with a sources dict + a transform() method. Split features into two sub-types:
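The feature classes in this note assume a base class roughly like the one sketched below. This is illustrative only, not the actual spark-etl API: the alias and sources attributes mirror the examples here, but the signature and everything else is an assumption.

```python
# Hypothetical base class matching the shape the examples below assume.
# Not the real spark-etl API -- names and signatures are illustrative.
class SparkFeature:
    # Registry name under which this feature's output DataFrame is published.
    alias = None

    # Mapping of input name -> upstream feature (or raw source reader).
    # Subclasses populate this in __init__.
    sources = {}

    def transform(self, spark, start_date, end_date, **kwargs):
        """Receive each resolved source as a DataFrame keyword argument
        (keyed by the names in self.sources) and return one DataFrame."""
        raise NotImplementedError
```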

Source data snapshot features

Thin wrappers that read a raw input (e.g. an S3 snapshot) without transformation. Serve as DAG roots.

class TableXSnapshotFeature(SparkFeature):
    alias = "table_x_snapshot"

    def __init__(self):
        self.sources = {
            "table_x": S3PublishedSource(
                base_s3_path=get_s3_location_for_table("table_x"),
                ...,
            )
        }

    def transform(self, spark, start_date, end_date, **kwargs):
        return kwargs["table_x"]  # pass-through

Transformation features

Consume one or more other features as DataFrame inputs via the sources dict. Apply real transformation logic in transform().

class EnrichedContractsFeature(SparkFeature):
    def __init__(self):
        self.sources = {
            "contracts": ContractsSnapshotFeature(),
            "products": ProductsSnapshotFeature(),
        }

    def transform(self, spark, start_date, end_date, **kwargs):
        return kwargs["contracts"].join(
            kwargs["products"], on="product_id", how="left"
        ).filter(...)

Why the split matters

Source features are loaded once; transformations are composed many times

Multiple transformation features can depend on the same source feature. The runtime loads the source snapshot once and passes the resulting DataFrame to every downstream consumer — no redundant S3 reads.

contracts_snapshot ───┬──► feature_A
                      ├──► feature_B
                      └──► feature_C

Without the split, three transformation features written as inline scripts would each read contracts_snapshot from S3 independently — three times the S3 request cost and latency.
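A minimal sketch of the memoised evaluation described above, assuming features follow the sources-dict-plus-transform() shape: each feature class runs exactly once per pipeline run, and its DataFrame is handed to every downstream consumer. The real framework (Yelp's spark-etl) does this automatically; run_feature here is a hypothetical stand-in.

```python
# Hypothetical runner (not the spark-etl API): resolve a feature's
# upstream sources recursively, memoising by feature class so a shared
# source snapshot is loaded once and reused by every consumer.
def run_feature(feature, spark, start_date, end_date, cache=None):
    if cache is None:
        cache = {}
    key = type(feature)
    if key in cache:
        return cache[key]  # already computed: no redundant S3 read
    # Recursion yields a topological order: sources resolve first.
    inputs = {
        name: run_feature(dep, spark, start_date, end_date, cache)
        for name, dep in feature.sources.items()
    }
    cache[key] = feature.transform(spark, start_date, end_date, **inputs)
    return cache[key]
```

With a shared cache, evaluating feature_A, feature_B, and feature_C triggers exactly one read of contracts_snapshot.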

Source features are trivial; transformations are where the work lives

A pass-through source feature's transform() is one line. This lets code reviewers concentrate their attention on the transformation features, where the business logic lives. Source-feature changes are usually schema updates, not logic.

The DAG shape is self-describing

Reading a transformation feature's self.sources dict tells you its dependencies immediately — no need to scan a 500-line pipeline script to trace data flow.
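Because dependencies are plain data, they can be inspected without running anything. A small illustrative helper (not part of any framework) that renders a feature's dependency tree from the sources dicts:

```python
# Illustrative helper: walk a feature's sources dict recursively and
# return its dependency tree as indented lines. Assumes each feature
# has .sources and optionally .alias, as in the examples above.
def dependency_tree(feature, indent=0):
    name = getattr(feature, "alias", None) or type(feature).__name__
    lines = [" " * indent + name]
    for dep in feature.sources.values():
        lines.extend(dependency_tree(dep, indent + 2))
    return lines
```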

Yelp's canonical example

From the 2025-02-19 Revenue Data Pipeline post, the pipeline reads:

  • Many source snapshot features — one per operational MySQL table (revenue contracts, invoices, products, discounts, fulfillment events, ...).
  • Many transformation features — aggregation features (sum of line items per contract), enrichment features (join contract to product catalog), logic features (apply the multi-priority discount UDF per customer), template-mapping features (reshape to REVREC SaaS template).

The DAG converges: many snapshot roots → several transformation layers → a few output features that write the final REVREC-template Parquet to S3.

Caveats

  • Not a silver bullet for feature-count sprawl — Yelp's pipeline has 50+ features and the team names this as a maintenance risk. The pattern makes the sprawl visible + more tractable, but doesn't prevent it. Further improvements (feature-team-owned standardised interfaces) are needed at scale.
  • Requires framework support — the pattern is most valuable when a runtime framework (Yelp's spark-etl, or equivalents) handles topological sort + DataFrame passing automatically. Writing the orchestration by hand defeats the benefit.
  • Source-feature pass-through cost — each feature carries class and method boilerplate. For micro-pipelines, a monolithic script may be cheaper to write. The pattern pays off at ~5-10+ features.
  • Not applicable outside Spark — the source-vs-transformation distinction is idiomatic for dataflow engines. Stream processing (Flink, Kafka Streams) and workflow orchestrators (Airflow, Dagster) have different natural abstractions.
