PATTERN Cited by 1 source
Source Plus Transformation Feature Decomposition¶
Problem¶
Spark ETL pipelines tend to sprawl into monolithic scripts:
one pipeline.py file with hundreds of lines of DataFrame
reads + successive transformations + final writes. When the
pipeline grows to dozens of stages, this becomes:
- Hard to reuse — the same source table gets read multiple times by different transformation sequences, each written inline.
- Hard to test — transformations aren't isolated; you can't test one aggregation without running the whole script.
- Hard to debug — failures surface with no clear boundary between units of work.
- Hard to parallelise development — two engineers modifying the same monolithic file step on each other's changes.
Pattern¶
Decompose the Spark ETL pipeline into features, where each
feature is a self-contained class with a sources dict + a
transform() method. Split features into two sub-types:
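Both sub-types share one small interface. A minimal sketch of what such a base class might look like (hypothetical — the source post does not show Yelp's actual `SparkFeature` definition):

```python
from abc import ABC, abstractmethod


class SparkFeature(ABC):
    """Hypothetical base interface: a feature is a `sources` dict
    plus a `transform()` method."""

    #: mapping of kwarg name -> upstream feature (or raw source descriptor)
    sources: dict = {}

    @abstractmethod
    def transform(self, spark, start_date, end_date, **kwargs):
        """Return this feature's output DataFrame.

        Upstream DataFrames are injected by the runtime as keyword
        arguments, keyed by the names in `sources`.
        """
```

The runtime, not the feature, is responsible for resolving `sources` and injecting the resulting DataFrames — a feature never reads its dependencies directly.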
Source data snapshot features¶
Thin wrappers that read a raw input (e.g. an S3 snapshot) without transformation. Serve as DAG roots.
```python
class TableXSnapshotFeature(SparkFeature):
    alias = "table_x_snapshot"

    def __init__(self):
        self.sources = {
            "table_x": S3PublishedSource(
                base_s3_path=get_s3_location_for_table("table_x"),
                ...,
            )
        }

    def transform(self, spark, start_date, end_date, **kwargs):
        return kwargs["table_x"]  # pass-through
```
Transformation features¶
Consume one or more other features as DataFrame inputs via the
sources dict. Apply real transformation logic in
transform().
```python
class EnrichedContractsFeature(SparkFeature):
    def __init__(self):
        self.sources = {
            "contracts": ContractsSnapshotFeature(),
            "products": ProductsSnapshotFeature(),
        }

    def transform(self, spark, start_date, end_date, **kwargs):
        return kwargs["contracts"].join(
            kwargs["products"], on="product_id", how="left"
        ).filter(...)
```
Why the split matters¶
Source features are loaded once; transformations are composed many times¶
Multiple transformation features can depend on the same source feature. The runtime loads the source snapshot once and passes the resulting DataFrame to every downstream consumer — no redundant S3 reads.
Without the split, three transformation features written as
inline scripts would each read contracts_snapshot from S3
independently — three times the S3 request cost and latency.
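A toy runner illustrates the load-once behaviour (hypothetical sketch, not Yelp's spark-etl API; raw-source reading is elided and plain values stand in for DataFrames):

```python
class OncePerFeatureRunner:
    """Hypothetical resolver: materialises each feature at most once,
    so a snapshot shared by several transformation features is read a
    single time. Assumes features expose `sources` (name -> upstream
    feature) and `transform(spark, start_date, end_date, **kwargs)`."""

    def __init__(self, spark=None):
        self.spark = spark
        self._done = {}  # feature class -> cached output DataFrame

    def materialise(self, feature, start_date=None, end_date=None):
        key = type(feature)
        if key in self._done:
            return self._done[key]  # already loaded: no re-read
        # Recursively resolve upstream features named in `sources`.
        # (Raw source descriptors like S3PublishedSource are skipped
        # here; a real runtime would read them from storage.)
        inputs = {
            name: self.materialise(dep, start_date, end_date)
            for name, dep in feature.sources.items()
            if hasattr(dep, "transform")
        }
        out = feature.transform(self.spark, start_date, end_date, **inputs)
        self._done[key] = out
        return out
```

With this memoisation, two transformation features that both list `ContractsSnapshotFeature()` in their `sources` trigger exactly one snapshot load.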
Source features are trivial; transformations are where the work lives¶
A pass-through source feature's transform() is one line.
This lets code reviewers focus 100% of attention on the
transformation features, where the business logic lives.
Source-feature changes are usually schema updates, not logic.
The DAG shape is self-describing¶
Reading a transformation feature's self.sources dict tells
you its dependencies immediately — no need to scan a 500-line
pipeline script to trace data flow.
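Because dependencies are declared as data, the edge list can be derived mechanically. A hypothetical helper, assuming each feature exposes an `alias` string and a `sources` dict:

```python
def dag_edges(feature, seen=None):
    """Return (upstream_alias, downstream_alias) edges for everything
    reachable upstream of `feature`, walking `sources` depth-first."""
    seen = set() if seen is None else seen
    edges = []
    for dep in feature.sources.values():
        if not hasattr(dep, "sources"):
            continue  # raw source descriptor, not a feature
        edge = (dep.alias, feature.alias)
        if edge not in seen:
            seen.add(edge)
            edges.append(edge)
            edges.extend(dag_edges(dep, seen))
    return edges
```

Pointing this at a terminal output feature reconstructs the full DAG with no pipeline-script archaeology.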
Yelp's canonical example¶
The 2025-02-19 Revenue Data Pipeline post describes a pipeline built from:
- Many source snapshot features — one per operational MySQL table (revenue contracts, invoices, products, discounts, fulfillment events, ...).
- Many transformation features — aggregation features (sum of line items per contract), enrichment features (join contract to product catalog), logic features (apply the multi-priority discount UDF per customer), template-mapping features (reshape to REVREC SaaS template).
The DAG converges: many snapshot roots → several transformation layers → a few output features that write the final REVREC-template Parquet to S3.
Composes with¶
- patterns/yaml-declared-feature-dag-topology-inferred — features are listed as flat nodes in YAML; edges are inferred from each feature's sources.
- concepts/checkpoint-intermediate-dataframe-debugging — feature boundaries are natural checkpoint points for scratch-path materialisation.
- patterns/daily-mysql-snapshot-plus-spark-etl — the source features read MySQL daily snapshots.
Caveats¶
- Not a silver bullet for feature-count sprawl — Yelp's pipeline has 50+ features and the team names this as a maintenance risk. The pattern makes the sprawl visible + more tractable, but doesn't prevent it. Further improvements (feature-team-owned standardised interfaces) are needed at scale.
- Requires framework support — the pattern is most valuable when a runtime framework (Yelp's spark-etl, or an equivalent) handles topological sort + DataFrame passing automatically. Writing the orchestration by hand defeats the benefit.
- Source-feature pass-through cost — each feature carries class + method boilerplate. For micro-pipelines, a monolithic script may be cheaper to write; the pattern pays off at ~5-10+ features.
- Not applicable outside Spark — the source-vs-transformation distinction is idiomatic for dataflow engines. Stream processing (Flink, Kafka Streams) and workflow orchestrators (Airflow, Dagster) have different natural abstractions.
Seen in¶
- sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline — canonical wiki instance. Explicit two-category feature taxonomy (source snapshot / transformation) with code examples of each.