
PATTERN

Daily MySQL Snapshot Plus Spark ETL

Problem

You need to build an OLTP-to-analytics pipeline where:

  • Source-of-truth data lives in MySQL operational tables that mutate continuously.
  • Analytics / ETL jobs must be reproducible across reruns — same input-day should always produce the same output, regardless of when the rerun happens.
  • Peak-time throughput must be decoupled from the OLTP database so the analytics workload doesn't compete with live transactions.
  • Complex business logic is required that SQL alone can't express cleanly (multi-priority tie-breaks, stateful grouped-row iteration).
  • Sub-day freshness is not required — a 24-hour freshness ceiling is acceptable.

Canonical domains: financial reporting (revenue recognition, book close), regulatory compliance (audit trails), month-end analytics, subscription-lifecycle reporting.

Pattern

Daily-snapshot each MySQL operational table into S3 as Parquet, then run a Spark ETL pipeline against the snapshots using a feature-DAG abstraction with a YAML-declared node list and topology-inferred edges.
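The YAML-declared node list amounts to a mapping from each feature to its declared inputs; edges and run order fall out of those declarations. A minimal sketch of that inference, with hypothetical feature names and a plain dict standing in for the YAML:

```python
# Each feature declares its inputs; the framework infers edges and a
# topological run order from the declarations (names are hypothetical).
features = {
    "subscriptions_snapshot": [],                        # source snapshot, no inputs
    "active_subs": ["subscriptions_snapshot"],
    "revenue_lines": ["active_subs"],
    "rev_rec_report": ["revenue_lines", "active_subs"],
}

def topo_order(graph):
    """Return features in dependency order via depth-first traversal."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in graph[node]:
            visit(dep)
        order.append(node)

    for node in graph:
        visit(node)
    return order
```

A real framework would also detect cycles and missing nodes; the point here is that no edge list is ever written by hand.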

Concrete shape:

MySQL operational DB (mutable, live)
       │ daily dump (read replica → Parquet)
       ▼
S3 (immutable snapshots, date-partitioned)
       ▼
[Spark ETL framework](<../systems/yelp-spark-etl.md>)
  ├─ source-snapshot features (read unchanged)
  └─ transformation features (filter / join / UDF / aggregate)
       ▼
S3 (published outputs in target-template shape)
       ▼
Consumers: data warehouse, 3rd-party SaaS, BI tools, etc.
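The reproducibility guarantee hinges on the snapshot layout: every run for a given input day resolves to the same immutable, date-partitioned prefix. A sketch of that resolution (bucket name and path layout are hypothetical):

```python
from datetime import date

def snapshot_path(bucket: str, table: str, run_date: date) -> str:
    # Immutable, date-partitioned snapshot location (hypothetical layout).
    return f"s3://{bucket}/snapshots/{table}/ds={run_date.isoformat()}/"

# Reruns for the same input day resolve to the same immutable prefix, so a
# Spark read of that prefix sees identical bytes no matter when it runs.
path = snapshot_path("analytics-snapshots", "subscriptions", date(2025, 2, 19))
```

Because the prefix is never rewritten, a backfill and the original run read byte-identical input, which is exactly the rerun invariant the pattern promises.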

Load-bearing properties

  • Independent reproducibility: the snapshot is immutable, so the same input always produces the same output. Reruns are safe.
  • Peak-time scalability: Spark auto-scales its executor pool based on workload; MySQL doesn't see the analytics query load at all.
  • Community support: Spark has a vast ecosystem of tooling, patterns, and human expertise.
  • Full expressive power: UDFs + built-in functions cover simple and complex logic (vs warehouse+dbt, which struggles on multi-priority tie-break logic per concepts/pyspark-udf-for-complex-business-logic).

Canonical implementation choices (Yelp 2025-02-19)

| Concern | Choice | Why |
| --- | --- | --- |
| Snapshot substrate | AWS S3 (Parquet) | Cheap, immutable, Spark-native |
| Snapshot frequency | Daily | Matches financial-reporting cadence |
| Processing engine | Apache Spark (PySpark) | Community, expressive power, peak scalability |
| DAG orchestration | Internal spark-etl package | Feature abstraction; YAML-declared nodes, topology-inferred edges |
| Debugging | Checkpoint to scratch + JupyterHub | Substitutes for breakpoint debugging on distributed, lazy Spark |
| Complex logic | PySpark UDFs | When built-ins + window functions would be hard to read |
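As one illustration of the "complex logic" row: a multi-priority tie-break is awkward as stacked window functions but trivial as a plain Python function that PySpark can wrap as a UDF. The priority scheme and row shape below are hypothetical:

```python
# Hypothetical priority ranking; lower rank wins the tie-break.
PRIORITY = {"refund": 0, "chargeback": 1, "invoice": 2}

def pick_winner(rows):
    """Pick one row per group: best priority rank first, then the newest
    timestamp, then the lowest id as a deterministic final tie-break."""
    return min(rows, key=lambda r: (PRIORITY[r["type"]], -r["ts"], r["id"]))
```

In Spark this core would run per group (for example via a grouped-map UDF); keeping the tie-break in pure Python makes it readable and unit-testable, which is the trade the pattern makes against native-function performance.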

Rejected alternatives (Yelp's documented evaluation)

  • MySQL + Python batch — rejected on reproducibility (mutating production data breaks rerun invariants) and peak-time throughput.
  • Data Warehouse + dbt — rejected because revenue-recognition logic is "difficult to represent in SQL." The warehouse + dbt path works brilliantly for aggregation-heavy pipelines (see concepts/elt-vs-etl + Canva's case) but struggles when business logic requires stateful iteration or multi-priority tie-breaks.
  • Event Streams + Stream Processing — rejected because (a) immediate presentation is not needed and (b) the third-party consumer (REVREC SaaS in Yelp's case) doesn't support stream integration, so streaming would add complexity without benefit.

Composes with

When not to use

  • Real-time freshness required — daily snapshots have a 24h staleness ceiling. If the consumer needs seconds-scale freshness, use CDC instead (e.g. Debezium + Kafka + stream processing).
  • Pure aggregation workload — if the transformation is entirely expressible as SQL (aggregates, filters, joins), warehouse + dbt may be simpler and cheaper. See Canva's case study.
  • Small data volume — the whole snapshot + Spark infrastructure has overhead; for small tables, a Python cronjob against a read replica may be cheaper to build and maintain.
  • Third-party consumer supports streaming — if the downstream can accept a stream (e.g. event-driven analytics warehouse), the stream path eliminates the 24h latency.

Caveats

  • Snapshot storage cost scales — full daily snapshots of many large tables can get expensive. Lifecycle policies + cold tiers matter. For very large tables, incremental snapshots or CDC-based alternatives may be cheaper.
  • Feature-count ceiling — Yelp's pipeline runs 50+ features in a two-stage Spark job, and the team names this as a maintenance risk. The pattern doesn't solve the feature-sprawl problem; it just makes the sprawl visible.
  • UDF cost accumulates — complex logic via UDFs is easy to write but expensive to run. Plan for a follow-on pass to replace UDFs with native PySpark where source-data simplification allows.
  • Schema evolution coordination — source MySQL schema changes must be reconciled with Spark consumers. Requires versioned schema registry + coordinated deploy.
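One cheap guardrail for the schema-evolution caveat is an additive-compatibility check run before the Spark job reads a snapshot. A sketch with plain dicts standing in for the schema registry entry and the snapshot's actual columns:

```python
def compatible(expected: dict, actual: dict) -> bool:
    """True if every column the readers expect exists in the snapshot with
    the same type; extra snapshot columns are allowed (additive evolution)."""
    return all(actual.get(col) == typ for col, typ in expected.items())

# A dropped or retyped source column fails this check before the ETL runs,
# instead of surfacing as a confusing mid-job Spark error.
```

In practice the expected side would come from a versioned schema registry and the actual side from the Parquet footer; the check itself stays this small.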

Seen in
