
PATTERN

Daily MySQL Snapshot Plus Spark ETL

Problem

You need to build an OLTP-to-analytics pipeline where:

  • Source-of-truth data lives in MySQL operational tables that mutate continuously.
  • Analytics / ETL jobs must be reproducible across reruns — same input-day should always produce the same output, regardless of when the rerun happens.
  • Peak-time throughput must be decoupled from the OLTP database so the analytics workload doesn't compete with live transactions.
  • Complex business logic is required that SQL alone can't express cleanly (multi-priority tie-breaks, stateful grouped-row iteration).
  • Sub-day freshness is not required — a 24-hour freshness ceiling is acceptable.

Canonical domains: financial reporting (revenue recognition, book close), regulatory compliance (audit trails), month-end analytics, subscription-lifecycle reporting.

Pattern

Daily-snapshot each MySQL operational table into S3 as Parquet, then run a Spark ETL pipeline against the snapshots using a feature-DAG abstraction with a YAML-declared node list and topology-inferred edges.
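The YAML-declared node list amounts to a mapping from each feature to its declared inputs; edges and run order fall out of those declarations. A minimal sketch of that inference, with hypothetical feature names and a plain dict standing in for the YAML:

```python
# Each feature declares its inputs; the framework infers edges and a
# topological run order from the declarations (names are hypothetical).
features = {
    "subscriptions_snapshot": [],                        # source snapshot, no inputs
    "active_subs": ["subscriptions_snapshot"],
    "revenue_lines": ["active_subs"],
    "rev_rec_report": ["revenue_lines", "active_subs"],
}

def topo_order(graph):
    """Return features in dependency order via depth-first traversal."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in graph[node]:
            visit(dep)
        order.append(node)

    for node in graph:
        visit(node)
    return order
```

A real framework would also detect cycles and missing nodes; the point here is that no edge list is ever written by hand.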

Concrete shape:

MySQL operational DB (mutable, live)
       │ daily dump (read replica → Parquet)
       ▼
S3 (immutable snapshots, date-partitioned)
       ▼
[Spark ETL framework](<../systems/yelp-spark-etl.md>)
  ├─ source-snapshot features (read unchanged)
  └─ transformation features (filter / join / UDF / aggregate)
       ▼
S3 (published outputs in target-template shape)
       ▼
Consumers: data warehouse, 3rd-party SaaS, BI tools, etc.
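The reproducibility guarantee hinges on the snapshot layout: every run for a given input day resolves to the same immutable, date-partitioned prefix. A sketch of that resolution (bucket name and path layout are hypothetical):

```python
from datetime import date

def snapshot_path(bucket: str, table: str, run_date: date) -> str:
    # Immutable, date-partitioned snapshot location (hypothetical layout).
    return f"s3://{bucket}/snapshots/{table}/ds={run_date.isoformat()}/"

# Reruns for the same input day resolve to the same immutable prefix, so a
# Spark read of that prefix sees identical bytes no matter when it runs.
path = snapshot_path("analytics-snapshots", "subscriptions", date(2025, 2, 19))
```

Because the prefix is never rewritten, a backfill and the original run read byte-identical input, which is exactly the rerun invariant the pattern promises.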

Load-bearing properties

  • Independent reproducibility: the snapshot is immutable, so the same input always produces the same output. Reruns are safe.
  • Peak-time scalability: Spark auto-scales its executor pool based on workload; MySQL doesn't see the analytics query load at all.
  • Community support: Spark has a vast ecosystem of tooling, patterns, and human expertise.
  • Full expressive power: UDFs + built-in functions cover simple and complex logic (vs warehouse+dbt, which struggles on multi-priority tie-break logic per concepts/pyspark-udf-for-complex-business-logic).

Canonical implementation choices (Yelp 2025-02-19)

| Concern | Choice | Why |
| --- | --- | --- |
| Snapshot substrate | AWS S3 (Parquet) | Cheap, immutable, Spark-native |
| Snapshot frequency | Daily | Matches financial-reporting cadence |
| Processing engine | Apache Spark (PySpark) | Community, expressive power, peak scalability |
| DAG orchestration | Internal spark-etl package | Feature abstraction; YAML-declared nodes, topology-inferred edges |
| Debugging | Checkpoint to scratch + JupyterHub | Substitutes for breakpoint debugging on distributed, lazy Spark |
| Complex logic | PySpark UDFs | When built-ins + window functions would be hard to read |
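As one illustration of the "complex logic" row: a multi-priority tie-break is awkward as stacked window functions but trivial as a plain Python function that PySpark can wrap as a UDF. The priority scheme and row shape below are hypothetical:

```python
# Hypothetical priority ranking; lower rank wins the tie-break.
PRIORITY = {"refund": 0, "chargeback": 1, "invoice": 2}

def pick_winner(rows):
    """Pick one row per group: best priority rank first, then the newest
    timestamp, then the lowest id as a deterministic final tie-break."""
    return min(rows, key=lambda r: (PRIORITY[r["type"]], -r["ts"], r["id"]))
```

In Spark this core would run per group (for example via a grouped-map UDF); keeping the tie-break in pure Python makes it readable and unit-testable, which is the trade the pattern makes against native-function performance.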

Rejected alternatives (Yelp's documented evaluation)

  • MySQL + Python batch — rejected on reproducibility (mutating production data breaks rerun invariants) and peak-time throughput.
  • Data Warehouse + dbt — rejected because revenue-recognition logic is "difficult to represent in SQL." The warehouse + dbt path works brilliantly for aggregation-heavy pipelines (see concepts/elt-vs-etl + Canva's case) but struggles when business logic requires stateful iteration or multi-priority tie-breaks.
  • Event Streams + Stream Processing — rejected because (a) immediate presentation is not needed and (b) the third-party consumer (REVREC SaaS in Yelp's case) doesn't support stream integration, so streaming would add complexity without benefit.

Composes with

When not to use

  • Real-time freshness required — daily snapshots have a 24h staleness ceiling. If the consumer needs seconds-scale freshness, use CDC instead (e.g. Debezium + Kafka + stream processing).
  • Pure aggregation workload — if the transformation is entirely expressible as SQL (aggregates, filters, joins), warehouse + dbt may be simpler and cheaper. See Canva's case study.
  • Small data volume — the whole snapshot + Spark infrastructure has overhead; for small tables, a Python cronjob against a read replica may be cheaper to build and maintain.
  • Third-party consumer supports streaming — if the downstream can accept a stream (e.g. event-driven analytics warehouse), the stream path eliminates the 24h latency.

Caveats

  • Snapshot storage cost scales — full daily snapshots of many large tables can get expensive. Lifecycle policies + cold tiers matter. For very large tables, incremental snapshots or CDC-based alternatives may be cheaper.
  • Feature-count ceiling — Yelp's pipeline runs 50+ features in a two-stage Spark job, and the team names this as a maintenance risk. The pattern doesn't solve the feature-sprawl problem; it just makes the sprawl visible.
  • UDF cost accumulates — complex logic via UDFs is easy to write but expensive to run. Plan for a follow-on pass to replace UDFs with native PySpark where source-data simplification allows.
  • Schema evolution coordination — source MySQL schema changes must be reconciled with Spark consumers. Requires versioned schema registry + coordinated deploy.
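One cheap guardrail for the schema-evolution caveat is an additive-compatibility check run before the Spark job reads a snapshot. A sketch with plain dicts standing in for the schema registry entry and the snapshot's actual columns:

```python
def compatible(expected: dict, actual: dict) -> bool:
    """True if every column the readers expect exists in the snapshot with
    the same type; extra snapshot columns are allowed (additive evolution)."""
    return all(actual.get(col) == typ for col, typ in expected.items())

# A dropped or retyped source column fails this check before the ETL runs,
# instead of surfacing as a confusing mid-job Spark error.
```

In practice the expected side would come from a versioned schema registry and the actual side from the Parquet footer; the check itself stays this small.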

Seen in
