Daily MySQL Snapshot Plus Spark ETL
Problem
You need to build an OLTP-to-analytics pipeline where:
- Source-of-truth data lives in MySQL operational tables that mutate continuously.
- Analytics / ETL jobs must be reproducible across reruns — same input-day should always produce the same output, regardless of when the rerun happens.
- Peak-time throughput must be decoupled from the OLTP database so the analytics workload doesn't compete with live transactions.
- Complex business logic is required that SQL alone can't express cleanly (multi-priority tie-breaks, stateful grouped-row iteration).
- Sub-day freshness is not required — a 24-hour freshness ceiling is acceptable.
Canonical domains: financial reporting (revenue recognition, book close), regulatory compliance (audit trails), month-end analytics, subscription-lifecycle reporting.
Pattern
Daily-snapshot each MySQL operational table into S3 as Parquet, then run a Spark ETL pipeline against the snapshots using a feature-DAG abstraction with a YAML-declared node list and topology-inferred edges.
Concrete shape:
MySQL operational DB (mutable, live)
│
│ daily dump (read replica → Parquet)
▼
S3 (immutable snapshots, date-partitioned)
│
▼
[Spark ETL framework](<../systems/yelp-spark-etl.md>)
├─ source-snapshot features (read unchanged)
└─ transformation features (filter / join / UDF / aggregate)
│
▼
S3 (published outputs in target-template shape)
│
▼
Consumer: data warehouse, 3rd party SaaS, BI tools, etc.
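The "YAML-declared node list with topology-inferred edges" idea can be sketched in plain Python: each feature declares only its inputs, and the edge list and execution order fall out of those declarations. The feature names and config shape below are invented for illustration; Yelp's actual spark-etl package API is not public.

```python
# Sketch: infer DAG edges from each feature's declared inputs, then
# topologically sort so upstream features run before downstream ones.
# Feature names and the config shape are illustrative, not Yelp's schema.
from graphlib import TopologicalSorter

# What a YAML feature list might parse into: {feature_name: [input_features]}
features = {
    "transactions_snapshot": [],                      # source-snapshot feature
    "subscriptions_snapshot": [],                     # source-snapshot feature
    "recognized_revenue": ["transactions_snapshot"],  # transformation feature
    "revenue_by_subscription": ["recognized_revenue", "subscriptions_snapshot"],
}

def execution_order(features):
    """Edges are inferred from inputs; no explicit edge list in the config."""
    return list(TopologicalSorter(features).static_order())

order = execution_order(features)
# Every feature appears after all of its declared inputs.
assert order.index("recognized_revenue") > order.index("transactions_snapshot")
```

The point of inferring edges is that adding a feature means declaring one node plus its inputs; the wiring is never maintained by hand.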
Load-bearing properties
- Independent reproducibility: the snapshot is immutable, so the same input always produces the same output. Reruns are safe.
- Peak-time scalability: Spark auto-scales its executor pool based on workload; MySQL doesn't see the analytics query load at all.
- Community support: Spark has a vast ecosystem of tooling, patterns, and human expertise.
- Full expressive power: UDFs + built-in functions cover simple and complex logic (vs warehouse+dbt, which struggles on multi-priority tie-break logic per concepts/pyspark-udf-for-complex-business-logic).
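The reproducibility property comes from addressing each run by snapshot date rather than by "now": the job's input is a fixed S3 prefix, not a live table. A minimal sketch (the bucket name and partition layout are hypothetical):

```python
from datetime import date

def snapshot_path(table: str, ds: date, bucket: str = "example-data-lake") -> str:
    """Immutable, date-partitioned snapshot location.

    A rerun for the same input day reads the same bytes no matter when it
    executes — the live MySQL table is never queried by the analytics job.
    (Bucket name and partition layout are illustrative.)
    """
    return f"s3://{bucket}/snapshots/{table}/date={ds.isoformat()}/"

# Rerunning "2025-02-19" next week still resolves to the same immutable input:
assert snapshot_path("transactions", date(2025, 2, 19)) == \
    "s3://example-data-lake/snapshots/transactions/date=2025-02-19/"
```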
Canonical implementation choices (Yelp 2025-02-19)
| Concern | Choice | Why |
|---|---|---|
| Snapshot substrate | AWS S3 (Parquet) | Cheap, immutable, Spark-native |
| Snapshot frequency | Daily | Matches financial-reporting cadence |
| Processing engine | Apache Spark (PySpark) | Community, expressive power, peak scalability |
| DAG orchestration | Internal spark-etl package | Feature abstraction + YAML topology-inferred |
| Debugging | Checkpoint to scratch + JupyterHub | Substitutes for breakpoint debugging on distributed+lazy Spark |
| Complex logic | PySpark UDFs | When built-ins + window functions would be hard to read |
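The kind of multi-priority tie-break logic that motivates UDFs is easy to see as a plain Python function; in a real pipeline this body would be wrapped with `pyspark.sql.functions.udf` or applied per group. The column names and priority rules below are invented for illustration.

```python
# Sketch of multi-priority tie-break logic that is awkward to express in pure
# SQL but trivial as a Python function. Column names and priorities are
# illustrative, not from Yelp's pipeline.
SOURCE_PRIORITY = {"manual_adjustment": 0, "billing_system": 1, "backfill": 2}

def pick_canonical(rows):
    """From candidate rows for one entity, pick the canonical one:
    1) lowest source priority wins;
    2) ties broken by latest updated_at;
    3) remaining ties broken by smallest id (keeps reruns deterministic).
    """
    return min(
        rows,
        key=lambda r: (
            SOURCE_PRIORITY.get(r["source"], 99),  # unknown sources rank last
            -r["updated_at"],                      # newer wins within a priority
            r["id"],                               # deterministic final tie-break
        ),
    )

rows = [
    {"id": 7, "source": "billing_system", "updated_at": 300},
    {"id": 3, "source": "billing_system", "updated_at": 300},
    {"id": 9, "source": "backfill", "updated_at": 400},
]
assert pick_canonical(rows)["id"] == 3  # priority tie, timestamp tie, min id
```

Expressing the same cascade in SQL means nested window functions or correlated subqueries; as a function it reads top to bottom.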
Rejected alternatives (Yelp's documented evaluation)
- MySQL + Python batch — rejected on reproducibility (mutating production data breaks rerun invariants) + peak-time throughput.
- Data Warehouse + dbt — rejected because revenue-recognition logic was "difficult to represent in SQL." The warehouse + dbt path works brilliantly for aggregation-heavy pipelines (see concepts/elt-vs-etl + Canva's case) but struggles when business logic requires stateful iteration or multi-priority tie-breaks.
- Event Streams + Stream Processing — rejected because (a) immediate presentation is not needed, and (b) the third-party consumer (REVREC SaaS in Yelp's case) doesn't support stream integration, so streaming would add complexity without benefit.
Composes with
- patterns/source-plus-transformation-feature-decomposition — the feature-DAG decomposition that makes the Spark pipeline maintainable at 50+ features.
- patterns/yaml-declared-feature-dag-topology-inferred — the YAML config shape for declaring the feature list.
- concepts/checkpoint-intermediate-dataframe-debugging — the debugging-workflow complement.
- concepts/mysql-snapshot-to-s3-data-lake — the reproducibility primitive.
When not to use
- Real-time freshness required — daily snapshots have a 24h staleness ceiling. If the consumer needs seconds-scale freshness, use CDC instead (e.g. Debezium + Kafka + stream processing).
- Pure aggregation workload — if the transformation is entirely expressible as SQL (aggregates, filters, joins), warehouse + dbt may be simpler and cheaper. See Canva's case study.
- Small data volume — the whole snapshot + Spark infrastructure has overhead; for small tables, a Python cronjob against a read replica may be cheaper to build and maintain.
- Third-party consumer supports streaming — if the downstream can accept a stream (e.g. event-driven analytics warehouse), the stream path eliminates the 24h latency.
Caveats
- Snapshot storage cost scales — full daily snapshots of many large tables can get expensive. Lifecycle policies + cold tiers matter. For very large tables, incremental snapshots or CDC-based alternatives may be cheaper.
- Feature-count ceiling — Yelp's pipeline runs 50+ features in a two-stage Spark job, and the team names this as a maintenance risk. The pattern doesn't solve the feature-sprawl problem; it just makes the sprawl visible.
- UDF cost accumulates — complex logic via UDFs is easy to write but expensive to run. Plan for a follow-on pass to replace UDFs with native PySpark where source-data simplification allows.
- Schema evolution coordination — source MySQL schema changes must be reconciled with Spark consumers. Requires versioned schema registry + coordinated deploy.
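The storage-cost caveat is usually handled with bucket lifecycle rules that cold-tier and eventually expire old snapshots. A sketch of an S3 lifecycle configuration (the prefix, day counts, and storage class are illustrative choices, not a recommendation):

```json
{
  "Rules": [
    {
      "ID": "snapshot-tiering",
      "Filter": { "Prefix": "snapshots/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}
```

Retention windows should follow the reproducibility requirement: a snapshot can only be expired once no rerun will ever need it (audit and compliance obligations often set that floor).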
Seen in
- sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline — canonical wiki instance. Full four-option architecture evaluation (three rejections + this pattern chosen), concrete implementation via Yelp's spark-etl package, known scale (50+ features, 2 stages), named future-improvement axes.