CONCEPT
MySQL Snapshot to S3 Data Lake¶
Definition¶
MySQL Snapshot to S3 Data Lake is the pattern of regularly (typically daily) copying the full state of MySQL operational tables into immutable Parquet / Avro / JSON files on S3, then running downstream analytics + ETL against those snapshots rather than directly against the live MySQL database.
The primary value is reproducibility: the snapshot is immutable, so the same input always produces the same output across reruns, regardless of when the rerun happens or how the live MySQL state has evolved since.
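The reproducibility guarantee can be illustrated with a minimal in-memory sketch (the table contents and the `daily_revenue` helper are hypothetical, not from Yelp's post): a rerun against the frozen snapshot is unaffected by writes that hit the live table in the meantime.

```python
import copy

# Hypothetical stand-ins for a live MySQL table and its snapshot.
live_table = [{"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 250}]
snapshot = copy.deepcopy(live_table)  # immutable copy taken at snapshot time

def daily_revenue(rows):
    return sum(r["amount"] for r in rows)

first_run = daily_revenue(snapshot)
live_table.append({"order_id": 3, "amount": 75})  # production keeps changing
rerun = daily_revenue(snapshot)  # the retry reads the snapshot, not the live table
assert first_run == rerun == 350
```

Querying `live_table` on the rerun would have returned 425, which is exactly the "inconsistent rerun results" failure mode Yelp rejected.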
Why snapshot rather than query MySQL directly¶
Yelp's 2025-02-19 Revenue Data Pipeline post explicitly rejects "MySQL + Python Batch" as an architecture on exactly this reproducibility ground:
"Traditional method for generating financial reports. Rejected due to inconsistent rerun results from changing production data and slow batch processing times during peak data volumes."
Two failure modes named:
- Inconsistent rerun results from changing production data — if the job failed mid-run and needs to be retried, the underlying MySQL tables may have different values than they did when the first run started. Financial-systems pipelines can't tolerate this — the same input-day must always produce the same output.
- Slow batch processing times during peak data volumes — running analytics queries directly against the OLTP database competes with live transactional workload for the same connection pool + CPU + I/O budget.
Why S3 specifically¶
- Cheap at scale — S3 Standard is ~$0.023/GB/month; Glacier tiers for older snapshots drop the cost further. Keeping 30-90 days of daily snapshots is commercially reasonable for most businesses.
- Immutable objects — S3 objects can be written once and treated as immutable. Combined with versioning, a snapshot path is a durable pointer to a specific state.
- Spark-native — `spark.read.parquet("s3a://...")` is a one-liner; no JDBC connection-pool tuning.
- Decouples OLTP from OLAP — the live MySQL database is unaffected by analytics query load.
Snapshot mechanics (typical)¶
Not exhaustively covered in Yelp's post, but the standard shape:
- Dump — for each operational MySQL table, run a logical dump (e.g. via `mysqldump` / AWS DMS task / a logical-replication-aware tool) against a read replica to avoid impacting the primary.
- Convert — write as Parquet partitioned by a date column (Yelp's post names `date_col="_dt"` as the partition key).
- Land — upload to an S3 prefix that follows a date convention, e.g. `s3a://bucket/tables/<table>/_dt=2024-02-29/`.
- Register — register the path + schema in a data catalog (Yelp's post references `get_schema_ids_for_data_snapshot` as the schema-lookup helper).
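The date-prefix convention in the Land step can be sketched as a small helper. The function names here are hypothetical; only the `s3a://.../<table>/_dt=YYYY-MM-DD/` layout comes from the convention above.

```python
from datetime import date, timedelta

def snapshot_prefix(bucket: str, table: str, snapshot_date: date) -> str:
    # Build the date-partitioned landing prefix for one table snapshot.
    return f"s3a://{bucket}/tables/{table}/_dt={snapshot_date.isoformat()}/"

def partition_prefixes(bucket: str, table: str, start: date, end: date) -> list:
    # Enumerate the daily partitions a reader would scan for [start, end] inclusive.
    days = (end - start).days + 1
    return [snapshot_prefix(bucket, table, start + timedelta(n)) for n in range(days)]
```

For example, `snapshot_prefix("bucket", "orders", date(2024, 2, 29))` yields `s3a://bucket/tables/orders/_dt=2024-02-29/`, matching the example path above.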
Yelp's Spark feature for reading such a snapshot:
```python
class TableSnapshotFeature(SparkFeature):
    alias = f"{TABLE_NAME}_snapshot"

    def __init__(self):
        self.sources = {
            TABLE_NAME: S3PublishedSource(
                base_s3_path=get_s3_location_for_table(TABLE_NAME),
                source_schema_id=get_schema_ids_for_data_snapshot(TABLE_NAME),
                date_col="_dt",
                select_cols=TABLE_SCHEMA,
            )
        }

    def transform(self, spark, start_date, end_date, **kwargs):
        return kwargs[TABLE_NAME]
```
Comparison to CDC-based alternatives¶
CDC streams row-level changes from MySQL in near-real-time; snapshots capture full state periodically. Tradeoffs:
| | Daily snapshot | CDC |
|---|---|---|
| Freshness | 24h stale max | Seconds stale |
| Cost | N copies of each table per month | One copy + the log |
| Reproducibility | Same input → same output, always | Point-in-time reconstruction requires replay |
| Operational complexity | Simple batch job | CDC pipeline + log + replay infrastructure |
| Schema evolution | Handled per-snapshot | Needs schema-registry integration |
| Late-arriving data | Captured in next snapshot | Appears in-stream |
| Full-table re-read | Every query | Requires log replay back to a base snapshot |
For financial / regulatory / audit workloads where reproducibility dominates and 24h freshness is fine, snapshots win. For real-time dashboards / alerting / low-latency analytics, CDC wins.
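The reproducibility row can be made concrete with a toy sketch (all data and operations are hypothetical): a snapshot is a direct read of point-in-time state, while CDC must replay the change log from a base state to reconstruct it.

```python
# Snapshot: point-in-time state is a direct read, one step.
snapshot_at_day_2 = {"order_1": 100, "order_2": 250}

# CDC: reconstruct the same state by replaying the row-level change log.
base = {}
change_log = [
    ("upsert", "order_1", 100),
    ("upsert", "order_2", 200),
    ("upsert", "order_2", 250),  # later update to the same row
]
for op, key, value in change_log:
    if op == "upsert":
        base[key] = value

assert base == snapshot_at_day_2
```

Both arrive at the same state, but the CDC path requires retaining and replaying every intermediate change, which is the "replay infrastructure" cost in the table above.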
See sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb for the dual-use case at DynamoDB scale: Segment uses S3 as the CDC log substrate, unifying both approaches at the object-storage layer.
Composes with¶
- concepts/data-lakehouse — the modern evolution: add Iceberg-style table semantics on top of snapshots for ACID + time travel + schema evolution. Yelp's 2025-02-19 post predates this evolution at Yelp; snapshots are "just" Parquet without open-table-format layering.
- ELT-vs-ETL — snapshots + S3 enable ELT (land raw first, transform in warehouse engine).
- patterns/daily-mysql-snapshot-plus-spark-etl — the canonical production shape of the snapshot-plus-engine pairing.
- concepts/spark-etl-feature-dag — the DAG abstraction built on top of snapshot inputs.
Caveats¶
- 24h freshness ceiling — not suitable for real-time use cases. Monthly close-the-books: fine. Real-time fraud detection: not fine.
- Storage cost scales with table size × snapshot frequency × retention — large tables snapshotted daily for long retention can become expensive. Lifecycle policies + cold tiers matter.
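The scaling claim is simple arithmetic. A sketch using the S3 Standard price quoted earlier, modeling daily snapshots each retained `retention_days` days, so that many full copies exist at any moment (the function is hypothetical):

```python
S3_STANDARD_USD_PER_GB_MONTH = 0.023  # S3 Standard price cited above

def monthly_snapshot_cost(table_gb: float, retention_days: int,
                          price: float = S3_STANDARD_USD_PER_GB_MONTH) -> float:
    # Daily snapshots kept for `retention_days` days means `retention_days`
    # full copies of the table are stored at any given time.
    return table_gb * retention_days * price

# A 500 GB table snapshotted daily with 90-day retention:
cost = monthly_snapshot_cost(500, 90)  # ≈ $1,035/month at S3 Standard rates
```

Moving the older snapshots to Glacier-class storage via a lifecycle policy is how that number is kept in check.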
- Schema-evolution coordination — when a source MySQL table's schema changes, downstream Spark jobs need coordinated schema updates. Yelp's post uses a schema-ID lookup (`get_schema_ids_for_data_snapshot`), which implies a versioned schema registry.
- Dump window correctness — if the snapshot takes an hour to dump a large table, different rows represent state at different moments. For strict point-in-time consistency, use a read replica at a fixed binlog position.
Seen in¶
- sources/2025-02-19-yelp-revenue-automation-series-building-revenue-data-pipeline — canonical wiki instance. Yelp's explicit rejection of MySQL+Python Batch on reproducibility grounds; Data Lake + Spark ETL chosen specifically for independent reproducibility + peak-time scalability.
Related¶
- systems/mysql — the operational source database
- systems/aws-s3 — the snapshot substrate
- systems/apache-spark — the typical downstream engine
- concepts/data-lakehouse — the modern open-table-format evolution
- concepts/elt-vs-etl — enabled by snapshotting
- concepts/change-data-capture — the real-time alternative
- patterns/daily-mysql-snapshot-plus-spark-etl
- companies/yelp