
MySQL Snapshot to S3 Data Lake

Definition

MySQL Snapshot to S3 Data Lake is the pattern of regularly (typically daily) copying the full state of MySQL operational tables into immutable Parquet / Avro / JSON files on S3, then running downstream analytics + ETL against those snapshots rather than directly against the live MySQL database.

The primary value is reproducibility: the snapshot is immutable, so the same input always produces the same output across reruns, regardless of when the rerun happens or how the live MySQL state has evolved since.

Why snapshot rather than query MySQL directly

Yelp's 2025-02-19 Revenue Data Pipeline post explicitly rejects "MySQL + Python Batch" as an architecture on exactly this reproducibility ground:

"Traditional method for generating financial reports. Rejected due to inconsistent rerun results from changing production data and slow batch processing times during peak data volumes."

Two failure modes named:

  1. Inconsistent rerun results from changing production data — if the job failed mid-run and needs to be retried, the underlying MySQL tables may have different values than they did when the first run started. Financial-systems pipelines can't tolerate this — the same input-day must always produce the same output.
  2. Slow batch processing times during peak data volumes — running analytics queries directly against the OLTP database competes with live transactional workload for the same connection pool + CPU + I/O budget.
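The first failure mode can be made concrete with a toy simulation (hypothetical, not from Yelp's post): a "revenue report" summing transaction amounts, rerun against a live table that keeps mutating versus a frozen snapshot taken at job start.

```python
import copy

# Hypothetical illustration: a toy aggregation standing in for a
# financial batch job, run against live data vs. a frozen snapshot.
live_table = [
    {"txn_id": 1, "amount_usd": 100.0},
    {"txn_id": 2, "amount_usd": 250.0},
]

def revenue_report(rows):
    return sum(r["amount_usd"] for r in rows)

# Snapshot: a deep copy taken at job start, immutable thereafter.
snapshot = copy.deepcopy(live_table)

first_run = revenue_report(snapshot)

# Production keeps moving: a new transaction lands before the retry.
live_table.append({"txn_id": 3, "amount_usd": 75.0})

retry_against_live = revenue_report(live_table)    # drifted
retry_against_snapshot = revenue_report(snapshot)  # identical to first run

assert retry_against_snapshot == first_run
assert retry_against_live != first_run
```

The retry against the snapshot reproduces the first run exactly; the retry against the live table does not, which is precisely the property a financial pipeline cannot tolerate.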

Why S3 specifically

  • Cheap at scale — S3 Standard is ~$0.023/GB/month; Glacier tiers for older snapshots drop the cost further. Keeping 30-90 days of daily snapshots is commercially reasonable for most businesses.
  • Immutable objects — S3 objects can be written once and treated as immutable. Combined with versioning, a snapshot path is a durable pointer to a specific state.
  • Spark-native — spark.read.parquet("s3a://...") is a one-liner; no JDBC connection-pool tuning.
  • Decouples OLTP from OLAP — the live MySQL database is unaffected by analytics query load.

Snapshot mechanics (typical)

Yelp's post doesn't cover these mechanics exhaustively, but the standard shape is:

  1. Dump — for each operational MySQL table, run a logical dump (e.g. via mysqldump / AWS DMS task / a logical-replication-aware tool) against a read replica to avoid impacting the primary.
  2. Convert — write as Parquet partitioned by a date column (Yelp's post names date_col="_dt" as the partition key).
  3. Land — upload to an S3 prefix that follows a date convention, e.g. s3a://bucket/tables/<table>/_dt=2024-02-29/.
  4. Register — register the path + schema in a data catalog (Yelp's post references get_schema_ids_for_data_snapshot as the schema-lookup helper).
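The dump → convert → land steps above can be sketched end-to-end with local stand-ins: sqlite3 in place of a MySQL read replica, JSON lines in place of Parquet, and a temp directory in place of the S3 bucket. The path layout follows the _dt=<date> partition convention from the post; all function and table names here are illustrative, not Yelp's actual tooling.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Local stand-ins for the real pipeline. Against real MySQL you would
# dump via e.g. mysqldump --single-transaction on a read replica to get
# a consistent point-in-time read, and write Parquet instead of JSON.

def snapshot_table(conn, table, base_dir, dt):
    """Dump one table and land it under <base>/tables/<table>/_dt=<dt>/."""
    rows = conn.execute(
        f"SELECT id, amount_usd FROM {table} ORDER BY id"
    ).fetchall()
    part_dir = Path(base_dir) / "tables" / table / f"_dt={dt}"
    part_dir.mkdir(parents=True)
    out = part_dir / "part-00000.json"
    with out.open("w") as f:
        for row_id, amount in rows:
            f.write(json.dumps({"id": row_id, "amount_usd": amount}) + "\n")
    return out

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, amount_usd REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)",
                 [(1, 100.0), (2, 250.0)])

with tempfile.TemporaryDirectory() as bucket:
    landed = snapshot_table(conn, "payments", bucket, "2024-02-29")
    lines = landed.read_text().splitlines()
    # Rows landed under .../tables/payments/_dt=2024-02-29/
```

Each run lands under a new _dt= partition and never rewrites an old one, which is what makes the snapshot path a durable pointer to a specific state.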

Yelp's Spark feature for reading such a snapshot:

class TableSnapshotFeature(SparkFeature):
    alias = f"{TABLE_NAME}_snapshot"

    def __init__(self):
        self.sources = {
            TABLE_NAME: S3PublishedSource(
                base_s3_path=get_s3_location_for_table(TABLE_NAME),
                source_schema_id=get_schema_ids_for_data_snapshot(TABLE_NAME),
                date_col="_dt",
                select_cols=TABLE_SCHEMA,
            )
        }

    def transform(self, spark, start_date, end_date, **kwargs):
        return kwargs[TABLE_NAME]
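The control flow implied by that snippet can be mimicked with minimal stubs: the framework resolves each entry in self.sources into a DataFrame and passes it to transform() as a keyword argument keyed by source name. SparkFeature and S3PublishedSource are Yelp-internal, so these stand-ins only imitate the interface, and the "DataFrame" here is a plain list.

```python
# Stand-ins for Yelp-internal classes -- interface sketch only.
class S3PublishedSource:
    def __init__(self, base_s3_path, date_col):
        self.base_s3_path = base_s3_path
        self.date_col = date_col

class SparkFeature:
    def run(self, spark, start_date, end_date):
        # Framework step: load every declared source, then hand the
        # results to the feature's transform() by name.
        loaded = {name: self._load(src) for name, src in self.sources.items()}
        return self.transform(spark, start_date, end_date, **loaded)

    def _load(self, source):
        # Real version: spark.read.parquet(source.base_s3_path) filtered
        # on source.date_col; faked here with a static row.
        return [{"table": source.base_s3_path, "_dt": "2024-02-29"}]

TABLE_NAME = "payments"

class TableSnapshotFeature(SparkFeature):
    def __init__(self):
        self.sources = {
            TABLE_NAME: S3PublishedSource(
                base_s3_path=f"s3a://bucket/tables/{TABLE_NAME}",
                date_col="_dt",
            )
        }

    def transform(self, spark, start_date, end_date, **kwargs):
        # Pass-through: the snapshot itself is the feature.
        return kwargs[TABLE_NAME]

rows = TableSnapshotFeature().run(spark=None,
                                  start_date="2024-02-29",
                                  end_date="2024-02-29")
```

The pass-through transform is the point: declaring the source does all the work, and downstream features compose by consuming the loaded snapshot by name.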

Comparison to CDC-based alternatives

CDC streams row-level changes from MySQL in near-real-time; snapshots capture full state periodically. Tradeoffs:

|                        | Daily snapshot                     | CDC                                          |
|------------------------|------------------------------------|----------------------------------------------|
| Freshness              | 24h stale max                      | Seconds stale                                |
| Cost                   | N copies of each table per month   | One copy + the log                           |
| Reproducibility        | Same input → same output, always   | Point-in-time reconstruction requires replay |
| Operational complexity | Simple batch job                   | CDC pipeline + log + replay infrastructure   |
| Schema evolution       | Handled per-snapshot               | Needs schema-registry integration            |
| Late-arriving data     | Captured in next snapshot          | Appears in-stream                            |
| Full-table re-read     | Every query                        | Requires log replay back to a base snapshot  |

For financial / regulatory / audit workloads where reproducibility dominates and 24h freshness is fine, snapshots win. For real-time dashboards / alerting / low-latency analytics, CDC wins.

See sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb for the dual-use case at DynamoDB-scale: Segment uses S3 as the CDC log substrate, unifying both approaches at the object-storage layer.

Caveats

  • 24h freshness ceiling — not suitable for real-time use cases. Monthly close-the-books: fine. Real-time fraud detection: not fine.
  • Storage cost scales with table size × snapshot frequency × retention — large tables snapshotted daily for long retention can become expensive. Lifecycle policies + cold tiers matter.
  • Schema-evolution coordination — when a source MySQL table's schema changes, downstream Spark jobs need coordinated schema updates. Yelp's post uses a schema-ID lookup (get_schema_ids_for_data_snapshot) which implies a versioned schema registry.
  • Dump window correctness — if the snapshot takes an hour to dump a large table, different rows represent state at different moments. For strict point-in-time consistency, use a read replica at a fixed binlog position.
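The storage-cost caveat is easy to quantify with the ~$0.023/GB/month S3 Standard rate quoted earlier. With daily snapshots and a rolling retention window, roughly retention_days full copies exist at any moment (a back-of-envelope sketch that ignores compression, request costs, and cold-tier transitions):

```python
S3_STANDARD_PER_GB_MONTH = 0.023  # USD, approximate

def monthly_snapshot_cost(table_gb, retention_days):
    # One full copy per daily snapshot, all retained concurrently.
    copies_retained = retention_days
    return table_gb * copies_retained * S3_STANDARD_PER_GB_MONTH

# A 500 GB table kept for 90 days of daily snapshots:
cost = monthly_snapshot_cost(table_gb=500, retention_days=90)
# 500 * 90 * 0.023 = 1035.0 USD/month -- why lifecycle policies matter
```

A thousand dollars a month for one mid-sized table shows why lifecycle policies and Glacier tiers stop being optional as tables grow.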
