
MySQL Snapshot to S3 Data Lake

Definition

MySQL Snapshot to S3 Data Lake is the pattern of regularly (typically daily) copying the full state of MySQL operational tables into immutable Parquet / Avro / JSON files on S3, then running downstream analytics + ETL against those snapshots rather than directly against the live MySQL database.

The primary value is reproducibility: the snapshot is immutable, so the same input always produces the same output across reruns, regardless of when the rerun happens or how the live MySQL state has evolved since.

Why snapshot rather than query MySQL directly

Yelp's 2025-02-19 Revenue Data Pipeline post explicitly rejects "MySQL + Python Batch" as an architecture on exactly this reproducibility ground:

"Traditional method for generating financial reports. Rejected due to inconsistent rerun results from changing production data and slow batch processing times during peak data volumes."

Two failure modes named:

  1. Inconsistent rerun results from changing production data — if the job failed mid-run and needs to be retried, the underlying MySQL tables may have different values than they did when the first run started. Financial-systems pipelines can't tolerate this — the same input-day must always produce the same output.
  2. Slow batch processing times during peak data volumes — running analytics queries directly against the OLTP database competes with live transactional workload for the same connection pool + CPU + I/O budget.
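The first failure mode can be made concrete with a toy simulation (hypothetical, not from Yelp's post): a "revenue report" summing transaction amounts, rerun against a live table that keeps mutating versus a frozen snapshot taken at job start.

```python
import copy

# Hypothetical illustration: a toy aggregation standing in for a
# financial batch job, run against live data vs. a frozen snapshot.
live_table = [
    {"txn_id": 1, "amount_usd": 100.0},
    {"txn_id": 2, "amount_usd": 250.0},
]

def revenue_report(rows):
    return sum(r["amount_usd"] for r in rows)

# Snapshot: a deep copy taken at job start, immutable thereafter.
snapshot = copy.deepcopy(live_table)

first_run = revenue_report(snapshot)

# Production keeps moving: a new transaction lands before the retry.
live_table.append({"txn_id": 3, "amount_usd": 75.0})

retry_against_live = revenue_report(live_table)    # drifted
retry_against_snapshot = revenue_report(snapshot)  # identical to first run

assert retry_against_snapshot == first_run
assert retry_against_live != first_run
```

The retry against the snapshot reproduces the first run exactly; the retry against the live table does not, which is precisely the property a financial pipeline cannot tolerate.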

Why S3 specifically

  • Cheap at scale — S3 Standard is ~$0.023/GB/month; Glacier tiers for older snapshots drop the cost further. Keeping 30-90 days of daily snapshots is commercially reasonable for most businesses.
  • Immutable objects — S3 objects can be written once and treated as immutable. Combined with versioning, a snapshot path is a durable pointer to a specific state.
  • Spark-native — spark.read.parquet("s3a://...") is a one-liner; no JDBC connection-pool tuning.
  • Decouples OLTP from OLAP — the live MySQL database is unaffected by analytics query load.

Snapshot mechanics (typical)

Yelp's post doesn't cover these mechanics exhaustively, but the standard shape is:

  1. Dump — for each operational MySQL table, run a logical dump (e.g. via mysqldump / AWS DMS task / a logical-replication-aware tool) against a read replica to avoid impacting the primary.
  2. Convert — write as Parquet partitioned by a date column (Yelp's post names date_col="_dt" as the partition key).
  3. Land — upload to an S3 prefix that follows a date convention, e.g. s3a://bucket/tables/<table>/_dt=2024-02-29/.
  4. Register — register the path + schema in a data catalog (Yelp's post references get_schema_ids_for_data_snapshot as the schema-lookup helper).
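The dump → convert → land steps above can be sketched end-to-end with local stand-ins: sqlite3 in place of a MySQL read replica, JSON lines in place of Parquet, and a temp directory in place of the S3 bucket. The path layout follows the _dt=<date> partition convention from the post; all function and table names here are illustrative, not Yelp's actual tooling.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Local stand-ins for the real pipeline. Against real MySQL you would
# dump via e.g. mysqldump --single-transaction on a read replica to get
# a consistent point-in-time read, and write Parquet instead of JSON.

def snapshot_table(conn, table, base_dir, dt):
    """Dump one table and land it under <base>/tables/<table>/_dt=<dt>/."""
    rows = conn.execute(
        f"SELECT id, amount_usd FROM {table} ORDER BY id"
    ).fetchall()
    part_dir = Path(base_dir) / "tables" / table / f"_dt={dt}"
    part_dir.mkdir(parents=True)
    out = part_dir / "part-00000.json"
    with out.open("w") as f:
        for row_id, amount in rows:
            f.write(json.dumps({"id": row_id, "amount_usd": amount}) + "\n")
    return out

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, amount_usd REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)",
                 [(1, 100.0), (2, 250.0)])

with tempfile.TemporaryDirectory() as bucket:
    landed = snapshot_table(conn, "payments", bucket, "2024-02-29")
    lines = landed.read_text().splitlines()
    # Rows landed under .../tables/payments/_dt=2024-02-29/
```

Each run lands under a new _dt= partition and never rewrites an old one, which is what makes the snapshot path a durable pointer to a specific state.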

Yelp's Spark feature for reading such a snapshot:

class TableSnapshotFeature(SparkFeature):
    alias = f"{TABLE_NAME}_snapshot"

    def __init__(self):
        self.sources = {
            TABLE_NAME: S3PublishedSource(
                base_s3_path=get_s3_location_for_table(TABLE_NAME),
                source_schema_id=get_schema_ids_for_data_snapshot(TABLE_NAME),
                date_col="_dt",
                select_cols=TABLE_SCHEMA,
            )
        }

    def transform(self, spark, start_date, end_date, **kwargs):
        return kwargs[TABLE_NAME]
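The control flow implied by that snippet can be mimicked with minimal stubs: the framework resolves each entry in self.sources into a DataFrame and passes it to transform() as a keyword argument keyed by source name. SparkFeature and S3PublishedSource are Yelp-internal, so these stand-ins only imitate the interface, and the "DataFrame" here is a plain list.

```python
# Stand-ins for Yelp-internal classes -- interface sketch only.
class S3PublishedSource:
    def __init__(self, base_s3_path, date_col):
        self.base_s3_path = base_s3_path
        self.date_col = date_col

class SparkFeature:
    def run(self, spark, start_date, end_date):
        # Framework step: load every declared source, then hand the
        # results to the feature's transform() by name.
        loaded = {name: self._load(src) for name, src in self.sources.items()}
        return self.transform(spark, start_date, end_date, **loaded)

    def _load(self, source):
        # Real version: spark.read.parquet(source.base_s3_path) filtered
        # on source.date_col; faked here with a static row.
        return [{"table": source.base_s3_path, "_dt": "2024-02-29"}]

TABLE_NAME = "payments"

class TableSnapshotFeature(SparkFeature):
    def __init__(self):
        self.sources = {
            TABLE_NAME: S3PublishedSource(
                base_s3_path=f"s3a://bucket/tables/{TABLE_NAME}",
                date_col="_dt",
            )
        }

    def transform(self, spark, start_date, end_date, **kwargs):
        # Pass-through: the snapshot itself is the feature.
        return kwargs[TABLE_NAME]

rows = TableSnapshotFeature().run(spark=None,
                                  start_date="2024-02-29",
                                  end_date="2024-02-29")
```

The pass-through transform is the point: declaring the source does all the work, and downstream features compose by consuming the loaded snapshot by name.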

Comparison to CDC-based alternatives

CDC streams row-level changes from MySQL in near-real-time; snapshots capture full state periodically. Tradeoffs:

|                        | Daily snapshot                     | CDC                                          |
|------------------------|------------------------------------|----------------------------------------------|
| Freshness              | 24h stale max                      | Seconds stale                                |
| Cost                   | N copies of each table per month   | One copy + the log                           |
| Reproducibility        | Same input → same output, always   | Point-in-time reconstruction requires replay |
| Operational complexity | Simple batch job                   | CDC pipeline + log + replay infrastructure   |
| Schema evolution       | Handled per-snapshot               | Needs schema-registry integration            |
| Late-arriving data     | Captured in next snapshot          | Appears in-stream                            |
| Full-table re-read     | Every query                        | Requires log replay back to a base snapshot  |

For financial / regulatory / audit workloads where reproducibility dominates and 24h freshness is fine, snapshots win. For real-time dashboards / alerting / low-latency analytics, CDC wins.

See sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb for the dual-use case at DynamoDB-scale: Segment uses S3 as the CDC log substrate, unifying both approaches at the object-storage layer.

Caveats

  • 24h freshness ceiling — not suitable for real-time use cases. Monthly close-the-books: fine. Real-time fraud detection: not fine.
  • Storage cost scales with table size × snapshot frequency × retention — large tables snapshotted daily for long retention can become expensive. Lifecycle policies + cold tiers matter.
  • Schema-evolution coordination — when a source MySQL table's schema changes, downstream Spark jobs need coordinated schema updates. Yelp's post uses a schema-ID lookup (get_schema_ids_for_data_snapshot) which implies a versioned schema registry.
  • Dump window correctness — if the snapshot takes an hour to dump a large table, different rows represent state at different moments. For strict point-in-time consistency, use a read replica at a fixed binlog position.
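The storage-cost caveat is easy to quantify with the ~$0.023/GB/month S3 Standard rate quoted earlier. With daily snapshots and a rolling retention window, roughly retention_days full copies exist at any moment (a back-of-envelope sketch that ignores compression, request costs, and cold-tier transitions):

```python
S3_STANDARD_PER_GB_MONTH = 0.023  # USD, approximate

def monthly_snapshot_cost(table_gb, retention_days):
    # One full copy per daily snapshot, all retained concurrently.
    copies_retained = retention_days
    return table_gb * copies_retained * S3_STANDARD_PER_GB_MONTH

# A 500 GB table kept for 90 days of daily snapshots:
cost = monthly_snapshot_cost(table_gb=500, retention_days=90)
# 500 * 90 * 0.023 = 1035.0 USD/month -- why lifecycle policies matter
```

A thousand dollars a month for one mid-sized table shows why lifecycle policies and Glacier tiers stop being optional as tables grow.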
