Skip to content

CONCEPT Cited by 1 source

Full-dump vs delta vs target (tri-layer CDC schema)

Definition

The tri-layer CDC schema decomposes a CDC ingestion job's data into three internal tables, each with a distinct role:

Table Role Update cadence
Full-dump table Periodic full snapshot of the source database Periodic (slow + expensive)
Delta table Captures source changes (insert / update / delete events) Continuous (per-source-change)
Target table Consumer-visible state = full-dump + applied deltas Materialised at query time or on a fixed cadence

"Each data ingestion job has its own internal table for a full dump of source databases (full dump), an internal table for capturing changes of source databases (delta), and the target table consumed by the data customers. All the information about job entities, including table names and table schemas, is saved and managed by the central management service." — Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale

Why three tables, not two

Many CDC pipelines store only two: deltas (the change stream) + target (the materialised current state). The third — the full- dump table — is what bounds delta-replay cost. Without a periodic full-dump anchor, reconstructing target state requires replaying deltas from the beginning of time; with one, only deltas since the last full-dump need to be applied.

Concretely, the target table is computed as:

target = (latest full-dump) + (deltas applied after the full-dump's snapshot timestamp)

The frequency of the full-dump is a tradeoff:

  • More frequent full-dumps → smaller delta-replay window → faster target-table materialisation, but higher full-dump cost (slow + expensive scans of the source).
  • Less frequent full-dumps → larger delta-replay window → cheaper full-dump cost, but slower target-table materialisation + more delta-storage overhead.

Full-dumps are slow + expensive — and migrations multiply them

Meta's post explicitly names the full-dump cost as the load-bearing operational tax during a CDC system migration:

"Due to the system's CDC design, a new job's first snapshot was landed via a full dump, which is typically slow and expensive. If we detected data quality issues in a landed snapshot we also triggered another full dump to land a corrected snapshot after the underlying bugs were fixed."

Two distinct events trigger full-dumps during migration:

  1. First snapshot of a new job (one full-dump per migrated job).
  2. Data-quality remediation if the first snapshot has issues that get fixed (another full-dump per affected job).

Meta's response: patterns/snapshot-reuse-from-legacy-during-migration — reuse the legacy system's snapshot output as the new system's seed snapshot, avoiding the first full-dump entirely.

  • vs dual-table CDC (delta + target only): simpler but no full-dump anchor, so delta-replay cost grows with time-since- start. Many warehouse-side CDC pipelines start dual-table and add periodic full-dumps as an optimisation later.
  • vs log-only CDC (e.g. Kafka topics holding the change log): the change log is the delta layer; target reconstruction is pushed entirely to the consumer side. The tri-layer schema pre-materialises the target inside the ingestion system.
  • vs snapshot-only ingestion (no deltas): every load is a full-dump; freshness limited by full-dump cadence; no intermediate state visible to consumers.

Seen in

Last updated · 542 distilled / 1,571 read