Skip to content

META 2026-05-12

Read original ↗

Meta — Migrating Data Ingestion Systems at Meta Scale

Summary

A 2026-05-12 Meta Engineering Data Infrastructure post describing how Meta successfully migrated 100% of its data ingestion workload off a legacy customer-owned-pipelines architecture onto a simpler self-managed data-warehouse service that "still operates efficiently at hyperscale," and fully deprecated the legacy system. The substrate being migrated: the daily petabyte-scale incremental scrape of Meta's social graph from one of the world's largest MySQL deployments into the Data Warehouse, powering analytics, reporting, ML training, and product development across Meta. The migration challenge is the post's focus, not the new architecture per se: "how to make sure each job would be migrated seamlessly" combined with "how to perform large scale migration itself" across tens of thousands of ingestion jobs. Meta's solution decomposes into two orthogonal solution sets. (A) Per-job seamless transition is governed by a three-phase migration job lifecycleShadow Phase (a shadow job in pre-production reads the same source as the production job but writes to a separate shadow table; row count + checksum compared continuously) → Reverse Shadow Phase (the shadow job is promoted to production-table writer; the original production job is demoted to writing the shadow table — "effectively making the shadow job the new production job") → Migration Cleanup (the demoted job, now running on the legacy system, is removed) — canonical instance of the new patterns/shadow-then-reverse-shadow-migration pattern, structurally distinct from parallel run (which keeps the old system authoritative throughout) and from Notion's four-step double-write+backfill (which dual-writes from one writer to two stores instead of running two writers in opposite directions). Three explicit promotion criteria gate every phase transition: (1) zero data-quality discrepancies (row count + checksum match between old and new system's outputs); (2) no landing-latency regression (new system delivers data on time at minimum, ideally faster); (3) no resource-utilization regression (compute + storage are equal-or-better than the legacy job). For critical tables, additional service-team-negotiated criteria apply. (B) Large-scale migration execution decomposes into automated lifecycle promotion (jobs continuously emit promotion-criteria signals to Scuba; external migration tooling auto-promotes or auto-demotes jobs between lifecycle phases based on whether the criteria are still met — canonical instance of the new patterns/automated-job-lifecycle-promotion pattern, with system-level + job-level dashboards as the operator surface) plus batch planning under capacity constraints (capacity won't let all shadow jobs run simultaneously, so jobs are categorised by throughput, priority, and special cases; batches deliberately exclude jobs with known unresolved issues to avoid noise — canonical instance of patterns/known-issue-exclusion-batch-selection; Meta also reuses snapshot partitions delivered by the legacy system as the new system's initial snapshot to avoid the slow + expensive full-dump cost that CDC's first landing otherwise incurs — canonical instance of patterns/snapshot-reuse-from-legacy-during-migration). Underneath both solution sets sits the architectural fact that this is a CDC system migrating to another CDC system: each job has a full-dump table → delta table → target table tri-layer data flow, with metadata in a central management service. CDC's defining hazard — bad-data propagation — is the property that "the data generated by the system is used again to generate the new data," meaning "if previous landed data has any issues the problematic data will be passed to the new landed data." Meta's two-pronged response: (1) early signals after rollout (during reverse shadow, trigger backfill on both production and shadow jobs — if the backfill outputs match, migration is verified; if not, immediate rollback without consumer impact); (2) partition-level quality marking stops bleeding (when a partition is detected as bad, mark it in metadata as having bad data quality — a delta partition so marked stops new data landing and alerts a team member; a target partition so marked causes the system to select an older partition and merge it with more deltas instead) — canonical instance of the new patterns/partition-marking-stops-cdc-bleeding pattern. The custom data-quality analysis tool — log mismatches to Scuba, hourly read those logs, run example-row queries to find the offending rows, log debugging info back — is canonical instance of the new patterns/data-quality-analysis-tool-with-edge-case-logging pattern; it also continues to serve as a release-validation tool post-migration, not just during it. The post is architecture-and-discipline voice: production numbers (the full migration completed; legacy system fully deprecated; all jobs landed) but no QPS, no fleet size, no exact job count beyond "tens of thousands," no compute / storage delta between old and new, no migration duration, no rollback rate during migration, no number-of-batches, no full-dump cost in absolute terms.

Key takeaways

  1. The forcing function is a CDC system's incumbent-rules-itself problem. A naive "build a new ingestion system, point all source tables at it" fails because Meta's legacy system exhibited "signs of instability under the increasingly strict data landing time requirements" — and any CDC migration carries the bad-data-propagation risk that a single corrupted delta becomes embedded in every subsequent target-table state. The post is structurally a migration-discipline post, not a new-system-architecture post: "migrating a data ingestion system of this scale was a major challenge" — both because of per-job correctness and because "how to perform large scale migration itself" is a separate problem from any single job's correctness. The two sub-challenges are addressed independently throughout the article (Source).
  2. The three-phase Shadow → Reverse Shadow → Cleanup lifecycle is the canonical migration shape on the wiki for active-CDC systems. Distinguishing characteristic: in Shadow Phase, both jobs run with the shadow job's data going to a separate shadow table so the production job is unaffected. In Reverse Shadow Phase, the writes swap targets — the shadow job's data is now written to the production table, and the production job's data is written to the shadow table. "Effectively making the shadow job the new production job" — and the original production job "now acts as the shadow job." This swap-not-side-by-side primitive is what enables the third critical property: a fast rollback path that requires no recreation of the legacy job because the legacy job is still there, still running, just labelled as shadow now. "We could roll back fast if discrepancies were detected, without needing to recreate or reconfigure the old system job." Canonical instance of patterns/shadow-then-reverse-shadow-migration — the variant of double-system-running migration where each system gets a turn as the production-target writer (Source).
  3. Three explicit, machine-checkable promotion criteria gate every phase transition. "No data quality issues. There is no difference between the data delivered by the old system and the new system. We verify this by comparing both the row count and the checksum of the data, ensuring complete consistency between the two systems." Plus: "No landing latency regression is observed. The data delivered by the new system should exhibit improved landing latency, or at minimum, match the performance of the old system." Plus: "No resource utilization regression is observed. The compute and storage usage of the job running in the new system should be improved, or at minimum, be comparable to that of the old system." The criteria are stated as non-regression floors with improvement as the optimistic case — the migration is allowed to be performance-neutral but not allowed to be a perf regression. For "the critical table migration" additional team-specific criteria are negotiated. Canonical wiki instance of migration promotion criteria as a deployable contract — emit the signals; let the criteria evaluator decide promotion automatically (Source).
  4. Bad-data propagation is the defining CDC-migration hazard. "Being a CDC process means the data generated by the system is used again to generate the new data. This means if previous landed data has any issues the problematic data will be passed to the new landed data." This is structurally different from non-CDC migrations because in batch-snapshot systems, a single bad batch can be discarded; in CDC, a single bad delta becomes part of the canonical state used for the next delta computation. Meta's two-pronged answer: (a) early signals before consumer impact via dual-direction backfill comparison during reverse shadow; (b) partition-level quality marking stops bleeding so corrupted data is bounded inside its partition rather than propagating forward. Canonical wiki instance of concepts/cdc-bad-data-propagation as a first-class concept and patterns/partition-marking-stops-cdc-bleeding as the canonical containment primitive (Source).
  5. Partition-level quality marking is the operational containment mechanism. "During the reverse shadow phase, if any data quality issues were detected in a specific partition, that partition would be marked in its metadata as having bad data quality. If this partition was a delta partition, then new data would stop landing, and an alert would be sent to a team member. If this partition was a target partition, the system would instead select an older partition and merge it with more deltas." Two operational behaviours follow from the marking: a bad delta halts CDC consumption (no new propagation) and pages an operator; a bad target partition is substituted with an older known-good partition merged forward through additional deltas — meaning the system has a substitute-on-fault primitive at the target layer that is invisible to consumers. "In this way we could stop bad data propagation quickly. For rollback, we could quickly query the metadata to find all partitions that were marked with bad data quality and fix them with backfill." The marking serves both as a runtime guard (during normal operation) and as a rollback-substrate index (for post-incident bulk fix). Canonical instance of patterns/partition-marking-stops-cdc-bleeding (Source).
  6. The custom data-quality analysis tool is the debugging substrate; it survives the migration. "For each landed shadow table partition, the system would read the corresponding production table partition and compare both the row count and checksum. Any mismatches were logged to Scuba, Meta's data management system for real-time analysis. Every hour, the data quality analysis tool read the logs from Scuba, ran queries to identify example rows causing mismatches, and logged detailed debugging information back to Scuba. This process enabled team members to quickly determine the root cause of issues and assess whether they were already known and being addressed." Notably: "This same data quality analysis tool is still in use after the migration as part of the release validation process." — the migration produced a permanent operational tool as a side effect, the kind of byproduct that justifies the up-front investment in tooling because the tool's amortised value extends past the migration. Canonical instance of patterns/data-quality-analysis-tool-with-edge-case-logging — the periodic-log-aggregation-then-targeted-query debugging primitive on top of Scuba (Source).
  7. Automated lifecycle promotion is the leverage primitive that scales to tens of thousands of jobs. "Since we established a clear migration job lifecycle and job promotion criteria, the system continuously sent job status signals to Scuba, including data related to the lifecycle promotion criteria and the job's current stage in the migration lifecycle. We built external migration tools that continuously monitored signals from each job and automatically promoted or demoted jobs between stages of the migration lifecycle, depending on whether they met (or no longer met) the migration criteria. We also built system-level and job-level dashboards so engineers could quickly track the overall migration progress as well as monitor and debug individual jobs." The demotion path matters: a job that was promoted to reverse shadow can be automatically demoted back to shadow if its criteria stop being met — a one-way gating valve would get stuck on transient discrepancies. The two-axis dashboard (system-level + job-level) is the operator surface for the gating system. Canonical instance of patterns/automated-job-lifecycle-promotion — the continuously-evaluated promotion-and-demotion-as-control-loop primitive that is impossible to do manually at tens-of-thousands-of-jobs scale (Source).
  8. Batch planning explicitly excludes jobs with known unresolved issues — to avoid debugging-noise multiplication. "For instance, they established selection criteria to exclude jobs with known issues that were still being resolved, thereby reducing noise caused by duplicate issues." And: "We avoided creating new shadow jobs with known issues until those issues were resolved. When an issue was detected we removed any potentially affected jobs from the migration list and held them until a fix was in place." The structural insight: a known issue affecting N jobs generates N copies of the same alert through the data-quality tool — each false-positive consumes operator time, and the N+1th occurrence is noise, not signal. The mitigation: delay the affected jobs' migration until the root issue is fixed, then migrate them as a batch. Canonical instance of patterns/known-issue-exclusion-batch-selection — the defer-affected-batches-until-root-fix primitive that distinguishes noise-tolerant migration planning from naive go-as-fast-as-possible planning (Source).
  9. CDC full-dumps are slow + expensive — Meta reuses legacy snapshot partitions to avoid them. "Due to the system's CDC design, a new job's first snapshot was landed via a full dump, which is typically slow and expensive. If we detected data quality issues in a landed snapshot we also triggered another full dump to land a corrected snapshot after the underlying bugs were fixed. Creating shadow jobs while known issues were still present would therefore trigger a lot of unnecessary full dumps, both at job creation time and again during data-quality remediation. By avoiding creating those jobs, we avoided large amounts of extra full dump work and improved migration efficiency. We also built creative solutions like reusing snapshot partitions delivered by the old system as snapshot initially to reduce the full dump load." Two distinct optimisations, both around the expensive-full-dump tax: (a) don't create the job until it's worth its first full dump (dovetails with batch planning above); (b) bypass the first full dump entirely by treating the legacy system's most recent snapshot output as the new job's seed snapshot. Canonical instance of patterns/snapshot-reuse-from-legacy-during-migration — possible only when both systems agree on snapshot semantics, which is a property of both being CDC systems against the same MySQL source (Source).
  10. The architectural shape underneath both systems: full-dump table + delta table + target table, all governed by a central management service. "Both our legacy and new data ingestion systems used change data capture (CDC) to incrementally ingest data into the target table. Each data ingestion job has its own internal table for a full dump of source databases (full dump), an internal table for capturing changes of source databases (delta), and the target table consumed by the data customers. All the information about job entities, including table names and table schemas, is saved and managed by the central management service." Three table types per job: full-dump table (periodic snapshot of source), delta table (per-source-change increments), target table (consumer-visible, computed as full-dump + applied deltas). The central management service holds metadata. Canonical wiki statement of the full-dump-vs-delta-vs-target tri-layer CDC schema — distinct from the simpler dual-table CDC patterns (just deltas + target) by virtue of having an explicit periodic full-dump anchor that bounds delta-replay cost (Source).

Architectural numbers + operational notes (from source)

  • Migration outcome: "successfully transitioned 100% of the workload and fully deprecated the legacy system."
  • Source-data scale (substrate, not migration scale): "Every day, our data ingestion system incrementally scrapes several petabytes of social graph data from MySQL into the data warehouse" — daily petabyte-scale incremental ingestion against Meta's social graph backed by "one of the largest MySQL deployments in the world".
  • Job count being migrated: "tens of thousands of ingestion jobs" — exact count not disclosed.
  • Daily disruptive task throughput (substrate, prior post): not redisclosed here; see sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta|2024-06-16 OpsPlanner for ~1M ops/day.
  • Three promotion criteria (verbatim verb-stripped): row-count + checksum match between old and new system; no landing-latency regression (improved or matching); no resource-utilization regression (improved or matching).
  • Critical-table criteria: "For the critical table migration, we defined and agreed on extra migration criteria with the teams who were reliant on the service." — not enumerated.
  • Lifecycle gates (named): Shadow Phase entry → Production-environment promotion → Reverse Shadow Phase → Migration Cleanup.
  • Operational substrate for signals: Scuba"Meta's data management system for real-time analysis" — both as the destination for mismatch logs and as the consumed-by-tools source of those logs.
  • Data-quality-analysis-tool cadence: "Every hour, the data quality analysis tool read the logs from Scuba, ran queries to identify example rows causing mismatches, and logged detailed debugging information back to Scuba."
  • Tool persistence: "This same data quality analysis tool is still in use after the migration as part of the release validation process."
  • Tri-layer CDC schema per job: full-dump table + delta table + target table, metadata in a central management service.
  • Bad-partition behaviour: delta partition marked bad → new data stops landing + operator alert; target partition marked bad → system selects an older partition and merges with more deltas.
  • Reverse-shadow rollback verification: "we triggered backfill on both production and shadow jobs. If the backfill results still matched it indicated the migration is successful. If the result did not match, the job would be rolled back immediately and data consumers would not be impacted."
  • Snapshot reuse: "reusing snapshot partitions delivered by the old system as snapshot initially to reduce the full dump load."
  • Nothing disclosed about: exact job count, migration duration, batch count, batch sizes, number of rollbacks during migration, compute/storage delta between old and new (only that improvement was a non-strict goal), specific Scuba schemas, dashboard layouts, the central management service's name or implementation, the new self-managed data-warehouse service's name or implementation, the legacy customer-owned-pipelines architecture's name, the named teams reliant on critical tables, the rate at which jobs failed promotion criteria, the rate at which automatic demotion fired, full-dump duration in time or compute terms.

Systems extracted

New wiki page:

  • systems/meta-data-ingestion-system — Meta's data ingestion infrastructure scraping the social graph from MySQL into the Data Warehouse via CDC. Recently (as of 2026-05-12) migrated end-to-end from a legacy customer-owned-pipelines architecture to a simpler self-managed data-warehouse service that operates efficiently at hyperscale. Daily petabyte-scale incremental scrape. Per-job tri-layer schema: full-dump table + delta table + target table, governed by a central management service. Substrate for analytics, reporting, ML training, and product development across Meta. Canonical wiki home for the ingestion-system architecture and the migration that consolidated it.

Extended (cross-link added):

  • systems/mysql — adds Meta's MySQL deployment as canonical wiki instance of "one of the largest MySQL deployments in the world" serving as both TAO's storage substrate and the source-of-truth feed for the data ingestion system's CDC pipeline. Reinforces MySQL's CDC-source role at hyperscale.
  • systems/meta-tao — adds note that the daily-petabyte incremental-CDC pipeline against TAO's underlying MySQL is the substrate of the data-warehouse-side analytics + ML-training infrastructure, complementing TAO's online-graph-store role with the offline-analytics consumer path.
  • systems/scuba-meta — adds third canonical Scuba upstream producer alongside FBCrypto (cryptographic-monitoring counts) and Strobelight (symbolized profile samples) — now: data-quality-mismatch logs from the data ingestion system's quality analysis tool. Reinforces Scuba's role as Meta's universal warm-query substrate for fleet-wide telemetry of any structured event type.

Concepts extracted

New wiki pages:

  • concepts/cdc-bad-data-propagation — the structural hazard of any CDC system: because target state at time T+1 is computed from target state at time T plus deltas, a corrupted target at time T becomes embedded in every subsequent state without separate intervention. The CDC-specific failure mode that distinguishes CDC-system migrations from batch-system migrations.
  • concepts/migration-job-lifecycle — the per-job state machine governing where each job is in a multi-system migration: pre-shadow → shadow → reverse-shadow → cleanup. Each transition is gated by promotion criteria; each phase is reversible via demotion. Generalises beyond Meta-specific terminology.
  • concepts/shadow-job-pre-production — a job that runs in a pre-production environment, consumes the same source as a production job, and writes to a separate shadow table (not the production table). Used to validate a new ingestion path against real production data without affecting consumers.
  • concepts/reverse-shadow-phase — the phase of a Shadow → Reverse Shadow → Cleanup migration in which the shadow job is promoted to write the production table while the original production job is demoted to write the shadow table. Provides ongoing data-quality signal post-rollout and a same-shape rollback path that requires no system reconfiguration.
  • concepts/landing-latency — the time elapsed between a source-data event and the corresponding row appearing in the consumer-visible target table; the data-warehouse-equivalent of an end-to-end latency SLI. One of the three migration promotion criteria in the Meta ingestion-system migration.
  • concepts/data-quality-checksum-comparison — comparing row counts and checksums between two parallel sources of the same logical data as the canonical correctness primitive. Cheap to compute, cheap to store, and bounds the comparison cost to O(partitions) rather than O(rows).
  • concepts/partition-quality-marking — annotating a partition's metadata with a quality flag (good / bad / unknown) so downstream behaviour can branch: bad-quality delta partitions stop CDC consumption; bad-quality target partitions are substituted with older known-good partitions merged forward.
  • concepts/full-dump-vs-delta-vs-target — the canonical tri-layer schema of a CDC ingestion job: the full-dump table is a periodic snapshot of the source; the delta table captures source changes; the target table is the consumer-visible result computed as full-dump + applied deltas. Distinct from dual-layer (delta+target) CDC schemas.

Extended (cross-link added):

  • concepts/change-data-capture — extended with the CDC-migration framing: when a CDC system itself is migrated, two CDC pipelines run side by side; bad-data propagation becomes a cross-system hazard, not just within-system. Adds a Seen-in entry for this source.
  • concepts/blast-radius — extended with the partition-marking-stops-bleeding primitive as the smallest-unit blast-radius containment for CDC-style data corruption: a single bad partition is contained, not allowed to propagate forward.

Patterns extracted

New wiki pages:

  • patterns/shadow-then-reverse-shadow-migration — the canonical three-phase migration shape for two CDC systems sharing a source: (1) Shadow — new system writes to a separate shadow table; (2) Reverse Shadow — writes swap (new system writes production table; old system writes shadow table); (3) Cleanup — old system removed. Distinguishing characteristic vs patterns/parallel-run-pattern: production-target writer swaps, doesn't stay with the old system. Distinguishing characteristic vs patterns/notion-double-write-backfill-verify-switchover: two separate writers, not one writer dual-writing.
  • patterns/automated-job-lifecycle-promotion — the migration-control-loop pattern: every job continuously emits promotion-criteria signals to a telemetry substrate; an external tool reads those signals and automatically promotes or demotes the job between lifecycle phases. The mechanism that scales migration to tens of thousands of jobs without per-job manual gating.
  • patterns/partition-marking-stops-cdc-bleeding — when a CDC system detects a bad partition, annotate it in metadata rather than deleting or correcting in-place. Bad-quality delta partitions halt new data landing + alert; bad-quality target partitions are substituted with older known-good partitions merged forward. Provides both a runtime guard and a rollback-substrate index.
  • patterns/known-issue-exclusion-batch-selection — when planning a batch of jobs/changes for migration, exclude jobs known to be affected by an unresolved underlying issue until that issue is resolved. Prevents N copies of the same alert from drowning the data-quality signal during validation.
  • patterns/snapshot-reuse-from-legacy-during-migration — when migrating between two CDC systems against the same source, reuse the legacy system's most recent snapshot output as the new system's seed snapshot rather than triggering a fresh full dump. Bypasses the slow + expensive full-dump tax that CDC's first landing otherwise requires.
  • patterns/data-quality-analysis-tool-with-edge-case-logging — the periodic-log-aggregation-then-targeted-query debugging primitive on top of a warm-query telemetry store: log mismatches at first detection; a periodic job reads those logs, runs targeted queries to find example offending rows, and logs the debugging information back to the same store. Operators query the augmented log stream rather than the source data.

Caveats / gaps

  • Architecture-and-discipline voice, not architecture-of-the-new-system voice. The post says the new system is "a simpler self-managed data warehouse service that still operates efficiently at hyperscale" but does not name it, describe its components, or compare it structurally to the legacy customer-owned-pipelines architecture beyond that one sentence. This is a migration discipline post.
  • No production numbers on the migration itself. No total job count, no migration duration, no batch count or sizes, no rollback rate, no false-positive rate of data-quality detection, no full-dump cost in time or compute terms, no compute/storage delta achieved between old and new system. The headline is "100% transitioned, legacy fully deprecated" — outcome only.
  • Critical-table criteria are mentioned but not enumerated. "We defined and agreed on extra migration criteria with the teams who were reliant on the service" — for important tables — but those criteria are not stated.
  • The central management service is mentioned but not named. Likely related to Meta's broader data-platform substrate but the post does not call it out by name or describe its API surface beyond "all the information about job entities, including table names and table schemas, is saved and managed by the central management service."
  • The legacy system is not named. "customer-owned pipelines" is a description of the architectural shape (each consumer team operated their own pipeline) but no system name is attached. Without a name, cross-referencing the prior 2013-2025 pipeline-system corpus is impossible.
  • Architectural diagrams referenced but not described in text. Three images are inlined in the post (migration lifecycle, CDC data flow, bad-data propagation prevention) but their content beyond the descriptive captions is not narrated; the text leaves the reader to infer the visual structure.
  • Acknowledgements section is generic. "We would like to thank all the team members and the leadership that contributed to make this project a success in Meta." — no named team, no named contributors.

Source

Last updated · 542 distilled / 1,571 read