
Meta Data Ingestion System¶

Meta's Data Ingestion System is the infrastructure that incrementally scrapes several petabytes of social graph data per day from one of the world's largest MySQL deployments (the storage substrate underneath TAO) into Meta's data warehouse, powering "analytics, reporting, and downstream data products that teams across the company utilize for tasks ranging from day-to-day decision-making to machine learning model training and product development" (Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale).

As of 2026-05-12, Meta has fully deprecated the legacy customer-owned-pipelines architecture and migrated 100% of the workload to a "simpler self-managed data warehouse service that still operates efficiently at hyperscale."

Architecture¶

Both the legacy and the new architecture use Change Data Capture (CDC) to incrementally ingest data into the target table. Each job uses the canonical three-table CDC schema:

Table	Purpose	Owner
Full-dump table	Periodic full snapshot of source MySQL	Internal
Delta table	Captures incremental changes from source	Internal
Target table	Consumer-visible state = full-dump + applied deltas	Consumed by data customers
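
The relationship in the last row (target = full dump + applied deltas) can be sketched as follows. This is a minimal illustration, not Meta's implementation: the `Delta` record, its `seq` ordering field, and the use of `None` to represent a delete are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

Row = Dict[str, object]


@dataclass(frozen=True)
class Delta:
    """One captured change; the field names are hypothetical."""
    key: int
    seq: int             # position of the change in the source change log
    row: Optional[Row]   # None means the row was deleted at the source


def build_target(full_dump: Dict[int, Row], deltas: Iterable[Delta]) -> Dict[int, Row]:
    """Return the target state: the snapshot with deltas applied in log order."""
    target = dict(full_dump)                  # copy, so the snapshot is not mutated
    for d in sorted(deltas, key=lambda d: d.seq):
        if d.row is None:
            target.pop(d.key, None)           # delete
        else:
            target[d.key] = d.row             # insert or update
    return target


full_dump = {1: {"name": "a"}, 2: {"name": "b"}}
deltas = [
    Delta(key=2, seq=11, row=None),
    Delta(key=1, seq=10, row={"name": "a2"}),
    Delta(key=3, seq=12, row={"name": "c"}),
]
target = build_target(full_dump, deltas)
```

Here row 1 is updated, row 2 is deleted, and row 3 is inserted, while the full-dump snapshot itself is left unchanged.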

A central management service holds metadata for all job entities — table names, table schemas, and (added during the migration) per-partition data-quality flags.

Architectural shift: customer-owned-pipelines → self-managed¶

The legacy architecture left pipeline operation to the consumer teams: each team ran its own pipeline. This "functioned effectively at a small scale" but began to "show signs of instability under the increasingly strict data landing time requirements" as Meta's operations grew.

The new architecture is self-managed by a single owning team, trading per-team customisation for centralised reliability and operational efficiency. The post does not name the new service or describe its implementation beyond stating that it "still operates efficiently at hyperscale."

Migration as the primary subject of disclosure¶

The 2026-05-12 post is structurally a migration-discipline post, not a new-architecture post. It describes two parallel challenges:

Per-job seamless transition: see patterns/shadow-then-reverse-shadow-migration (Shadow → Reverse Shadow → Cleanup lifecycle).
Tens-of-thousands-of-jobs scale: see patterns/automated-job-lifecycle-promotion (continuous evaluation of job signals against three promotion criteria: data quality, landing latency, and resource utilization), patterns/known-issue-exclusion-batch-selection (defer affected jobs while the root issue is remediated), and patterns/snapshot-reuse-from-legacy-during-migration (skip the first full dump by seeding from the legacy snapshot).
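
The lifecycle and the batch selection above can be sketched as a small control loop. Everything concrete here is an assumption for illustration: the post names the three criteria but not the metrics or thresholds, so the `Signals` fields, the default thresholds, the meaning attached to each phase, and the demotion rule are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Set


class Phase(Enum):
    SHADOW = 1          # assumed: new system writes a shadow table, legacy serves consumers
    REVERSE_SHADOW = 2  # assumed: new system serves consumers, legacy runs as the shadow
    CLEANUP = 3         # assumed: legacy job and shadow tables are removed


@dataclass(frozen=True)
class Signals:
    """Hypothetical per-job signals; the post names the criteria, not the metrics."""
    quality_mismatches: int       # data-quality criterion
    landing_delay_minutes: float  # landing-latency criterion (delay relative to legacy)
    resource_ratio: float         # resource-utilization criterion (new / legacy)


def meets_criteria(s: Signals, max_delay: float = 0.0, max_ratio: float = 1.0) -> bool:
    """All three criteria must hold; the default thresholds are illustrative."""
    return (s.quality_mismatches == 0
            and s.landing_delay_minutes <= max_delay
            and s.resource_ratio <= max_ratio)


def next_phase(current: Phase, s: Signals) -> Phase:
    """Promote one step on healthy signals; demote Reverse Shadow on unhealthy ones."""
    healthy = meets_criteria(s)
    if healthy and current is not Phase.CLEANUP:
        return Phase(current.value + 1)
    if not healthy and current is Phase.REVERSE_SHADOW:
        return Phase.SHADOW
    return current


def select_batch(job_ids: Iterable[str], known_issue_jobs: Set[str], size: int) -> List[str]:
    """Pick the next migration batch, deferring jobs affected by a known open issue."""
    eligible = [j for j in job_ids if j not in known_issue_jobs]
    return eligible[:size]
```

Keeping the decision a pure function of the current phase and the latest signals means it can be re-evaluated continuously for every job, which is what makes promotion at this job count tractable without per-job manual review.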

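The hazard underneath both challenges is CDC bad-data propagation: because each new target state is built on the previous one, a single corrupted target-table partition becomes embedded in every subsequent state. Containment is via partition-level quality marking (see patterns/partition-marking-stops-cdc-bleeding).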
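
One way partition-level marking could contain the damage is to refuse a flagged partition as the base for the next build. The post does not describe the actual mechanism, so the sketch below is an assumption throughout: the function names, the flag map, and the partition labels are hypothetical.

```python
from typing import Dict, List, Optional


def pick_base_partition(partitions: List[str], is_bad: Dict[str, bool]) -> Optional[str]:
    """Return the newest partition not flagged bad, or None if every one is flagged.

    `partitions` is ordered oldest to newest; unflagged partitions count as good.
    """
    for p in reversed(partitions):
        if not is_bad.get(p, False):
            return p
    return None


def partitions_to_rebuild(partitions: List[str], base: Optional[str]) -> List[str]:
    """Partitions newer than `base`, whose deltas must be re-applied on top of it."""
    if base is None:
        return list(partitions)
    return partitions[partitions.index(base) + 1:]


# Hypothetical daily partitions; the newest one has been flagged as corrupted.
partitions = ["2026-05-01", "2026-05-02", "2026-05-03"]
flags = {"2026-05-03": True}

base = pick_base_partition(partitions, flags)
rebuild = partitions_to_rebuild(partitions, base)
```

With these inputs the build falls back to `2026-05-02` as its base and re-derives `2026-05-03`, so the corrupted state is never carried forward.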

Data-quality analysis tool¶

The tool was built for the migration and retained afterwards as part of release validation. For each landed shadow-table partition, it reads the corresponding production-table partition, compares row count and checksum, and logs mismatches to Scuba. Hourly, it reads the mismatch logs back from Scuba, runs example-row queries against the source data, and writes the resulting debugging information to Scuba. Operators then query the augmented log stream.
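
The per-partition comparison can be sketched as below. The post says only that row count and checksum are compared, so the checksum construction (a sum of per-row SHA-256 hashes, which ignores row order) and the shape of the mismatch record are assumptions. The function returns the record rather than writing it anywhere, since Scuba's logging interface is not described.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

Row = Dict[str, object]


@dataclass(frozen=True)
class Fingerprint:
    row_count: int
    checksum: int


def fingerprint(rows: Iterable[Row]) -> Fingerprint:
    """Order-independent fingerprint: row count plus a sum of per-row hashes."""
    count, total = 0, 0
    for row in rows:
        # Sort columns by name so dict insertion order does not change the hash.
        encoded = repr(sorted(row.items())).encode("utf-8")
        digest = hashlib.sha256(encoded).digest()
        total = (total + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return Fingerprint(count, total)


def compare_partition(partition: str,
                      shadow_rows: Iterable[Row],
                      production_rows: Iterable[Row]) -> Optional[dict]:
    """Return a mismatch record for the caller to log, or None if the partitions agree."""
    s, p = fingerprint(shadow_rows), fingerprint(production_rows)
    if s == p:
        return None
    return {
        "partition": partition,
        "shadow_row_count": s.row_count,
        "production_row_count": p.row_count,
        "checksum_match": s.checksum == p.checksum,
    }
```

Summing the per-row hashes, rather than hashing the concatenated rows, lets two systems that land the same rows in a different order still compare equal. A sum is used instead of XOR so that duplicated rows do not cancel out.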

See patterns/data-quality-analysis-tool-with-edge-case-logging for the abstracted shape.

Operational substrate¶

Source: MySQL (TAO's storage substrate)
Target: Meta data warehouse
Telemetry: Scuba (data-quality mismatches, job lifecycle signals, debugging context)
Metadata: Central management service (per-job table names, schemas, partition quality flags)
Operator surface: System-level and job-level dashboards; external migration tooling that auto-promotes / auto-demotes jobs between lifecycle phases.

Disclosed numbers¶

Daily ingestion: "several petabytes of social graph data"
Source: "one of the largest MySQL deployments in the world"
Job count migrated: "tens of thousands of ingestion jobs"
Migration outcome: 100% transitioned, legacy fully deprecated.

Not disclosed¶

The post does not disclose the new self-managed service's name, the legacy system's name, the central management service's name, the exact job count, migration duration, batch count, batch sizes, rollback rate during migration, the compute / storage delta between the old and new systems, the specific Scuba schemas, or dashboard layouts.

Seen in¶

sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale — canonical wiki home for the system + the migration that consolidated it.

Related¶

systems/mysql — the source-of-truth substrate
systems/meta-tao — the online-graph-store sharing the same MySQL substrate
systems/scuba-meta — the data-quality + lifecycle-signal warm store
concepts/change-data-capture — the underlying primitive
concepts/cdc-bad-data-propagation — the CDC-specific migration hazard
concepts/full-dump-vs-delta-vs-target — the tri-layer CDC schema
patterns/shadow-then-reverse-shadow-migration — the migration shape
patterns/automated-job-lifecycle-promotion — the migration-control-loop
patterns/partition-marking-stops-cdc-bleeding — the corruption-containment primitive
companies/meta — Meta company hub