Meta Data Ingestion System

Meta's Data Ingestion System is the infrastructure that incrementally scrapes several petabytes of social graph data per day from one of the world's largest MySQL deployments (the storage substrate underneath TAO) into Meta's data warehouse, powering "analytics, reporting, and downstream data products that teams across the company utilize for tasks ranging from day-to-day decision-making to machine learning model training and product development" (Source: sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale).

As of 2026-05-12, Meta has fully deprecated the legacy customer-owned-pipelines architecture and migrated 100% of the workload to a "simpler self-managed data warehouse service that still operates efficiently at hyperscale."

Architecture

Both the legacy and the new architecture use Change Data Capture (CDC) to incrementally ingest data into the target table. Each job follows the canonical three-table CDC layout:

Table           | Purpose                                             | Owner
Full-dump table | Periodic full snapshot of source MySQL              | Internal
Delta table     | Captures incremental changes from source            | Internal
Target table    | Consumer-visible state = full-dump + applied deltas | Consumed by data customers
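
The relationship "target = full-dump + applied deltas" can be illustrated with a minimal in-memory sketch. The source does not describe the delta format, so the `Delta` record, its `upsert` / `delete` operation names, and the integer primary key are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

Row = Dict[str, object]


@dataclass(frozen=True)
class Delta:
    """One captured change (hypothetical shape, not Meta's actual format)."""
    op: str                    # "upsert" or "delete" (assumed operation names)
    key: int                   # primary key of the affected row
    row: Optional[Row] = None  # new row contents for an upsert


def apply_deltas(full_dump: Dict[int, Row], deltas: Iterable[Delta]) -> Dict[int, Row]:
    """Build the consumer-visible target state from a snapshot plus ordered deltas."""
    target = dict(full_dump)  # shallow copy so the snapshot itself is not mutated
    for delta in deltas:
        if delta.op == "upsert":
            if delta.row is None:
                raise ValueError(f"upsert for key {delta.key} has no row")
            target[delta.key] = delta.row
        elif delta.op == "delete":
            target.pop(delta.key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown delta op: {delta.op!r}")
    return target


snapshot = {1: {"name": "a"}, 2: {"name": "b"}}
changes = [
    Delta("upsert", 2, {"name": "B"}),
    Delta("delete", 1),
    Delta("upsert", 3, {"name": "c"}),
]
target_state = apply_deltas(snapshot, changes)
```

Deltas are applied in order, so a later change to the same key overrides an earlier one.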

A central management service holds metadata for all job entities — table names, table schemas, and (added during the migration) per-partition data-quality flags.
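
As a rough picture of what such a per-job metadata record might hold, the sketch below models the three fields the source lists. The class name, field names, and the three-valued quality flag are hypothetical; the source does not disclose the service's actual data model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class PartitionQuality(Enum):
    """Assumed flag values; the source only says quality flags are per-partition."""
    UNKNOWN = "unknown"
    GOOD = "good"
    BAD = "bad"


@dataclass
class JobMetadata:
    """Hypothetical metadata record for one ingestion job."""
    full_dump_table: str
    delta_table: str
    target_table: str
    schema: Dict[str, str]  # column name -> column type
    partition_quality: Dict[str, PartitionQuality] = field(default_factory=dict)

    def mark(self, partition: str, quality: PartitionQuality) -> None:
        """Record the quality flag for one target-table partition."""
        self.partition_quality[partition] = quality

    def quality_of(self, partition: str) -> PartitionQuality:
        """Return the recorded flag, or UNKNOWN if none was recorded."""
        return self.partition_quality.get(partition, PartitionQuality.UNKNOWN)


job = JobMetadata(
    full_dump_table="example_full",
    delta_table="example_delta",
    target_table="example_target",
    schema={"id": "bigint", "name": "string"},
)
job.mark("2026-05-01", PartitionQuality.BAD)
```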

Architectural shift: customer-owned-pipelines → self-managed

The legacy architecture distributed pipeline operation across consumer teams, with each team running its own pipeline. This "functioned effectively at a small scale" but began to "show signs of instability under the increasingly strict data landing time requirements" as Meta's operations grew.

The new architecture is self-managed by a single owning team, trading per-team customisation for centralised reliability and operational efficiency. The post does not name the new service or describe its specific implementation beyond stating that it "still operates efficiently at hyperscale."

Migration as the primary subject of disclosure

The 2026-05-12 post is primarily about how the migration was run, not about the new architecture. It describes two parallel challenges:

  1. Per-job seamless transition: see patterns/shadow-then-reverse-shadow-migration (Shadow → Reverse Shadow → Cleanup lifecycle).
  2. Tens-of-thousands-of-jobs scale: see patterns/automated-job-lifecycle-promotion (continuous-signal evaluation against three promotion criteria: data quality, landing latency, and resource utilization), patterns/known-issue-exclusion-batch-selection (defer affected jobs while the root issue is remediated), and patterns/snapshot-reuse-from-legacy-during-migration (skip the first full dump by seeding from the legacy snapshot).
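
The lifecycle and promotion criteria above can be sketched as a small state machine. This is an illustration under stated assumptions, not the migration tooling itself: the boolean signals stand in for thresholds the source does not disclose, demotion by one phase on any failed criterion is an assumed policy, and `select_batch` is a hypothetical helper for known-issue exclusion.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Set


class Phase(Enum):
    SHADOW = "shadow"
    REVERSE_SHADOW = "reverse_shadow"
    CLEANUP = "cleanup"


ORDER = [Phase.SHADOW, Phase.REVERSE_SHADOW, Phase.CLEANUP]


@dataclass(frozen=True)
class Signals:
    """The three promotion criteria, reduced to pass/fail (an assumption)."""
    data_quality_ok: bool
    landing_latency_ok: bool
    resource_utilization_ok: bool

    def all_pass(self) -> bool:
        return (self.data_quality_ok
                and self.landing_latency_ok
                and self.resource_utilization_ok)


def next_phase(phase: Phase, signals: Signals) -> Phase:
    """Promote one phase if every criterion passes, otherwise demote one phase."""
    if phase is Phase.CLEANUP:
        return phase  # treated as terminal in this sketch
    i = ORDER.index(phase)
    if signals.all_pass():
        return ORDER[i + 1]
    return ORDER[max(i - 1, 0)]  # never demote below SHADOW


def select_batch(candidates: Iterable[str], known_issue_jobs: Set[str],
                 batch_size: int) -> List[str]:
    """Pick the next jobs to migrate, deferring jobs hit by a known issue."""
    batch: List[str] = []
    for job in candidates:
        if len(batch) >= batch_size:
            break
        if job not in known_issue_jobs:
            batch.append(job)
    return batch


healthy = Signals(True, True, True)
slow = Signals(True, False, True)
```

Evaluating `next_phase` repeatedly against fresh signals gives the auto-promote and auto-demote behaviour without operator involvement for each job.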

The hazard underneath both challenges is CDC bad-data propagation: because deltas are applied on top of existing state, a single corrupted target-table partition becomes embedded in every subsequent state. Containment is via partition-level quality marking.
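
A minimal sketch of why partition-level marking contains the problem, assuming partitions are ordered in time and each one is derived from its predecessor. The function names and the boolean flag representation are hypothetical; the source does not describe how flags are stored or consumed.

```python
from typing import Dict, List, Optional


def flag_from(partitions: List[str], first_bad: str) -> Dict[str, bool]:
    """Map each partition to is_good, flagging first_bad and every later one.

    Later partitions are flagged because, in this sketch, each is built on
    top of the previous state and so inherits the corruption.
    """
    if first_bad not in partitions:
        raise ValueError(f"unknown partition: {first_bad!r}")
    start = partitions.index(first_bad)
    return {p: i < start for i, p in enumerate(partitions)}


def latest_good(partitions: List[str], flags: Dict[str, bool]) -> Optional[str]:
    """Return the newest partition still marked good, or None if there is none."""
    for p in reversed(partitions):
        if flags[p]:
            return p
    return None


days = ["2026-05-01", "2026-05-02", "2026-05-03", "2026-05-04"]
flags = flag_from(days, "2026-05-03")
safe = latest_good(days, flags)
```

With flags kept per partition, a consumer can fall back to the newest good partition instead of treating the whole table as suspect.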

Data-quality analysis tool

The tool was built for the migration and retained afterwards as part of release validation. It runs in two stages:

  1. For each landed shadow-table partition: read the corresponding production-table partition, compare row count and checksum, and log mismatches to Scuba.
  2. Hourly: read the mismatch logs from Scuba, run example-row queries against the source data, and log the debugging information back to Scuba.

Operators query the augmented log stream.
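
The per-partition comparison of row count and checksum might look like the sketch below. The checksum scheme (a sum of SHA-256 row digests, so row order does not matter), the record fields, and the plain list standing in for Scuba are all assumptions; the source does not say how the checksum is computed or what the log schema is.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Tuple[object, ...]
MODULUS = 1 << 256  # keeps the accumulated checksum at 256 bits


def partition_fingerprint(rows: Iterable[Row]) -> Tuple[int, str]:
    """Return (row_count, checksum) for a partition, independent of row order."""
    count = 0
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        # Addition (not XOR) so that duplicated rows do not cancel out.
        acc = (acc + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, f"{acc:064x}"


def compare_partition(job: str, partition: str,
                      shadow_rows: Iterable[Row], production_rows: Iterable[Row],
                      mismatch_log: List[Dict[str, object]]) -> bool:
    """Compare shadow and production partitions; append a record on mismatch.

    mismatch_log is an in-memory stand-in for the telemetry sink.
    """
    shadow_count, shadow_sum = partition_fingerprint(shadow_rows)
    prod_count, prod_sum = partition_fingerprint(production_rows)
    if (shadow_count, shadow_sum) == (prod_count, prod_sum):
        return True
    mismatch_log.append({
        "job": job,
        "partition": partition,
        "shadow_rows": shadow_count,
        "production_rows": prod_count,
        "checksum_match": shadow_sum == prod_sum,
    })
    return False


log: List[Dict[str, object]] = []
same = compare_partition("job_a", "2026-05-01",
                         [(1, "a"), (2, "b")], [(2, "b"), (1, "a")], log)
different = compare_partition("job_a", "2026-05-02",
                              [(1, "a")], [(1, "a"), (2, "b")], log)
```

The hourly second stage would then read these records and attach example rows from the source; that step is omitted here because it depends on query interfaces the source does not describe.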

See patterns/data-quality-analysis-tool-with-edge-case-logging for the abstracted shape.

Operational substrate

  • Source: MySQL (TAO's storage substrate)
  • Target: Meta data warehouse
  • Telemetry: Scuba (data-quality mismatches, job lifecycle signals, debugging context)
  • Metadata: Central management service (per-job table names, schemas, partition quality flags)
  • Operator surface: System-level + job-level dashboards; external migration tooling that auto-promotes / auto-demotes jobs between lifecycle phases.

Disclosed numbers

  • Daily ingestion: "several petabytes of social graph data"
  • Source: "one of the largest MySQL deployments in the world"
  • Job count migrated: "tens of thousands of ingestion jobs"
  • Migration outcome: 100% transitioned, legacy fully deprecated.

Not disclosed

The new self-managed service's name, the legacy system's name, the central management service's name, exact job count, migration duration, batch count, batch sizes, rollback rate during migration, compute / storage delta achieved between old and new system, the specific Scuba schemas, dashboard layouts.
