
Pinterest CDC Ingestion Platform

Pinterest's next-generation CDC-based ingestion platform, built on Kafka, Flink, Spark, and Iceberg. The platform ingests change-data-capture (CDC) events from upstream source databases and materialises them into Iceberg tables (a CDC table and a base table) for offline analytics and ML consumption.

Architecture

The data flow has four layers:

CDC layer — emits raw row updates from source databases (DDL changes surface as CDC messages)
Kafka — transports CDC events
Flink — parses source records, applies type conversion and custom transformation logic, writes to the CDC Iceberg table
Spark — periodically reads the CDC table and upserts into the base Iceberg table; bootstrap jobs load historical data for table initialisation
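
The Spark upsert step can be illustrated with a minimal pure-Python sketch. It is not the platform's Spark job: the event shape (`key`, `seq`, `op`, `row`), the handling of deletes, and the names `CdcEvent`, `latest_per_key`, and `apply_batch` are all assumptions made for illustration. The idea shown is that a batch read from the CDC table is reduced to the latest change per key and then merged into the base table.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class CdcEvent:
    """Hypothetical CDC record; field names are assumptions, not the platform's schema."""
    key: int              # primary key of the source row
    seq: int              # increasing change sequence, used to order events
    op: str               # "upsert" or "delete" (assumed vocabulary)
    row: Optional[dict]   # full row image for upserts, None for deletes


def latest_per_key(events: Iterable[CdcEvent]) -> Dict[int, CdcEvent]:
    """Keep only the highest-seq event for each key in the batch."""
    latest: Dict[int, CdcEvent] = {}
    for ev in events:
        cur = latest.get(ev.key)
        if cur is None or ev.seq > cur.seq:
            latest[ev.key] = ev
    return latest


def apply_batch(base: Dict[int, dict], events: Iterable[CdcEvent]) -> Dict[int, dict]:
    """Return a new base table with one batch of CDC events merged in."""
    merged = dict(base)  # copy so the input table is left untouched
    for key, ev in latest_per_key(events).items():
        if ev.op == "delete":
            merged.pop(key, None)
        else:
            merged[key] = dict(ev.row or {})
    return merged
```

Reducing to one event per key before merging means out-of-order events inside a batch do not overwrite newer values with older ones.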

Each layer relates to schema differently: Kafka is schema-transparent; Flink carries parsing and type-conversion logic generated from the schema; Spark carries upsert logic; Iceberg enforces structural schema at the table level.
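
To make "logic generated from the schema" concrete, here is a small Python sketch of a schema-driven parser. The type names, the `CONVERTERS` table, and `build_parser` are hypothetical and stand in for whatever the generated Flink code actually does. The point is only that the parser is derived from the schema, so a new upstream column is ignored until the parser is rebuilt from the updated schema.

```python
from typing import Any, Callable, Dict

# Hypothetical mapping from source column types to Python converters.
CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "bigint": int,
    "double": float,
    "string": str,
    "boolean": lambda v: str(v).strip().lower() in ("1", "true"),
}


def build_parser(schema: Dict[str, str]) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a record parser from a {column: type_name} schema."""
    unknown = sorted(set(schema.values()) - set(CONVERTERS))
    if unknown:
        raise ValueError(f"no converter for types: {unknown}")

    def parse(record: Dict[str, Any]) -> Dict[str, Any]:
        out: Dict[str, Any] = {}
        for column, type_name in schema.items():
            raw = record.get(column)
            # Missing or null source values stay null after conversion.
            out[column] = None if raw is None else CONVERTERS[type_name](raw)
        return out

    return parse
```

A parser built from an older schema silently drops fields it does not know about, which is why code regeneration has to follow a schema change.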

Automated Schema Evolution

The platform includes an automated schema evolution framework that propagates supported (additive) schema changes across all layers without manual intervention:

Push-based and pull-based detection of upstream schema drift
Automated Iceberg schema updates via catalog API
Code regeneration for Flink and Spark from the latest schema and sink configuration
PR-based rollout for auditability
Three-phase convergence (schema → code → data)
SLA-based eventual consistency rather than atomic propagation
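
The detection and convergence steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the framework's implementation: schemas are modelled as plain `{column: type}` dicts, "additive" is taken to mean new columns only, dropped columns and type changes are assumed to be unsupported, and `diff_schema` and `plan_rollout` are hypothetical names. Only the phase order (schema, then code, then data) comes from the description above.

```python
from typing import Dict, List, Tuple

Schema = Dict[str, str]  # column name -> type name (simplified model)


def diff_schema(current: Schema, upstream: Schema) -> Tuple[Dict[str, str], List[str]]:
    """Split upstream drift into additive columns and unsupported changes."""
    added = {col: typ for col, typ in upstream.items() if col not in current}
    unsupported: List[str] = []
    for col, typ in current.items():
        if col not in upstream:
            unsupported.append(f"dropped column: {col}")
        elif upstream[col] != typ:
            unsupported.append(f"type change: {col} {typ} -> {upstream[col]}")
    return added, unsupported


def plan_rollout(current: Schema, upstream: Schema) -> List[str]:
    """Return ordered rollout steps, or a manual-review step for unsupported drift."""
    added, unsupported = diff_schema(current, upstream)
    if unsupported:
        return ["manual review: " + "; ".join(unsupported)]
    if not added:
        return []  # schemas already agree, nothing to do
    cols = ", ".join(sorted(added))
    return [
        f"schema: add columns to Iceberg tables ({cols})",
        f"code: regenerate Flink and Spark logic and open a PR ({cols})",
        f"data: new columns populate once regenerated jobs are deployed ({cols})",
    ]
```

Because the three phases complete at different times, the new column exists in the table before data arrives for it, which is the eventual-consistency behaviour the SLA bounds.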

Relationship to Other Pinterest Systems

The CDC ingestion platform is structurally upstream of the user-sequence platform — the Iceberg tables it produces are consumed by downstream ML pipelines including the Pinterest Foundation Model and TransAct training data.

Seen in

sources/2026-06-24-pinterest-automated-schema-evolution-in-pinterests-next-generation-db — full description of the automated schema evolution framework, three-phase convergence model, and future zero-gap design