Pinterest CDC Ingestion Platform
Pinterest's next-generation CDC-based ingestion platform is built on Kafka, Flink, Spark, and Iceberg. It ingests change-data-capture (CDC) events from upstream source databases and materialises them into Iceberg tables (a CDC table and a base table) for offline analytics and ML consumption.
Architecture
The data flow has four layers:
- CDC layer — emits raw row updates from source databases (DDL changes surface as CDC messages)
- Kafka — transports CDC events
- Flink — parses source records, applies type conversion and custom transformation logic, writes to the CDC Iceberg table
- Spark — periodically reads the CDC table and upserts into the base Iceberg table; bootstrap jobs load historical data for table initialisation
Each layer relates to the schema differently: Kafka is schema-transparent; Flink carries parsing and type-conversion logic generated from the schema; Spark carries upsert logic; Iceberg enforces structural schema at the table level.
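As a rough illustration of the Spark upsert step described above, the sketch below folds a batch of CDC events into a base table held in memory. It is not Pinterest's implementation: the `CdcEvent` fields (`key`, `seq`, `op`, `row`), the "latest sequence number wins" rule, and the handling of deletes are all assumptions made for the example, and plain dictionaries stand in for the Iceberg tables.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class CdcEvent:
    # All field names here are hypothetical, chosen for illustration only.
    key: int                               # primary key of the source row
    seq: int                               # increasing change sequence number
    op: str                                # "upsert" or "delete"
    row: Optional[Dict[str, Any]] = None   # full row image for upserts


def apply_cdc_batch(
    base: Dict[int, Dict[str, Any]],
    events: Iterable[CdcEvent],
) -> Dict[int, Dict[str, Any]]:
    """Return a new base table with the latest change per key applied."""
    # Step 1: keep only the most recent event for each primary key.
    latest: Dict[int, CdcEvent] = {}
    for ev in events:
        current = latest.get(ev.key)
        if current is None or ev.seq > current.seq:
            latest[ev.key] = ev

    # Step 2: upsert (or delete) into a copy of the base table.
    merged = dict(base)
    for key, ev in latest.items():
        if ev.op == "delete":
            merged.pop(key, None)
        else:
            merged[key] = dict(ev.row or {})
    return merged


base = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
events = [
    CdcEvent(key=1, seq=10, op="upsert", row={"id": 1, "name": "a2"}),
    CdcEvent(key=3, seq=11, op="upsert", row={"id": 3, "name": "c"}),
    CdcEvent(key=2, seq=12, op="delete"),
    CdcEvent(key=1, seq=9, op="upsert", row={"id": 1, "name": "stale"}),
]
result = apply_cdc_batch(base, events)
```

Here `result` contains the updated row 1 and the new row 3, while row 2 is gone and the stale event for row 1 is ignored. Deduplicating per key before merging keeps the merge itself a simple keyed overwrite, which is the general shape of a periodic CDC-to-base upsert.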
Automated Schema Evolution
The platform includes an automated schema evolution framework that propagates supported (additive) schema changes across all layers without manual intervention:
- Push + pull detection for upstream schema drift
- Automated Iceberg schema updates via catalog API
- Code regeneration for Flink and Spark from the latest schema and sink configuration
- PR-based rollout for auditability
- Three-phase convergence (schema → code → data)
- SLA-based eventual consistency rather than atomic propagation
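The distinction between supported (additive) and unsupported changes, and the schema → code → data ordering, can be sketched as follows. This is a minimal example under stated assumptions: schemas are modelled as plain column-to-type mappings, the function names `diff_schema` and `plan_evolution` are hypothetical, and treating dropped columns and type changes as "escalate" cases is an assumption about what falls outside the additive path rather than something the source spells out.

```python
from typing import Dict, List, Tuple

Schema = Dict[str, str]  # column name -> type name (simplified model)


def diff_schema(current: Schema, upstream: Schema) -> Tuple[List[str], List[str]]:
    """Return (added_columns, unsupported_changes) between two schemas."""
    added = [col for col in upstream if col not in current]
    unsupported: List[str] = []
    for col, typ in current.items():
        if col not in upstream:
            unsupported.append(f"dropped column: {col}")
        elif upstream[col] != typ:
            unsupported.append(f"type change: {col} {typ} -> {upstream[col]}")
    return added, unsupported


def plan_evolution(current: Schema, upstream: Schema) -> List[str]:
    """Plan the three convergence phases for an additive change."""
    added, unsupported = diff_schema(current, upstream)
    if unsupported:
        # Assumption: non-additive drift is handed to a human.
        return ["escalate: " + "; ".join(unsupported)]
    if not added:
        return []  # schemas already agree, nothing to do
    return [
        f"schema: add columns {added} to the Iceberg tables",
        "code: regenerate Flink and Spark jobs and open a PR",
        "data: new columns populate once regenerated jobs are deployed",
    ]


current = {"id": "long", "name": "string"}
additive = {"id": "long", "name": "string", "email": "string"}
breaking = {"id": "string", "name": "string"}

additive_plan = plan_evolution(current, additive)
breaking_plan = plan_evolution(current, breaking)
```

With these inputs, `additive_plan` has three steps in schema, code, data order, and `breaking_plan` is a single escalation entry. Because the phases complete at different times, a new column can exist in the table before any job writes to it, which is why consistency here is eventual and bounded by an SLA rather than atomic.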
Relationship to Other Pinterest Systems
The CDC ingestion platform is structurally upstream of the user-sequence platform — the Iceberg tables it produces are consumed by downstream ML pipelines including the Pinterest Foundation Model and TransAct training data.
Seen in
- sources/2026-06-24-pinterest-automated-schema-evolution-in-pinterests-next-generation-db — full description of the automated schema evolution framework, three-phase convergence model, and future zero-gap design