

Pinterest CDC Ingestion Platform

Pinterest's next-generation CDC-based ingestion platform is built on Kafka, Flink, Spark, and Iceberg. It ingests change-data-capture (CDC) events from upstream source databases and materialises them into Iceberg tables (a CDC table and a base table) for offline analytics and ML consumption.

Architecture

The data flow has four layers:

  1. CDC layer — emits raw row updates from source databases (DDL changes surface as CDC messages)
  2. Kafka — transports CDC events
  3. Flink — parses source records, applies type conversion and custom transformation logic, writes to the CDC Iceberg table
  4. Spark — periodically reads the CDC table and upserts into the base Iceberg table; bootstrap jobs load historical data for table initialisation
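
The upsert in step 4 can be illustrated with a small in-memory sketch. This is not Pinterest's Spark job: the `CdcEvent` fields (`key`, `op`, `seq`), the op codes, and the delete handling are all assumptions made for illustration, and plain dictionaries stand in for the CDC and base Iceberg tables.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class CdcEvent:
    """Hypothetical shape of one row in the CDC table."""
    key: str                               # primary key of the source row (assumed)
    op: str                                # "upsert" or "delete" (assumed op codes)
    seq: int                               # ordering position of the change (assumed)
    row: Optional[Dict[str, Any]] = None   # full row image for upserts


def apply_cdc_batch(
    base: Dict[str, Dict[str, Any]],
    events: Iterable[CdcEvent],
) -> Dict[str, Dict[str, Any]]:
    """Return a new base table with a batch of CDC events applied."""
    # Keep only the most recent event per key, so out-of-order or
    # repeated changes within a batch collapse to a single winner.
    latest: Dict[str, CdcEvent] = {}
    for event in events:
        current = latest.get(event.key)
        if current is None or event.seq > current.seq:
            latest[event.key] = event

    # Apply the winners to a copy of the base table.
    merged = dict(base)
    for key, event in latest.items():
        if event.op == "delete":
            merged.pop(key, None)
        else:
            merged[key] = dict(event.row or {})
    return merged
```

Deduplicating to the latest event per key before merging keeps the result independent of the order in which events appear within a batch, which is why the sketch needs an explicit ordering field.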

Each layer relates to schema differently: Kafka is schema-transparent; Flink carries parsing and type-conversion logic generated from the schema; Spark carries upsert logic; Iceberg enforces structural schema at the table level.
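
To make "logic generated from the schema" concrete, the following is a minimal sketch of deriving a record parser from a schema description. The schema format (column name mapped to a type name), the type names, and the `build_parser` helper are hypothetical; the source does not describe how the platform's Flink code is generated.

```python
from typing import Any, Callable, Dict, Optional


def _to_bool(value: Any) -> bool:
    """Accept real booleans as well as common string encodings."""
    if isinstance(value, bool):
        return value
    return str(value).lower() in ("true", "1")


# Assumed mapping from schema type names to Python converters.
CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "long": int,
    "double": float,
    "string": str,
    "boolean": _to_bool,
}

Parser = Callable[[Dict[str, Any]], Dict[str, Optional[Any]]]


def build_parser(schema: Dict[str, str]) -> Parser:
    """Build a parser for one table from its column-to-type schema."""
    # Resolve converters once, up front; an unknown type fails here
    # (KeyError) instead of failing later on individual records.
    plan = [(name, CONVERTERS[type_name]) for name, type_name in schema.items()]

    def parse(record: Dict[str, Any]) -> Dict[str, Optional[Any]]:
        parsed: Dict[str, Optional[Any]] = {}
        for name, convert in plan:
            value = record.get(name)
            # Missing or null fields become None; extra fields are dropped.
            parsed[name] = None if value is None else convert(value)
        return parsed

    return parse
```

For example, `build_parser({"id": "long", "score": "double"})` returns a function that turns `{"id": "42", "score": "1.5"}` into `{"id": 42, "score": 1.5}`. Because the parser is derived entirely from the schema, regenerating it after a schema change is a mechanical step.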

Automated Schema Evolution

The platform includes an automated schema evolution framework that propagates supported (additive) schema changes across all layers without manual intervention.
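
A rough sketch of how a change might be classified as additive before it is propagated is shown below. It assumes that "additive" means only new columns appear, with no removals or type changes; the `SchemaDiff` type and both function names are hypothetical and do not come from the platform itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SchemaDiff:
    added: Dict[str, str]   # new column name -> type name
    removed: List[str]      # columns present before but not after
    retyped: List[str]      # columns whose type name changed


def diff_schemas(old: Dict[str, str], new: Dict[str, str]) -> SchemaDiff:
    """Compare two column-to-type schemas."""
    added = {name: t for name, t in new.items() if name not in old}
    removed = [name for name in old if name not in new]
    retyped = [name for name in old if name in new and old[name] != new[name]]
    return SchemaDiff(added=added, removed=removed, retyped=retyped)


def is_additive(diff: SchemaDiff) -> bool:
    """True only when columns were added and nothing else changed."""
    return bool(diff.added) and not diff.removed and not diff.retyped
```

Under these assumptions, a change that passes `is_additive` could be applied automatically at each layer, while anything else would fall back to manual handling.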

Relationship to Other Pinterest Systems

The CDC ingestion platform is structurally upstream of the user-sequence platform — the Iceberg tables it produces are consumed by downstream ML pipelines including the Pinterest Foundation Model and TransAct training data.

