Skip to content

PINTEREST 2026-06-24

Read original ↗

Automated Schema Evolution in Pinterest's Next-Generation DB Ingestion Framework

Summary

Pinterest Engineering (Yisheng Zhou, Liang Mou, Gabriel Raphael Garcia Montoya, Istvan Podor) document how they built automated schema evolution into their CDC-based ingestion platform spanning Kafka, Flink, Spark, and Iceberg. The core thesis: in a distributed CDC pipeline, schema is not just metadata — it is a cross-system contract spanning ingestion, transformation, storage, and historical backfill. A schema change that is not handled carefully can break Flink jobs, block Spark upserts, or create inconsistencies between online and offline representations. Their solution treats schema evolution as a multi-stage convergence process rather than an atomic operation, preserving pipeline availability while gradually restoring schema and data correctness within a bounded SLA window.

Key Takeaways

  1. Schema as cross-system contract. A source schema change requires coordinated updates to CDC source configuration, Kafka provisioning, Flink transformation code, Spark writer logic, Iceberg table definitions, and bootstrap queries — all driven by the same schema. Manual updates across these layers increase drift risk. (Source: Introduction)

  2. Additive-only restriction is deliberate. Only additive changes (add columns, widen numeric precision) are automated. Type narrowing, primary key changes, and lossy conversions require coordinated migration or full re-onboarding. This preserves backward compatibility and avoids historical replay complexity. (Source: Supported Schema Changes)

  3. Three-phase convergence model. Phase 1 (Schema Divergence): Iceberg schemas updated first; existing jobs write null for new columns. Phase 2 (Code Convergence): Spark updated first (for backfill from watermark), then Flink (for new records). Phase 3 (Data Convergence): Spark backfills historical data, Flink processes new data correctly, base table converges. (Source: A Three-Phase Convergence Model)

  4. Push + pull detection. Push-based: upstream DDL CDC message triggers immediate comparison against Iceberg catalog API. Pull-based: daily comparison job independently detects drift. Push gives low-latency response; pull is the safety net. (Source: How Schema Evolution Works)

  5. PR-based rollout with version auditing. All schema evolution changes flow through an automated PR workflow — regenerated Flink/Spark code, updated Iceberg schemas, refreshed bootstrap queries — providing auditability, versioning, and reviewable safety. (Source: How Schema Evolution Works)

  6. SLA-based eventual consistency, not atomicity. Schema changes are acceptable within a predictable window because downstream consumers don't require real-time schema reflection. Temporary divergence is bounded; deployment sequencing restores full consistency. (Source: SLA and Deployment Strategy)

  7. Flink deployment sensitivity. A failed Flink job + Kafka retention expiration = data loss. Mitigation: staging validation step — update staging pipeline first, verify, then deploy to production. Spark is less sensitive because it's watermark-based and can resume from the last checkpoint. (Source: SLA and Deployment Strategy)

  8. Stable column identifiers enable safe diffing. Tables with a dedicated schema definition file carry per-column stable numeric identifiers that persist across revisions, enabling unambiguous tracking across reorders and renames. Tables exposing only CREATE TABLE lack this; ambiguity is resolved via Skeema + binlog audit trail. (Source: Onboarding + Unsupported and Edge Cases)

  9. Concurrent changes serialized. Only one schema evolution workflow runs at a time; subsequent changes are queued and processed sequentially, preventing race conditions. (Source: Unsupported and Edge Cases)

  10. Future: zero-gap schema evolution. A dynamic Iceberg sink that applies schema updates directly at the CDC table layer, writing new fields through a generic conversion path before Flink code is deployed. Unsupported types route to a dead-letter queue. Waiting on a Flink version bump. (Source: Toward Zero-Gap Schema Evolution)

Architecture

The data flow: source databases → CDC layer (emits raw row updates) → Kafka (transport) → Flink (parse, type-convert, transform, write to CDC Iceberg table) → Spark (periodic upsert from CDC table into base Iceberg table) + bootstrap jobs (historical data load).

Schema evolution automation sits in the control plane and orchestrates: detect drift (push/pull) → compare against Iceberg catalog → if supported change → update Iceberg schemas → regenerate Flink + Spark code → open PR → deploy (staging → production).

Operational Numbers

  • Supported changes: add column, widen numeric precision (e.g., INT → BIGINT)
  • Unsupported (require migration): primary key changes, columns with default values, sensitive data in non-sensitive pipelines, type narrowing, lossy conversions
  • Convergence model: 3 phases (schema → code → data)
  • Detection mechanisms: 2 (push-based DDL CDC + pull-based daily comparison)
  • Concurrency: serial (one workflow at a time, queued)

Caveats

  • No quantitative SLA disclosed (hours? minutes? days?)
  • No throughput or scale numbers (tables managed, changes per day)
  • Zero-gap design is aspirational, blocked on Flink version
  • Skeema-based ambiguity resolution for CREATE TABLE diffs adds a non-trivial operational dependency
  • Column transformations (e.g., epoch millis → Iceberg TIMESTAMP) are handled via sink-configuration annotations — not described in full detail
  • No discussion of cross-datacenter schema propagation

Source

Last updated · 559 distilled / 1,651 read