PATTERN Cited by 1 source
Three-phase schema convergence
Pattern
Decompose schema evolution in a distributed data pipeline into three ordered phases:
- Schema Divergence: update the storage-layer schema first (e.g., Iceberg table metadata). Existing processing code continues without failure because it selects columns by name, not position, and new columns default to null.
- Code Convergence: deploy the updated transformation/writer code. Deploy the batch layer first (so backfill can begin from the relevant watermark), then deploy the streaming layer (so new records are parsed correctly).
- Data Convergence: the batch layer backfills historical data for the new schema while the streaming layer processes new data correctly; the base table converges to the latest schema and content.
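The ordering above can be sketched as a small guard that refuses out-of-order steps. This is only an illustration: the `Step` and `Rollout` names are hypothetical, and splitting Code Convergence into separate batch and streaming steps is an assumption made to show the deployment order inside that phase.

```python
from enum import IntEnum


class Step(IntEnum):
    """Hypothetical step names; the numeric value encodes the required order."""
    UPDATE_STORAGE_SCHEMA = 1   # Schema Divergence
    DEPLOY_BATCH_CODE = 2       # Code Convergence, batch layer first
    DEPLOY_STREAMING_CODE = 3   # Code Convergence, streaming layer second
    BACKFILL_AND_CONVERGE = 4   # Data Convergence


class Rollout:
    """Tracks completed steps and rejects any step attempted out of order."""

    def __init__(self):
        self.done = []

    def complete(self, step):
        expected = len(self.done) + 1
        if step != expected:
            raise ValueError(f"{step.name} attempted out of order")
        self.done.append(step)


rollout = Rollout()
rollout.complete(Step.UPDATE_STORAGE_SCHEMA)

# Deploying streaming code before batch code is rejected.
try:
    rollout.complete(Step.DEPLOY_STREAMING_CODE)
    rejected = False
except ValueError:
    rejected = True

rollout.complete(Step.DEPLOY_BATCH_CODE)
rollout.complete(Step.DEPLOY_STREAMING_CODE)
rollout.complete(Step.BACKFILL_AND_CONVERGE)
```

A real rollout would gate each step on a health check rather than a simple counter; the sketch only captures the sequencing constraint.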
Why it works
The key insight is that Iceberg's nullable-by-default semantics for new columns allow the storage schema to advance without breaking existing code. Code updates are therefore decoupled from schema updates. Data correctness is restored last because correct rows can only be written once both the new schema and the new code are in place.
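A minimal sketch of why by-name reads tolerate an added column. It uses plain dictionaries rather than any Iceberg API, and the `country` column, `read_by_name`, and `legacy_transform` are hypothetical names introduced for illustration.

```python
def read_by_name(stored_row, table_schema):
    """Project a stored row onto the current schema; missing columns read as None."""
    return {column: stored_row.get(column) for column in table_schema}


def legacy_transform(row):
    """Pre-evolution code: it only knows about `id` and `email`."""
    return {"id": row["id"], "email": row["email"].lower()}


# Schema after the Schema Divergence phase (hypothetical added column `country`).
new_schema = ["id", "email", "country"]

# A row written before the schema change has no `country` value.
stored_row = {"id": 7, "email": "A@Example.com"}

row = read_by_name(stored_row, new_schema)   # `country` surfaces as None
legacy_output = legacy_transform(row)        # old code still runs unchanged
```

Because the legacy transform looks columns up by name, the extra nullable column is simply ignored; a positional reader would have no such guarantee.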
Deployment sequencing rationale
Spark (batch) deploys before Flink (streaming) because:
- Spark is watermark-based: a failed run can be retried from the last watermark.
- A Flink failure that outlasts Kafka retention loses data, because the unprocessed records expire before they can be re-read.
- A Spark backfill from the schema-evolution timestamp prepares the base table for new streaming data.
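The asymmetry between the two failure modes can be sketched in plain Python. This is a toy model, not Spark, Flink, or Kafka code: `WatermarkedBatch`, `still_retained`, the integer timestamps, and the retention value are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkedBatch:
    """Batch job that advances its watermark only after a successful run."""
    source: list                      # durable (timestamp, record) pairs
    watermark: int = 0
    sink: list = field(default_factory=list)

    def run(self, until, fail=False):
        pending = [rec for ts, rec in self.source if self.watermark < ts <= until]
        if fail:
            # Nothing is written and the watermark is untouched, so a retry is safe.
            raise RuntimeError("simulated batch failure")
        self.sink.extend(pending)
        self.watermark = until


def still_retained(log, now, retention):
    """Records a stream consumer can still read if it resumes at time `now`."""
    return [rec for ts, rec in log if now - ts < retention]


events = [(1, "a"), (2, "b"), (3, "c")]

# Batch: the first run fails, the retry resumes from the unchanged watermark.
job = WatermarkedBatch(source=events)
try:
    job.run(until=3, fail=True)
except RuntimeError:
    pass
job.run(until=3)

# Streaming: the consumer is down until t=5 and the log retains 3 time units.
stream_result = still_retained(events, now=5, retention=3)
```

In this model the batch retry recovers every record, while the stream consumer that resumes too late can read only the record that has not yet expired.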
Seen in
- sources/2026-06-24-pinterest-automated-schema-evolution-in-pinterests-next-generation-db — canonical instance: Pinterest's CDC ingestion platform uses this pattern for automated schema evolution across Kafka → Flink → Spark → Iceberg