
PATTERN Cited by 1 source

Three-phase schema convergence

Pattern

Decompose schema evolution in a distributed data pipeline into three ordered phases:

  1. Schema Divergence — update the storage-layer schema first (e.g., Iceberg table metadata). Existing processing code continues without failure because it selects by name, not position, and new columns default to null.

  2. Code Convergence — deploy updated transformation/writer code. Deploy the batch layer first (so backfill can begin from the relevant watermark), then deploy the streaming layer (so new records are parsed correctly).

  3. Data Convergence — batch backfills historical data for the new schema; streaming processes new data correctly; the base table converges to the latest schema and content.
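The three phases can be walked through on a toy in-memory table. This is only a sketch under stated assumptions: the `Table` class, the `event` and `event_len` columns, and the `writer_v1`, `writer_v2`, and `backfill` helpers are all hypothetical stand-ins, not Iceberg, Spark, or Flink APIs.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class Table:
    """Hypothetical stand-in for a table whose columns are resolved by name."""

    def __init__(self, columns: List[str]) -> None:
        self.columns = list(columns)
        self.rows: List[Row] = []

    def add_column(self, name: str) -> None:
        # Metadata-only change: existing rows are not rewritten.
        self.columns.append(name)

    def append(self, row: Row) -> None:
        unknown = set(row) - set(self.columns)
        if unknown:
            raise ValueError(f"columns not in schema: {sorted(unknown)}")
        self.rows.append(dict(row))

    def scan(self) -> List[Row]:
        # Columns a stored row lacks are read back as None (null).
        return [{c: r.get(c) for c in self.columns} for r in self.rows]


def writer_v1(event: str) -> Row:
    """Writer code deployed before the schema change."""
    return {"event": event}


def writer_v2(event: str) -> Row:
    """Updated writer code that populates the new column."""
    return {"event": event, "event_len": len(event)}


def backfill(table: Table) -> int:
    """Fill the new column for rows written by the old code; return rows fixed."""
    fixed = 0
    for row in table.rows:
        if row.get("event_len") is None:
            row["event_len"] = len(row["event"])
            fixed += 1
    return fixed


table = Table(["event"])
table.append(writer_v1("click"))

# Phase 1, schema divergence: the schema moves ahead, the old writer keeps working.
table.add_column("event_len")
table.append(writer_v1("view"))
after_phase1 = table.scan()

# Phase 2, code convergence: the updated writer is deployed.
table.append(writer_v2("purchase"))

# Phase 3, data convergence: historical rows are backfilled.
rows_fixed = backfill(table)
converged = all(r["event_len"] is not None for r in table.scan())
```

In this toy model the table is only fully correct after the third step, which mirrors the ordering above: the schema leads, the code follows, and the data catches up last.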

Why it works

The key insight is that Iceberg's nullable-by-default semantics for new columns allow the storage schema to advance without breaking existing code. Code updates are decoupled from schema updates. Data correctness is restored last because it depends on both schema and code being in place.
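A small illustration of why name-based selection tolerates the new column while positional access would not. The column names and the `read_row` and `legacy_transform` helpers are hypothetical, and the dictionary projection only approximates how a real reader fills missing columns with null.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def read_row(stored: Row, table_columns: List[str]) -> Row:
    """Project a stored row onto the current schema; absent columns become None."""
    return {col: stored.get(col) for col in table_columns}


def legacy_transform(row: Row) -> Tuple[Any, Any]:
    """Code written before the schema change: it names the columns it needs."""
    return (row["user_id"], row["amount"])


schema_v1 = ["user_id", "amount"]
schema_v2 = ["user_id", "currency", "amount"]  # new column, not at the end

stored_row = {"user_id": 1, "amount": 9.5}  # written under schema_v1

row_v1 = read_row(stored_row, schema_v1)
row_v2 = read_row(stored_row, schema_v2)

# Name-based selection gives the same answer under both schemas.
by_name_v1 = legacy_transform(row_v1)
by_name_v2 = legacy_transform(row_v2)

# Positional access silently changes meaning once the schema advances.
second_field_v1 = list(row_v1.values())[1]  # amount
second_field_v2 = list(row_v2.values())[1]  # currency, which is None
```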

Deployment sequencing rationale

Spark (batch) deploys before Flink (streaming) because:

  - Spark is watermark-based: a failed run can be retried from the last watermark.

  - A Flink failure that outlasts Kafka's retention window causes data loss, because the unconsumed records expire before the job recovers.

  - A Spark backfill starting from the schema-evolution timestamp prepares the base table for new streaming data.
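The asymmetry between the two failure modes can be sketched in plain Python. Everything here is an assumption made for illustration: `WatermarkedBatchJob` and `still_in_log` are hypothetical models, integer timestamps stand in for real event times, and no Spark, Flink, or Kafka API is used.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Record = Tuple[int, str]  # (event timestamp, payload)


@dataclass
class WatermarkedBatchJob:
    """Hypothetical batch job that advances its watermark only on success."""

    watermark: int = 0
    output: List[Record] = field(default_factory=list)

    def run(self, source: List[Record], transform: Callable[[str], str]) -> int:
        pending = [(ts, v) for ts, v in source if ts > self.watermark]
        # If transform raises, nothing below runs, so state stays unchanged.
        results = [(ts, transform(v)) for ts, v in pending]
        self.output.extend(results)
        if pending:
            self.watermark = max(ts for ts, _ in pending)
        return len(results)


def still_in_log(log: List[Record], now: int, retention: int) -> List[Record]:
    """Records a streaming consumer can still read if it resumes at `now`."""
    return [(ts, v) for ts, v in log if ts >= now - retention]


def broken_transform(_: str) -> str:
    raise RuntimeError("bad deploy")


source = [(1, "a"), (2, "b"), (3, "c")]
job = WatermarkedBatchJob()

# A failed batch run leaves the watermark where it was.
try:
    job.run(source, broken_transform)
except RuntimeError:
    pass
watermark_after_failure = job.watermark

# The retry starts from the same watermark and sees every record again.
processed_on_retry = job.run(source, str.upper)

# A streaming consumer that is down until t=10, with a retention of 5,
# finds that the records from t=1..3 have already expired.
survivors = still_in_log(source, now=10, retention=5)
```

In this model the batch side loses nothing because its input is re-readable from the watermark, while the streaming side depends on recovering before retention expires, which is the reason given above for deploying batch first.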

Seen in
