

Schema evolution

Definition

Schema evolution is the problem of changing the structure of data (records, tables, messages) over time while old and new versions coexist in the system — in flight on a queue, at rest in a log, being read by older consumers that have not yet redeployed. It is the hard problem in any long-lived data pipeline: the schema is not a static contract but a moving one with backwards- and forwards-compatibility obligations.

Why it's hard in async CDC

In an async CDC pipeline, there are at minimum three clocks that can advance independently:

  1. The source database's DDL clock (schema migrations applied).
  2. The in-flight records' serialised schema version (records already written to a Kafka topic under schema v1 cannot be retroactively rewritten when the producer upgrades to v2).
  3. Each consumer's deploy clock (a sink connector or a custom downstream app may be pinned to v1 when producers switch to v2).

A schema change that looks "trivial" at the DDL layer — e.g. ALTER TABLE ... ALTER COLUMN foo SET NOT NULL — can silently break every in-flight record where foo happened to be null. That null-valued record was valid under v1, was already serialised and published, and will surface at a consumer that now expects a non-null.
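The failure mode can be sketched in a few lines. This is a hypothetical validator over toy schemas — not any particular serialisation library — just to show how a record that was valid at publish time breaks a reader that has adopted the tightened schema:

```python
# Toy schemas (hypothetical): "foo" is nullable under v1, NOT NULL under v2,
# mirroring ALTER TABLE ... ALTER COLUMN foo SET NOT NULL.
SCHEMA_V1 = {"foo": {"nullable": True}}
SCHEMA_V2 = {"foo": {"nullable": False}}

def validate(record: dict, schema: dict) -> None:
    """Raise if a field is null but the schema forbids null."""
    for field, spec in schema.items():
        if record.get(field) is None and not spec["nullable"]:
            raise ValueError(f"field {field!r} is null but schema says NOT NULL")

in_flight = {"foo": None}        # published while v1 was current: valid then
validate(in_flight, SCHEMA_V1)   # passes under v1

try:
    validate(in_flight, SCHEMA_V2)   # the upgraded consumer's view of the same record
except ValueError as err:
    print(err)   # the "trivial" DDL change breaks the already-published record
```

The record itself never changed; only the contract under which it is read did — which is why the check has to happen before the DDL is applied, not at read time.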

Datadog's framing and two-layer answer

Datadog's 2025-11-04 retrospective names schema evolution as "one of the key challenges with asynchronous replication":

"Even with schema changes in the source datastore, our platform needs to ensure that change events can be reliably replicated downstream." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)

Their answer is a two-layer solution:

Layer 1 — validate before apply (offline)

An internal automated schema management validation system analyses schema migration SQL before it's applied to the database. It catches pipeline-breaking changes; Datadog's canonical example:

"We would want to block a schema change like ALTER TABLE ... ALTER COLUMN ... SET NOT NULL because not all messages in the pipeline are guaranteed to populate that column. If a consumer gets a message where the field was null, the replication could break. Our validation checks allow us to approve most changes without manual intervention. For breaking changes, we work directly with the team to coordinate a safe rollout." (Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)

This layer is captured as patterns/schema-validation-before-deploy.
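A minimal sketch of the idea — the rule list, regexes, and function name here are illustrative, not Datadog's actual (internal) system:

```python
# Hypothetical pre-apply migration linter: scan migration SQL for change
# classes known to break in-flight records, and block those for manual review.
import re

BREAKING_RULES = [
    # Tightening nullability breaks records already published with nulls.
    (re.compile(r"ALTER\s+COLUMN\s+\w+\s+SET\s+NOT\s+NULL", re.IGNORECASE),
     "SET NOT NULL: in-flight messages are not guaranteed to populate this column"),
    # Dropping a column that downstream consumers may still read.
    (re.compile(r"DROP\s+COLUMN", re.IGNORECASE),
     "DROP COLUMN: downstream consumers may still reference this column"),
]

def review_migration(sql: str) -> list[str]:
    """Return reasons to block; an empty list means auto-approve."""
    return [reason for pattern, reason in BREAKING_RULES if pattern.search(sql)]

print(review_migration("ALTER TABLE users ALTER COLUMN foo SET NOT NULL;"))  # blocked
print(review_migration("ALTER TABLE users ADD COLUMN bar TEXT;"))            # []
```

The point of the design is the asymmetry: the safe majority of migrations return an empty list and flow through unreviewed, while the small breaking class is routed to a human-coordinated rollout.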

Layer 2 — registry-enforced backward compat (runtime)

A multi-tenant Kafka Schema Registry integrated with the source and sink connectors, configured for backward compatibility: a consumer using the new schema version must still be able to read records written under the old one. In practice this limits schema changes to adding optional fields or removing existing fields. When Debezium captures an updated schema, it serialises the data to Avro and pushes the records to the Kafka topic and the schema to the Schema Registry; the registry compares the new schema against the stored one and accepts or rejects it. This layer is captured as patterns/schema-registry-backward-compat.
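The registry-side accept/reject decision can be sketched like this. Schemas are simplified to field-to-spec dicts for illustration; a real registry applies the full Avro schema-resolution rules:

```python
# Sketch of a backward-compatibility check: can a consumer using the NEW
# schema still read records written with the OLD one? (Simplified; the real
# registry implements full Avro resolution.)

def is_backward_compatible(new_schema: dict, old_schema: dict) -> bool:
    for name, spec in new_schema.items():
        if name not in old_schema and "default" not in spec:
            # The new reader expects a field that old records never carried.
            return False
    # Fields removed in new_schema are simply ignored when reading old
    # records, so removals never break backward compatibility.
    return True

v1 = {"id": {"type": "long"}}
v2_ok = {"id": {"type": "long"}, "email": {"type": "string", "default": ""}}
v2_bad = {"id": {"type": "long"}, "email": {"type": "string"}}  # no default

print(is_backward_compatible(v2_ok, v1))   # True: registry accepts
print(is_backward_compatible(v2_bad, v1))  # False: registry rejects
```

Rejection happens at schema-registration time, before any record under the incompatible schema reaches the topic.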

Composition

Offline validation catches the pipeline-breaking class before it reaches production. The runtime registry catches the residual class that slips through (changes the pre-deploy review missed). Combined, they let source teams ship routine schema migrations without manual platform-team coordination, which is reserved for the breaking-change class.

Compatibility modes (canonical vocabulary)

| Mode | Rule | Allows |
| --- | --- | --- |
| Backward | Consumers on the new schema can read data written with the old schema | Add optional fields; remove fields |
| Forward | Consumers on the old schema can read data written with the new schema | Add fields; remove optional fields |
| Full | Both directions | Add or remove optional fields only |
| None | No checking | Anything (unsafe) |

Backward is the default for CDC pipelines because records already serialised under an old schema persist in the log and cannot be rewritten — a consumer that picks up the new schema must still be able to read every in-flight old-version record.
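All four modes reduce to one primitive — "can a reader on schema R decode data written with schema W?" A sketch over simplified field-to-spec dicts (illustrative only, not a registry API):

```python
# The four compatibility modes expressed through one primitive check.
# Schemas here are simplified {field: spec} dicts.

def can_read(reader: dict, writer: dict) -> bool:
    # Every field the reader requires must appear in the written data
    # or carry a default in the reader's schema.
    return all(name in writer or "default" in spec for name, spec in reader.items())

def compatible(new: dict, old: dict, mode: str) -> bool:
    return {
        "BACKWARD": can_read(new, old),                        # new consumers, old data
        "FORWARD":  can_read(old, new),                        # old consumers, new data
        "FULL":     can_read(new, old) and can_read(old, new), # both directions
        "NONE":     True,                                      # no checking
    }[mode]

v1 = {"id": {}}
v2 = {"id": {}, "email": {"default": ""}}   # added an optional field

print(compatible(v2, v1, "BACKWARD"))  # True
print(compatible(v2, v1, "FULL"))      # True: optional additions pass both ways
```

Note that FULL is just the conjunction of the other two directions, which is why it admits only the intersection of their allowed changes.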
