
CONCEPT Cited by 1 source

Changelog as secondary index

A CDC changelog's essential job is to act as a secondary index on the base table, keyed on (modified-timestamp, item-id), that answers the query "what items changed since timestamp T?". The base table's primary key does not expose that query.
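A minimal in-memory sketch of that index, for illustration only. The class and method names (`ChangelogIndex`, `record_change`, `changed_since`) are hypothetical and do not come from the Segment post, and "since T" is assumed here to mean modified-timestamp >= T.

```python
import bisect


class ChangelogIndex:
    """Toy secondary index keyed on (modified_ts, item_id).

    Hypothetical illustration: timestamps are integers, item ids are strings.
    """

    def __init__(self):
        # Kept sorted by (modified_ts, item_id) so range scans are cheap.
        self._entries = []

    def record_change(self, modified_ts, item_id):
        """Add one index entry for a created or modified item."""
        bisect.insort(self._entries, (modified_ts, item_id))

    def changed_since(self, t):
        """Return ids of items with modified_ts >= t, in first-seen order."""
        # (t,) sorts before every (t, item_id), so this finds the first
        # entry whose timestamp is >= t.
        start = bisect.bisect_left(self._entries, (t,))
        seen = set()
        result = []
        for _, item_id in self._entries[start:]:
            if item_id not in seen:
                seen.add(item_id)
                result.append(item_id)
        return result
```

The point of the sketch is the key shape: because entries sort by timestamp first, "changed since T" becomes a single range scan, which a lookup by item id alone cannot provide.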

Canonical framing

From Segment's 2024-08-01 objects-pipeline post: "The primary purpose of changelog in our pipeline was to provide the ability to query newly created/modified DynamoDB Items for downstream systems. In our case, these were warehouse integrations. Ideally, we could have easily achieved this using DynamoDB's Global Secondary Index which would minimally contain: an ID field which uniquely identifies a DynamoDB Item; a TimeStamp field for sorting and filtering." (Source: sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb)

The "ideally ... GSI" wording is the key to this framing: the changelog is a secondary index whose entries happen to live outside the base store rather than inside it.

Why the changelog-as-index framing matters

Most CDC discussion on this wiki is framed in terms of stream semantics (logical replication, binlog replication, change streams, Spanner change streams): per-record, low-latency deltas consumed by streaming downstream systems. The Segment framing takes a different angle: the CDC log is fundamentally not about streaming but about offering a query surface that the base table's primary key cannot. For batch consumers such as warehouse integrations, that query surface is "items modified since T", and the CDC log is simply the cheapest place to materialise it.

Structural implication

The canonical database answer to "items modified since T" is a secondary index on the modification timestamp. When that index is effectively free to operate (OLTP-scale tables, OLTP-scale write rates), keep it in the database. When its storage cost crosses a threshold (see concepts/gsi-cost-anti-pattern-at-petabyte-scale), the cheaper answer is to materialise the index outside the base store as a CDC changelog, living in:

  • A time-ordered wide-column store like Bigtable (Segment V1), or
  • Immutable object storage like S3 (Segment V2, canonicalised as patterns/object-store-as-cdc-log-store), or
  • A streaming log (Kafka / Redpanda), which is how the streaming-CDC ecosystem makes this same trade-off.

All three are instances of "the changelog is a secondary index whose entries live outside the base store"; they differ only in the storage technology chosen for the externalised index.
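To make the externalised index concrete, here is a rough sketch in which a plain dictionary stands in for the external store. Everything in it is an assumption made for illustration: the hourly partitioning, the `changelog/` key prefix, and the names `partition_key` and `ExternalChangelog` are hypothetical, and the source does not describe Segment's actual key layout.

```python
from collections import defaultdict

# Hypothetical choice: group changelog entries into one partition per hour.
PARTITION_SECONDS = 3600


def partition_key(modified_ts):
    """Map a timestamp to a zero-padded, time-ordered key prefix."""
    bucket = modified_ts - modified_ts % PARTITION_SECONDS
    return f"changelog/{bucket:012d}"


class ExternalChangelog:
    """Index entries stored outside the base table, grouped by time bucket."""

    def __init__(self):
        # Stand-in for wide-column rows, objects, or log segments.
        self._store = defaultdict(list)

    def append(self, modified_ts, item_id):
        """Write one (modified_ts, item_id) entry into its time partition."""
        self._store[partition_key(modified_ts)].append((modified_ts, item_id))

    def changed_since(self, t):
        """Return sorted ids of items with modified_ts >= t."""
        start = partition_key(t)
        ids = set()
        # Zero-padded keys sort lexicographically in time order, so older
        # partitions can be skipped without reading their entries.
        for key in sorted(self._store):
            if key < start:
                continue
            for ts, item_id in self._store[key]:
                if ts >= t:
                    ids.add(item_id)
        return sorted(ids)
```

A batch consumer would call `changed_since` with its last checkpoint and then fetch the returned items from the base table by primary key. Swapping the dictionary for a different backend changes the storage cost and latency, not the shape of the query.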

Seen in
