
PATTERN

Object store as CDC log store

Pattern: when the CDC-changelog consumer is batch-oriented (warehouse sync, periodic snapshot rebuild, offline analytics), use immutable object storage — Amazon S3, Google Cloud Storage, Azure Blob — as the durable substrate for the changelog, rather than an always-on operational store like a wide-column database or a streaming broker.

When this fits

Fits when all of these hold:

  1. Base store is large enough that an in-database secondary index is priced out — see concepts/gsi-cost-anti-pattern-at-petabyte-scale.
  2. Changelog consumer is batch / periodic, not streaming. Warehouse-integration consumers reading once per hour or once per day easily tolerate S3 range-read latency (tens of milliseconds); a streaming consumer expecting single-digit-millisecond per-record latency would not.
  3. Read pattern is range-scan-by-timestamp, which S3 supports natively via prefix-listing + ranged GET against time-partitioned object keys.
  4. Write rate fits object-storage PUT semantics — typically this means the producer batches changes into files rather than PUT-per-change. (S3 PUT latency is ~tens of ms and PUT cost is non-trivial per request.)
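Criteria 3 and 4 together imply a producer that buffers changes in memory and flushes them as one object per batch under a time-partitioned key. A minimal sketch, assuming an hour-granularity key layout, newline-delimited JSON, and a caller-supplied `put_object` callback — all illustrative choices, not a documented scheme:

```python
import json
from datetime import datetime, timezone

FLUSH_THRESHOLD = 1000  # changes per object; tune to the write rate

def partition_key(ts: datetime, seq: int) -> str:
    """Time-partitioned object key: hour-granularity prefix + sequence."""
    return ts.strftime("changelog/%Y/%m/%d/%H/") + f"batch-{seq:06d}.jsonl"

class BatchingProducer:
    def __init__(self, put_object):
        self.put_object = put_object  # e.g. wraps s3_client.put_object
        self.buffer: list[dict] = []
        self.seq = 0

    def record(self, change: dict) -> None:
        self.buffer.append(change)
        if len(self.buffer) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        body = "\n".join(json.dumps(c) for c in self.buffer).encode()
        key = partition_key(datetime.now(timezone.utc), self.seq)
        self.put_object(key, body)  # one PUT per batch, not per change
        self.buffer.clear()
        self.seq += 1
```

Batching amortises the ~tens-of-ms PUT latency and the per-request PUT cost across many changes, which is what makes object-storage write semantics workable here.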

Canonical case — Segment V2

Segment's objects pipeline V2 moved its CDC changelog off Bigtable and onto S3. Rationale from the 2024-08-01 post: "We recently revamped our platform by migrating from BigTable to offset the growing cost concerns, consolidate infrastructure to AWS, and simplify the number of components." Result: roughly $0.6M/year in savings, elimination of cross-cloud egress, and consolidation onto a single cloud. (Source: sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb)

The scraped raw markdown is truncated before the V2 mechanism section, so the exact prefix layout, file format, compaction policy, and read-path semantics are not wiki-canonicalised beyond the "object store as changelog store" pattern name.
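Since the mechanism section is truncated, the read path can only be sketched generically from the pattern itself, not from Segment's code: a batch consumer turns a `[since, until)` timestamp window into hour-granularity key prefixes, then lists each prefix (e.g. `list_objects_v2`) and fetches the objects. The prefix layout below is the same illustrative assumption as above:

```python
from datetime import datetime, timedelta

def hour_prefixes(since: datetime, until: datetime) -> list[str]:
    """All hour-partition key prefixes covering [since, until)."""
    prefixes = []
    t = since.replace(minute=0, second=0, microsecond=0)
    while t < until:
        prefixes.append(t.strftime("changelog/%Y/%m/%d/%H/"))
        t += timedelta(hours=1)
    return prefixes
```

An hourly warehouse sync lists only the one or two prefixes written since its last run, so scan cost is proportional to the time window, not to the size of the base table.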

Trade-offs vs alternatives

| Changelog substrate | Read latency | $/GB·mo | Cross-cloud ok? | Streaming ok? |
| --- | --- | --- | --- | --- |
| In-DB GSI (DynamoDB) | single-digit ms | same as base (~$0.25/GB in Segment's case) | n/a (same DB) | yes |
| Wide-column external (Bigtable) | single-digit ms | ~$0.17+/GB | possible, egress-costly | yes |
| Streaming log (Kafka / Redpanda tiered) | single-digit ms hot, ms-range cold | variable | possible, egress-costly | yes |
| Object store (S3) | tens of ms | ~$0.02/GB | cheap if same cloud | no (batch only) |

The object-store answer is cheapest per byte and simplest operationally when the access pattern tolerates batch latency. It is the wrong answer when the consumer is latency-sensitive streaming — use a streaming log there.
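The per-byte gap compounds at scale. A back-of-envelope comparison using the illustrative $/GB·mo rates from the table, with a made-up 100 TB changelog size:

```python
# Monthly storage cost for a hypothetical 100 TB changelog, using the
# illustrative $/GB-month rates from the trade-off table above.
RATES = {
    "in-db GSI (DynamoDB)": 0.25,
    "wide-column (Bigtable)": 0.17,
    "object store (S3)": 0.02,
}

size_gb = 100 * 1024  # 100 TB changelog (assumed, for illustration)
costs = {name: size_gb * rate for name, rate in RATES.items()}
# At these rates S3 is 12.5x cheaper per byte than a DynamoDB-resident GSI.
```

Request costs (PUT/GET) and egress are excluded here; they shift the picture further toward batched, same-cloud object storage.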
