
PATTERN

Object store as CDC log store

Pattern: when the CDC-changelog consumer is batch-oriented (warehouse sync, periodic snapshot rebuild, offline analytics), use immutable object storage — Amazon S3, Google Cloud Storage, Azure Blob — as the durable substrate for the changelog, rather than an always-on operational store like a wide-column database or a streaming broker.

When this fits

Fits when all of these hold:

  1. Base store is large enough that an in-database secondary index is priced out — see concepts/gsi-cost-anti-pattern-at-petabyte-scale.
  2. Changelog consumer is batch / periodic, not streaming. Warehouse-integration consumers reading once per hour or once per day easily tolerate S3 range-read latency (tens of milliseconds); a streaming consumer expecting single-digit-millisecond per-record latency would not.
  3. Read pattern is range-scan-by-timestamp, which S3 supports natively via prefix-listing + ranged GET against time-partitioned object keys.
  4. Write rate fits object-storage PUT semantics — typically this means the producer batches changes into files rather than PUT-per-change. (S3 PUT latency is ~tens of ms and PUT cost is non-trivial per request.)
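Criteria 3 and 4 together imply a producer that buffers changes in memory and flushes them as one object per batch under a time-partitioned key. A minimal sketch, assuming an hour-granularity key layout, newline-delimited JSON, and a caller-supplied `put_object` callback — all illustrative choices, not a documented scheme:

```python
import json
from datetime import datetime, timezone

FLUSH_THRESHOLD = 1000  # changes per object; tune to the write rate

def partition_key(ts: datetime, seq: int) -> str:
    """Time-partitioned object key: hour-granularity prefix + sequence."""
    return ts.strftime("changelog/%Y/%m/%d/%H/") + f"batch-{seq:06d}.jsonl"

class BatchingProducer:
    def __init__(self, put_object):
        self.put_object = put_object  # e.g. wraps s3_client.put_object
        self.buffer: list[dict] = []
        self.seq = 0

    def record(self, change: dict) -> None:
        self.buffer.append(change)
        if len(self.buffer) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        body = "\n".join(json.dumps(c) for c in self.buffer).encode()
        key = partition_key(datetime.now(timezone.utc), self.seq)
        self.put_object(key, body)  # one PUT per batch, not per change
        self.buffer.clear()
        self.seq += 1
```

Batching amortises the ~tens-of-ms PUT latency and the per-request PUT cost across many changes, which is what makes object-storage write semantics workable here.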

Canonical case — Segment V2

Segment's objects pipeline V2 moved its CDC changelog off Bigtable and onto S3. Rationale from the 2024-08-01 post: "We recently revamped our platform by migrating from BigTable to offset the growing cost concerns, consolidate infrastructure to AWS, and simplify the number of components." Result: roughly $0.6M/year in savings, elimination of cross-cloud egress, and consolidation onto a single cloud. (Source: sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb)

The scraped raw markdown is truncated before the V2 mechanism section, so the exact prefix layout, file format, compaction policy, and read-path semantics are not wiki-canonicalised beyond the "object store as changelog store" pattern name.
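Since the mechanism section is truncated, the read path can only be sketched generically from the pattern itself, not from Segment's code: a batch consumer turns a `[since, until)` timestamp window into hour-granularity key prefixes, then lists each prefix (e.g. `list_objects_v2`) and fetches the objects. The prefix layout below is the same illustrative assumption as above:

```python
from datetime import datetime, timedelta

def hour_prefixes(since: datetime, until: datetime) -> list[str]:
    """All hour-partition key prefixes covering [since, until)."""
    prefixes = []
    t = since.replace(minute=0, second=0, microsecond=0)
    while t < until:
        prefixes.append(t.strftime("changelog/%Y/%m/%d/%H/"))
        t += timedelta(hours=1)
    return prefixes
```

An hourly warehouse sync lists only the one or two prefixes written since its last run, so scan cost is proportional to the time window, not to the size of the base table.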

Trade-offs vs alternatives

| Changelog substrate | Read latency | $/GB·mo | Cross-cloud ok? | Streaming ok? |
| --- | --- | --- | --- | --- |
| In-DB GSI (DynamoDB) | single-digit ms | same as base (~$0.25/GB in Segment's case) | n/a (same DB) | yes |
| Wide-column external (Bigtable) | single-digit ms | ~$0.17+/GB | possible, egress-costly | yes |
| Streaming log (Kafka / Redpanda tiered) | single-digit ms hot, ms-range cold | variable | possible, egress-costly | yes |
| Object store (S3) | tens of ms | ~$0.02/GB | cheap if same cloud | no (batch only) |

The object-store answer is cheapest per byte and simplest operationally when the access pattern tolerates batch latency. It is the wrong answer when the consumer is latency-sensitive streaming — use a streaming log there.
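The per-byte gap compounds at scale. A back-of-envelope comparison using the illustrative $/GB·mo rates from the table, with a made-up 100 TB changelog size:

```python
# Monthly storage cost for a hypothetical 100 TB changelog, using the
# illustrative $/GB-month rates from the trade-off table above.
RATES = {
    "in-db GSI (DynamoDB)": 0.25,
    "wide-column (Bigtable)": 0.17,
    "object store (S3)": 0.02,
}

size_gb = 100 * 1024  # 100 TB changelog (assumed, for illustration)
costs = {name: size_gb * rate for name, rate in RATES.items()}
# At these rates S3 is 12.5x cheaper per byte than a DynamoDB-resident GSI.
```

Request costs (PUT/GET) and egress are excluded here; they shift the picture further toward batched, same-cloud object storage.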
