PATTERN
Object store as CDC log store¶
Pattern: when the CDC-changelog consumer is batch-oriented (warehouse sync, periodic snapshot rebuild, offline analytics), use immutable object storage — Amazon S3, Google Cloud Storage, Azure Blob — as the durable substrate for the changelog, rather than an always-on operational store like a wide-column database or a streaming broker.
When this fits¶
Fits when all of these hold:
- Base store is large enough that an in-database secondary index is priced out — see concepts/gsi-cost-anti-pattern-at-petabyte-scale.
- Changelog consumer is batch / periodic, not streaming. Warehouse-integration consumers that read once per hour or once per day easily tolerate S3 range-read latency (tens of milliseconds); a streaming consumer expecting single-digit-millisecond per-record latency would not.
- Read pattern is range-scan-by-timestamp, which S3 supports natively via prefix-listing + ranged GET against time-partitioned object keys.
- Write rate fits object-storage PUT semantics — typically this means the producer batches changes into files rather than PUT-per-change. (S3 PUT latency is ~tens of ms and PUT cost is non-trivial per request.)
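The key-layout and read-path criteria above can be sketched in a few lines. This is a generic illustration of the pattern, not Segment's actual scheme (the source is truncated before the V2 mechanism section); the `cdc/dt=.../hr=...` layout and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def object_key(batch_time: datetime, part: int, prefix: str = "cdc") -> str:
    """Time-partitioned key for one batched changelog file, bucketed by UTC hour.
    The producer batches many changes into a single object (PUT-per-change would
    be too slow and too expensive under S3 request pricing)."""
    t = batch_time.astimezone(timezone.utc)
    return f"{prefix}/dt={t:%Y-%m-%d}/hr={t:%H}/part-{part:05d}.jsonl.gz"

def hour_prefixes(start: datetime, end: datetime, prefix: str = "cdc") -> list[str]:
    """Prefixes a batch consumer would LIST for a range-scan-by-timestamp:
    one per UTC hour touched, followed by ranged GETs against the listed objects."""
    t = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    out = []
    while t <= end:
        out.append(f"{prefix}/dt={t:%Y-%m-%d}/hr={t:%H}/")
        t += timedelta(hours=1)
    return out
```

With this layout, "read all changes between 13:00 and 15:00 UTC" becomes a `ListObjectsV2` call per hourly prefix plus a GET per listed object, which is exactly the batch access pattern the criteria above describe.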
Canonical case — Segment V2¶
Segment's objects pipeline V2 moved its CDC changelog off Bigtable and onto S3. Rationale from the 2024-08-01 post: "We recently revamped our platform by migrating from BigTable to offset the growing cost concerns, consolidate infrastructure to AWS, and simplify the number of components." Result: roughly $0.6M/year in savings, elimination of cross-cloud egress, and consolidation onto a single cloud. (Source: sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb)
The scraped raw markdown is truncated before the V2 mechanism section, so the exact prefix layout, file format, compaction policy, and read-path semantics are not wiki-canonicalised beyond the "object store as changelog store" pattern name.
Trade-offs vs alternatives¶
| Changelog substrate | Read latency | $/GB·mo | Cross-cloud ok? | Streaming ok? |
|---|---|---|---|---|
| In-DB GSI (DynamoDB) | single-digit ms | same as base (~$0.25/GB in Segment's case) | n/a (same DB) | yes |
| Wide-column external (Bigtable) | single-digit ms | ~$0.17+/GB | possible, egress-costly | yes |
| Streaming log (Kafka / Redpanda tiered) | single-digit ms hot, ms-range cold | variable | possible, egress-costly | yes |
| Object store (S3) | tens of ms | ~$0.02/GB | cheap if same cloud | no (batch only) |
The object-store answer is cheapest per byte and simplest operationally when the access pattern tolerates batch latency. It is the wrong answer when the consumer is latency-sensitive streaming — use a streaming log there.
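Back-of-the-envelope arithmetic using the table's per-GB-month figures makes the gap concrete. The 500 TB changelog size here is a made-up illustration, not a number from the source:

```python
# Monthly storage cost for a hypothetical 500 TB changelog, at the
# per-GB-month rates from the trade-off table above.
changelog_gb = 500_000  # 500 TB, illustrative only
rates_per_gb_mo = {"DynamoDB GSI": 0.25, "Bigtable": 0.17, "S3": 0.02}
monthly_cost = {name: changelog_gb * rate for name, rate in rates_per_gb_mo.items()}
# S3 comes out roughly an order of magnitude cheaper per byte:
# about $10k/mo vs $85k-$125k/mo for the operational stores at this size.
```

Request and egress costs are excluded; they reinforce the same conclusion when producer writes are batched and the consumer stays in the same cloud.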
Composes with¶
- concepts/tiered-storage-to-object-store — same underlying economic argument (object storage at ~$0.02/GB is an order of magnitude cheaper than operational stores) now applied to CDC changelogs rather than streaming broker segments.
- patterns/tiered-storage-to-object-store — the analogous pattern in streaming brokers; this pattern is its sibling at the CDC-changelog altitude.
- concepts/changelog-as-secondary-index — the "CDC log is a secondary index" framing, of which "materialise it in an object store" is one concrete answer.
Seen in¶
- sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb — canonical wiki instance. Segment's V2 objects pipeline uses S3 as the changelog store for its petabyte-scale DynamoDB base table.