Segment — $0.6M/year savings by using S3 for change-data-capture for DynamoDB¶
Summary¶
Twilio Segment (2024-08-01) posts a cost-and-consolidation retrospective on their objects pipeline — the service that stores current state of every Segment object in a ~1 PetaByte DynamoDB table (~958 billion items as of writing, up from 399B "back then") and feeds changes to downstream warehouse integrations. The article frames the V1 design: a Producer service writes authoritative state to DynamoDB and simultaneously maintains a changelog in Google Cloud BigTable — a cross-cloud changelog store used to answer the query "what DynamoDB items were created or modified since timestamp T?" on behalf of downstream warehouse integrations. The load-bearing design question the post crystallises is: why not a DynamoDB Global Secondary Index (GSI) on (item-id, timestamp)? The answer is cost. At ~1 PB of base-table storage and $0.25/GB per month, a GSI would roughly double the storage bill for the whole pipeline — an unacceptable operational cost for a changelog whose only read pattern is "scan by modified-time since T". V2 of the pipeline migrates the changelog off BigTable and onto Amazon S3, consolidating infrastructure to AWS (cross-cloud egress elimination) and cutting ~$0.6M/year of costs while also simplifying the component count. The scraped raw markdown is truncated to ~37 lines — it ends immediately after naming BigTable as the V1 changelog and does not cover the V2 S3 mechanism, layout, partitioning, compaction, or failure semantics. The ingestion canonicalises what the post disclosed before truncation and flags the mechanism gaps as caveats.
Key takeaways¶
- Segment's objects pipeline stores Segment object state in a ~1 PB DynamoDB table (~958 billion items, growing) — authoritative state, with BigTable as a cross-cloud changelog feeding warehouse integrations. Verbatim: "the objects pipeline processes hundreds of thousands of messages per second and stores the data state in a DynamoDB table. This data is used by the warehouse integrations to keep the customer warehouses up-to-date. The system originally consisted of a Producer service, DynamoDB, and BigTable. We had this configuration for the longest time, with DynamoDB and BigTable being key components which powered our batch pipeline." First wiki datum on Segment's objects pipeline at petabyte scale. (Source: sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb)
- Operational numbers disclosed (the core of the cost-driven redesign argument). Verbatim:
- "Avg size of item is 900 Bytes."
- "Monthly storage costs $0.25/GB."
- "Total size of the table today is ~1 PetaByte."
- "The total number of items back then was 399 Billion. Today it is 958 Billion and still growing."
  These are the canonical numbers: ~900-byte average item size, 958 billion items, ~1 PB table, $0.25/GB·month storage, item count more than doubling since V1. Baseline monthly storage cost implied: ~1,000,000 GB × $0.25 = ~$250,000/month ≈ $3M/year for the DynamoDB base table alone — context for the subsequent "don't double this via a GSI" argument.
- The changelog's job description: give downstream warehouse integrations a query surface for newly-created-or-modified DynamoDB items. Verbatim: "The primary purpose of changelog in our pipeline was to provide the ability to query newly created/modified DynamoDB Items for downstream systems. In our case, these were warehouse integrations. Ideally, we could have easily achieved this using DynamoDB's Global Secondary Index which would minimally contain: an ID field which uniquely identifies a DynamoDB Item; a TimeStamp field for sorting and filtering." Canonicalised as changelog as secondary index — the changelog exists to answer a query the base table's primary key cannot: "items modified since T", sorted by modification time.
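The changelog-as-secondary-index shape can be modelled in a few lines — a toy sketch (names and data are illustrative, not Segment's), assuming the changelog is a sorted index of (modified-timestamp, item-id) pairs alongside the base table:

```python
import bisect

class ChangelogIndex:
    """Toy model of a changelog-as-secondary-index: answers
    'which item IDs were created/modified since T?' — the query
    the base table's primary key cannot serve."""

    def __init__(self):
        self._entries = []  # kept sorted as (timestamp, item_id)

    def record(self, item_id, ts):
        # The producer appends an index entry on every write.
        bisect.insort(self._entries, (ts, item_id))

    def modified_since(self, t):
        # Range scan: everything at or after timestamp t,
        # already sorted by modification time.
        i = bisect.bisect_left(self._entries, (t,))
        return [item_id for _, item_id in self._entries[i:]]

idx = ChangelogIndex()
idx.record("obj-a", 100)
idx.record("obj-b", 250)
idx.record("obj-a", 300)  # same item modified again
print(idx.modified_since(200))  # → ['obj-b', 'obj-a']
```

Whether this index lives in a GSI, BigTable, or S3 is exactly the storage-engine choice the post is about; the query surface stays the same.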
- Cost-driven rejection of the GSI answer — Global Secondary Index as an anti-pattern at petabyte scale. Verbatim: "due to the very large size of our table, creating a GSI for the table is not cost-efficient." Canonicalised as GSI cost anti-pattern at petabyte scale — at ~1 PB of base-table storage at $0.25/GB·month, a GSI's storage footprint is large enough that the changelog-query surface cannot be paid for via an in-database secondary index. Secondary indexes that are operationally free on small OLTP tables stop being free once base-table storage crosses sufficient thresholds; the changelog must live outside the base storage engine. Fits the wiki's broader keep-index-and-base-data-separate framing at a new altitude (petabyte-scale DynamoDB).
- V1 changelog lived in Google Cloud BigTable — a cross-cloud design. Verbatim: "In the V1 system, we used BigTable as our changelog as it suited our requirements quite well. It provided low-latency read and write access to data, which made it suitable for real-time applications." First wiki datum on Segment using BigTable operationally in a DynamoDB-centric pipeline. The design decision is structural: BigTable's time-ordered row-key ergonomics (row keys can encode <shard>#<timestamp>#<id>) make "scan items modified since T" a range scan rather than a full-table scan — the exact shape a changelog needs. The trade-off paid was cross-cloud operational complexity and egress cost: DynamoDB lives in AWS, BigTable lives in Google Cloud, and the Producer service has to write both in the commit path, plus warehouse readers have to pull changelog data out of GCP.
- V2 migrated the changelog from BigTable to S3, eliminating cross-cloud operational and egress cost and saving ~$0.6M/year. Verbatim: "We recently revamped our platform by migrating from BigTable to offset the growing cost concerns, consolidate infrastructure to AWS, and simplify the number of components. In this blog post, we will take a closer look at the objects pipeline, specifically focusing on how the changelog sub-system which powers warehouse products has evolved over time. We will also share how we reduced the operational footprint and achieved significant cost savings by migrating to S3 as our changelog data store." Canonicalised as object store as CDC log store — using immutable object storage as the durable substrate for a CDC changelog feeding downstream batch consumers, trading the random-read latency of a BigTable scan for the cheaper per-byte storage + list / range-get semantics of object storage. Fits the broader wiki trajectory of tiered storage to object store now applied to CDC changelogs rather than streaming broker logs or warehouse data.
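The `<shard>#<timestamp>#<id>` row-key encoding mentioned for the V1 BigTable changelog can be made concrete — a hedged sketch (shard count, key widths, and hash function are invented here; Segment discloses none of them) showing why "modified since T" becomes a per-shard lexicographic range scan over an ordered key space rather than a full-table scan:

```python
import bisect
import zlib

N_SHARDS = 4  # illustrative only; real shard counts are not disclosed

def row_key(shard, ts, item_id):
    # Zero-padded numeric fields make lexicographic order match
    # numeric order, so an ordered store can range-scan by time.
    return f"{shard:02d}#{ts:013d}#{item_id}"

def shard_of(item_id):
    return zlib.crc32(item_id.encode()) % N_SHARDS  # deterministic

# A sorted list stands in for BigTable's ordered row-key space.
rows = sorted(
    row_key(shard_of(i), ts, i)
    for i, ts in [("obj-a", 100), ("obj-b", 250), ("obj-c", 300)]
)

def scan_since(rows, t):
    """Per shard: seek to <shard>#<t> and read forward until the
    shard prefix ends ('$' sorts just after '#')."""
    out = []
    for shard in range(N_SHARDS):
        lo = bisect.bisect_left(rows, f"{shard:02d}#{t:013d}#")
        hi = bisect.bisect_left(rows, f"{shard:02d}$")
        out.extend(rows[lo:hi])
    return out

print(scan_since(rows, 200))  # only items with timestamp >= 200
```

The same since-T query over S3 would presumably lean on key-prefix listing instead of row-key seeks, but the scraped body does not disclose the V2 layout.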
- Infrastructure consolidation to a single cloud was co-equal with the cost argument. Verbatim: "consolidate infrastructure to AWS, and simplify the number of components." Canonicalises [[concepts/cross-cloud-cost-consolidation|cross-cloud cost consolidation]] as a first-class driver of re-platform decisions — cross-cloud pipelines pay egress + operational-complexity cost on every boundary crossing, and consolidating to one cloud eliminates that category of cost outright. Segment's title framing "$0.6M/year savings" is the aggregate of both the S3-cheaper-than-BigTable delta and the cross-cloud-egress elimination.
Canonical verbatim claims¶
- Pipeline framing: "the objects pipeline processes hundreds of thousands of messages per second and stores the data state in a DynamoDB table. This data is used by the warehouse integrations to keep the customer warehouses up-to-date."
- Scale numbers: "Avg size of item is 900 Bytes. Monthly storage costs $0.25/GB. Total size of the table today is ~1 PetaByte. The total number of items back then was 399 Billion. Today it is 958 Billion and still growing."
- Changelog purpose: "The primary purpose of changelog in our pipeline was to provide the ability to query newly created/modified DynamoDB Items for downstream systems."
- GSI rejection: "due to the very large size of our table, creating a GSI for the table is not cost-efficient."
- V1 changelog choice: "In the V1 system, we used BigTable as our changelog as it suited our requirements quite well. It provided low-latency read and write access to data, which made it suitable for real-time applications."
- V2 motivation: "We recently revamped our platform by migrating from BigTable to offset the growing cost concerns, consolidate infrastructure to AWS, and simplify the number of components."
Systems, concepts, and patterns canonicalised¶
New canonical wiki pages:
- systems/segment-objects-pipeline — Segment's petabyte-scale DynamoDB + changelog pipeline feeding warehouse integrations.
- systems/google-bigtable — GCP-hosted wide-column store, used as V1 changelog.
- concepts/changelog-as-secondary-index — what a CDC changelog actually provides: a secondary index on (item-id, modified-timestamp) that the base table's PK does not expose.
- concepts/gsi-cost-anti-pattern-at-petabyte-scale — cost-driven rejection of DynamoDB GSI at petabyte scale.
- concepts/cross-cloud-cost-consolidation — infrastructure consolidation to a single cloud as co-equal driver alongside absolute storage-unit-cost delta.
- patterns/object-store-as-cdc-log-store — S3 (or equivalent) as the durable substrate for a CDC changelog feeding downstream batch consumers.
Wiki pages extended:
- systems/dynamodb — first wiki datum on DynamoDB at ~1 PB / ~958 B-item scale + canonical GSI-cost-at-petabyte-scale rejection case.
- systems/aws-s3 — canonical instance of S3 as the substrate for a CDC changelog (not object storage for user data; storage of an append-only modification stream with list + range-read semantics).
- concepts/change-data-capture — adds the changelog-for-batch-warehouse-feed shape (as distinct from CDC for streaming / real-time invalidation), and adds Segment's cost-driven changelog-store-selection framing.
- concepts/secondary-index — adds the petabyte-scale cost-bound on in-database secondary indexes.
- companies/segment — new company page.
Operational numbers disclosed¶
| Metric | Value |
|---|---|
| Messages / sec into pipeline | hundreds of thousands |
| DynamoDB table size | ~1 PetaByte |
| DynamoDB item count (at V1) | 399 billion |
| DynamoDB item count (at writing) | 958 billion |
| Average item size | ~900 bytes |
| Monthly storage price | $0.25 / GB |
| Annual savings from V2 migration | ~$0.6 million |
Derived: baseline monthly DynamoDB-storage cost ≈ 1,000,000 GB × $0.25 = ~$250,000/month ≈ $3M/year. A GSI would add storage cost of the same order on top — how much depends on the projection: a full-attribute projection stores roughly the base table's bytes again, and even a keys-only projection carries per-item key and index overhead across ~958 billion items. At petabyte scale this is a material multiplier on the total bill.
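The derived figures can be checked end-to-end from the disclosed numbers alone; a small sketch (the GSI line assumes a full-attribute projection, which is an assumption, not a disclosure — and note the post rounds the item-derived ~0.86 PB up to "~1 PetaByte"):

```python
# Disclosed numbers from the post.
items = 958e9             # ~958 billion items
avg_item_bytes = 900      # ~900 bytes per item
price_per_gb_month = 0.25 # DynamoDB storage, $/GB-month

# Sanity check: item count × avg size vs the disclosed ~1 PB.
table_gb = items * avg_item_bytes / 1e9
print(f"table size ≈ {table_gb / 1e6:.2f} PB")  # → table size ≈ 0.86 PB

# Baseline storage bill for the base table alone.
monthly = table_gb * price_per_gb_month
print(f"storage ≈ ${monthly:,.0f}/month, ${monthly * 12 / 1e6:.1f}M/year")

# A full-projection GSI stores roughly the same bytes again,
# so it would roughly double this line item.
print(f"full-projection GSI adds ≈ ${monthly * 12 / 1e6:.1f}M/year")
```

Using the item-derived 0.86 PB instead of the rounded 1 PB gives ~$216k/month ≈ $2.6M/year, the same order of magnitude as the table's ~$250k/month figure; either way, doubling it for a GSI dwarfs the whole $0.6M/year savings headline.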
Caveats¶
- Raw file truncated at ~37 lines. The scraped markdown ends immediately after naming BigTable as the V1 changelog and does not include the V2 S3 mechanism — layout, partitioning scheme, file format, compaction cadence, read-path semantics for warehouse integrations, failure modes, rollout strategy, or measurement methodology for the $0.6M/year savings claim. The canonical primary source is the original post URL, which contains the full V2 architecture; the wiki can re-ingest if the scraper fetches the full body later.
- No V2 mechanism disclosure in scraped body: S3 prefix layout, partition key (shard / hour / day?), file format (JSON? Parquet? newline-delimited?), write cadence (streaming PUT per change? buffered batch PUT?), compaction model (does V2 rewrite small files into larger ones? is there a manifest?), read consistency model (eventual only? read-your-writes for the producer? snapshot isolation per warehouse query?), and failure model (what happens if a PUT fails? duplicate detection?) are all elided in the scraped content.
- No V2 cost breakdown: the $0.6M/year headline is the aggregate of cross-cloud egress elimination + storage-unit-price delta + compute delta on the producer and reader sides, but the scraped body does not decompose it. We cannot tell from the scraped content how much of the savings came from each source.
- No producer-side write-commit semantics disclosed: in V1, the Producer writes to both DynamoDB and BigTable; in V2 presumably to DynamoDB + S3. The scraped content does not discuss whether the changelog write is synchronous-2PC-style, dual-write-with-failure-recovery, or DynamoDB-Streams-driven asynchronous materialisation (the latter would match the broader ecosystem's preferred CDC shape).
- No warehouse-integration read path disclosed: what the warehouse-integration consumers do with the changelog — full re-read, incremental sync by timestamp, chunked by S3 prefix, coordinated via a manifest — is elided in the scraped content.
- No latency numbers: V1 BigTable is described as "low-latency read and write" but no P50/P99 numbers are disclosed; V2 S3's latency profile under the warehouse-integration query pattern is not quantified. (S3 range-GET latency on 1-MB-ish objects is typically tens of milliseconds; BigTable random-read latency is single-digit milliseconds — this is a real axis of trade-off that the scraped body does not discuss.)
- Author attribution absent from scraped body (scraper likely stripped byline).
Cross-source continuity¶
- Companion to the wiki's broader CDC framing. Segment's objects-pipeline changelog is the canonical "CDC for batch warehouse feed" shape — distinct from the streaming-CDC shape of Redpanda Connect / Debezium / PlanetScale Connect (per-record low-latency delivery to downstream streaming consumers) and distinct from the cache-invalidation-CDC shape of Figma's LiveGraph. Warehouse-feed CDC is batch-oriented, periodic-scan-based, and dominated by storage cost per byte, because the data sits in the changelog for hours-to-days before being consumed.
- Related to Canva's 2024-04-29 count-billions post, which also discussed DynamoDB as a scale-out OLTP store for event-shaped workloads and also moved processing out of the OLTP engine (into Snowflake). Segment's structural parallel is moving the secondary-index surface out of the OLTP engine rather than the processing layer.
- Related to Expedia's MERGE INTO post: downstream of Segment's changelog, warehouse-integration consumers are likely doing the CDC-delta-to-current-state merge pattern Expedia canonicalised for Iceberg workloads.
- Related to tiered-storage-to-object-store trajectory — the same economic argument (object storage at $0.02/GB·month vs base storage at $0.25/GB·month) that drives tiered storage in streaming brokers also drives CDC-changelog migrations from operational stores to object stores.
Source¶
- Original: https://segment.com/blog/S3-for-changedatacapture-dynamodb-table/
- Raw markdown: raw/segment/2024-08-01-06myear-savings-by-using-s3-for-changedatacapture-for-dynamo-44fca15b.md
- Hacker News discussion: https://news.ycombinator.com/item?id=41131136
Related¶
- systems/segment-objects-pipeline
- systems/dynamodb
- systems/google-bigtable
- systems/aws-s3
- concepts/change-data-capture
- concepts/changelog-as-secondary-index
- concepts/gsi-cost-anti-pattern-at-petabyte-scale
- concepts/secondary-index
- concepts/cross-cloud-cost-consolidation
- patterns/object-store-as-cdc-log-store
- companies/segment