

Segment objects pipeline

The objects pipeline is Twilio Segment's service for storing authoritative state of every Segment "object" (a logical entity in their customer data platform) and feeding changes to downstream warehouse integrations that keep customer data warehouses up-to-date. (Source: sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb)

Scale (as of 2024-08-01)

  • Throughput: "hundreds of thousands of messages per second."
  • Base table: DynamoDB, ~1 PB, 958 billion items (up from 399 billion at V1), average item size ~900 bytes.
  • Base-table storage cost: $0.25 / GB / month → implied ~$250,000 / month ≈ $3M / year for storage alone.
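The implied storage bill, and the consistency of the item count with the ~1 PB figure, can be checked with back-of-the-envelope arithmetic (illustrative only; the post does not publish this breakdown):

```python
# Storage cost implied by the published figures.
TABLE_SIZE_GB = 1_000_000        # ~1 PB expressed in GB
PRICE_PER_GB_MONTH = 0.25        # DynamoDB storage price cited above, USD

monthly = TABLE_SIZE_GB * PRICE_PER_GB_MONTH
yearly = monthly * 12
print(f"${monthly:,.0f} / month, ${yearly:,.0f} / year")
# $250,000 / month, $3,000,000 / year

# Cross-check: item count x average item size vs. the ~1 PB table size.
ITEMS = 958_000_000_000
AVG_ITEM_BYTES = 900
approx_tb = ITEMS * AVG_ITEM_BYTES / 1e12
print(f"~{approx_tb:.0f} TB of raw item data")   # ~862 TB, consistent with ~1 PB
```

The ~862 TB of raw item payload sits below the ~1 PB billed size, as expected once per-item overhead and indexing are included.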

Architecture — V1 (three components)

Per the 2024-08-01 post: "The system originally consisted of a Producer service, DynamoDB, and BigTable."

 Ingest ── Producer ──┬──▶ DynamoDB          (authoritative current state, ~1 PB)
                      └──▶ BigTable (GCP)    (CDC changelog, for warehouse feed)
                             └──▶ Warehouse integrations
                                  (scan by modified-time since T)

Key design choices of V1:

  • Authoritative state in DynamoDB, chosen for its partition-key scale-out and operational ergonomics at hundreds-of-thousands-per-second write rate.
  • Changelog in BigTable, chosen because "it suited our requirements quite well. It provided low-latency read and write access to data, which made it suitable for real-time applications." BigTable's time-ordered row-key ergonomics make "items modified since T" a range scan rather than a full table scan.
  • Rejected alternative — DynamoDB GSI: "Ideally, we could have easily achieved this using DynamoDB's Global Secondary Index which would minimally contain: an ID field which uniquely identifies a DynamoDB Item; a TimeStamp field for sorting and filtering. But due to the very large size of our table, creating a GSI for the table is not cost-efficient."
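The range-scan property behind the BigTable choice can be illustrated with a hypothetical time-ordered row key. The post does not describe Segment's actual key scheme; `changelog_row_key` and the `timestamp#id` layout below are assumptions for illustration:

```python
# Why a time-ordered row key turns "items modified since T" into a
# range scan: lexicographic key order equals chronological order.
def changelog_row_key(modified_at_ms: int, item_id: str) -> str:
    # Zero-padded millisecond timestamp first, so string comparison
    # sorts rows by modification time; item ID breaks ties.
    return f"{modified_at_ms:013d}#{item_id}"

def scan_since(rows: dict, since_ms: int) -> list:
    """Range scan: every key at or after the since_ms boundary."""
    start = f"{since_ms:013d}#"
    return [k for k in sorted(rows) if k >= start]

rows = {
    changelog_row_key(1722470400000, "obj-a"): {"name": "a"},
    changelog_row_key(1722474000000, "obj-b"): {"name": "b"},
    changelog_row_key(1722477600000, "obj-a"): {"name": "a2"},
}
print(scan_since(rows, 1722474000000))  # the two rows modified since T
```

A GSI keyed the same way (ID plus timestamp, as the quote describes) would give DynamoDB the equivalent query, but at GSI storage cost proportional to the ~1 PB base table.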

Architecture — V2 (consolidated to AWS, S3 as changelog)

Per the 2024-08-01 post lede: "We recently revamped our platform by migrating from BigTable to offset the growing cost concerns, consolidate infrastructure to AWS, and simplify the number of components."

 Ingest ── Producer ──┬──▶ DynamoDB          (authoritative current state, ~1 PB)
                      └──▶ Amazon S3         (CDC changelog, AWS-native)
                             └──▶ Warehouse integrations

Headline result: ~$0.6M / year in savings, attributed to the combination of:

  1. Cross-cloud egress elimination (no more GCP→AWS data transfer on the read path).
  2. Storage-unit-cost delta (S3 is cheaper than BigTable per byte for a store with this access pattern).
  3. Simplified component count (one fewer operational substrate to run + one cloud boundary eliminated).
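One way an S3 changelog could keep "changes since T" cheap for warehouse consumers is time-partitioned key prefixes, so a reader only LISTs the hours it has not yet processed. The post does not disclose Segment's actual prefix layout; the `changes/YYYY/MM/DD/HH/` scheme below is an assumption for illustration:

```python
# Sketch of a time-partitioned S3 changelog key layout (assumed, not
# Segment's published design).
from datetime import datetime, timedelta, timezone

def change_key(ts: datetime, batch_id: str) -> str:
    # e.g. changes/2024/08/01/13/batch-0001.jsonl
    return ts.strftime("changes/%Y/%m/%d/%H/") + f"{batch_id}.jsonl"

def prefixes_since(since: datetime, until: datetime) -> list:
    """Hour-granularity prefixes a consumer would enumerate,
    one S3 LIST request per prefix."""
    cur = since.replace(minute=0, second=0, microsecond=0)
    out = []
    while cur <= until:
        out.append(cur.strftime("changes/%Y/%m/%d/%H/"))
        cur += timedelta(hours=1)
    return out

t0 = datetime(2024, 8, 1, 12, tzinfo=timezone.utc)
print(prefixes_since(t0, t0 + timedelta(hours=2)))
# ['changes/2024/08/01/12/', 'changes/2024/08/01/13/', 'changes/2024/08/01/14/']
```

Because both the Producer and the warehouse integrations live in AWS, every write and read of this changelog stays inside one cloud, which is where the egress savings in item 1 come from.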

Canonicalised on the wiki as the "object store as CDC log store" pattern.

What's not disclosed (scraped-body truncation)

The raw markdown of the 2024-08-01 post is truncated at ~37 lines — it ends just after naming BigTable as the V1 changelog and does not cover:

  • S3 prefix layout, partition scheme, file format, compaction model.
  • Producer-side dual-write semantics (synchronous? DynamoDB-Streams-driven? failure recovery?).
  • Warehouse-integration read path (how consumers enumerate + deduplicate new changes).
  • Absolute latency numbers for either V1 or V2.
  • Migration mechanics (dual-running? cutover? back-fill?).
  • Decomposition of the $0.6M/year savings across its three components above.

These are flagged on the source page as caveats; a future re-scrape could lift the truncation.
