
CONCEPT Cited by 1 source

GSI cost anti-pattern at petabyte scale

A DynamoDB Global Secondary Index (GSI) stores a projection of the base table keyed on a different attribute set, and is billed separately for its own storage and write capacity. On small tables this is ergonomic and nearly free. At petabyte base-table scale, the GSI's storage footprint is large enough that a naive "just add a GSI" for a new query surface can add hundreds of thousands of dollars per year in storage cost alone, enough to motivate moving the query surface out of the database entirely.

Canonical case: Segment's objects pipeline

Segment's objects pipeline stores ~958 billion items in a ~1 petabyte DynamoDB table at $0.25 / GB / month: roughly $250,000/month, or about $3M/year, of base-table storage. The team needed a secondary query surface ("items modified since timestamp T"; see concepts/changelog-as-secondary-index) for warehouse integrations. Verbatim: "Ideally, we could have easily achieved this using DynamoDB's Global Secondary Index which would minimally contain: an ID field which uniquely identifies a DynamoDB Item; a TimeStamp field for sorting and filtering. But due to the very large size of our table, creating a GSI for the table is not cost-efficient." (Source: sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb)
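The base-table figure above is simple arithmetic, and a short sketch makes it easy to rerun with other sizes or prices. It assumes decimal units (1 PB = 1,000,000 GB) and treats the quoted $0.25 / GB / month as a flat rate; the function name is illustrative.

```python
# Figures quoted in the text. Decimal units (1 PB = 1,000,000 GB) are an assumption.
TABLE_SIZE_GB = 1_000_000
PRICE_PER_GB_MONTH = 0.25  # USD, flat rate as quoted


def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Monthly storage bill in USD for a table of the given size."""
    return size_gb * price_per_gb_month


monthly = monthly_storage_cost(TABLE_SIZE_GB, PRICE_PER_GB_MONTH)
yearly = monthly * 12
print(f"${monthly:,.0f}/month, ${yearly:,.0f}/year")
```

With the quoted inputs this reproduces the $250,000/month and $3M/year figures.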

A GSI projecting (id, timestamp) from a 958-billion-item table of ~900-byte items would, even with a minimal projection, add a material fraction of base storage cost in the GSI structure itself, plus ongoing write-capacity cost as every base-table write fans out to the GSI.
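A rough sketch of the storage side of that estimate follows. The per-entry size is an assumption, not a figure from the source: it supposes about 50 bytes of projected key data (id, timestamp, attribute names) plus an assumed 100 bytes of per-item index overhead. Write-capacity cost is deliberately left out.

```python
ITEM_COUNT = 958_000_000_000      # ~958 billion items, from the text
PRICE_PER_GB_MONTH = 0.25         # USD, from the text
BASE_TABLE_MONTHLY_USD = 250_000  # from the text

# Hypothetical sizes, chosen only for illustration.
ASSUMED_KEY_BYTES = 50            # id + timestamp + attribute names
ASSUMED_OVERHEAD_BYTES = 100      # per-item index overhead
BYTES_PER_ENTRY = ASSUMED_KEY_BYTES + ASSUMED_OVERHEAD_BYTES


def gsi_monthly_storage_cost(item_count: int, bytes_per_entry: int,
                             price_per_gb_month: float) -> float:
    """Monthly GSI storage cost in USD (storage only, decimal GB)."""
    size_gb = item_count * bytes_per_entry / 1e9
    return size_gb * price_per_gb_month


gsi_monthly = gsi_monthly_storage_cost(ITEM_COUNT, BYTES_PER_ENTRY, PRICE_PER_GB_MONTH)
gsi_yearly = gsi_monthly * 12
fraction_of_base = gsi_monthly / BASE_TABLE_MONTHLY_USD
print(f"GSI storage: ${gsi_monthly:,.0f}/month, ${gsi_yearly:,.0f}/year "
      f"({fraction_of_base:.1%} of base-table storage)")
```

Under these assumed sizes the index alone lands in the hundreds of thousands of dollars per year; a different entry size scales the result linearly.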

Why the anti-pattern crystallises at petabyte scale

Below petabyte scale, the absolute storage cost of a GSI is small enough to be a rounding error on the engineering-time cost of running a second system to materialise the same query surface. Above petabyte scale, the sign flips: the storage cost of the GSI is larger than the engineering-time cost of running an external changelog store, so the economically rational answer is to externalise the index.
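The sign flip can be sketched as a break-even calculation. Every number here except the $0.25 / GB / month price and the 958-billion-item count is hypothetical: the 150-byte index entry and the $10,000/month all-in cost of running an external changelog store (infrastructure plus engineering time) are placeholders to be replaced with real estimates.

```python
PRICE_PER_GB_MONTH = 0.25          # USD, from the text
ASSUMED_BYTES_PER_ENTRY = 150      # hypothetical index entry size
ASSUMED_EXTERNAL_MONTHLY = 10_000  # hypothetical cost of the external system, USD


def gsi_monthly_cost(item_count: float) -> float:
    """Storage-only GSI cost in USD per month for the given item count."""
    return item_count * ASSUMED_BYTES_PER_ENTRY / 1e9 * PRICE_PER_GB_MONTH


def break_even_items() -> float:
    """Item count at which GSI storage cost equals the external-system cost."""
    cost_per_item = ASSUMED_BYTES_PER_ENTRY / 1e9 * PRICE_PER_GB_MONTH
    return ASSUMED_EXTERNAL_MONTHLY / cost_per_item


def prefer_external_index(item_count: float) -> bool:
    """True when the in-database index costs more than the external store."""
    return gsi_monthly_cost(item_count) > ASSUMED_EXTERNAL_MONTHLY


print(f"break-even at ~{break_even_items() / 1e9:,.0f} billion items")
for items in (1e9, 958e9):
    print(f"{items:.0e} items -> externalise: {prefer_external_index(items)}")
```

With these placeholder inputs a billion-item table stays on the GSI side of the threshold while a table at Segment's scale falls well past it. The exact crossover depends entirely on the assumed costs.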

Segment's actual answer — V1 externalised to Bigtable, V2 externalised to S3 — demonstrates both the generic shape (externalise the index) and the inner second-order trade-off (which externalised-index substrate to pick, driven by cross-cloud cost and access-pattern fit). See patterns/object-store-as-cdc-log-store.

Generalisation beyond DynamoDB GSI

The shape generalises to any per-byte-priced secondary index in any database:

  • DynamoDB GSIs / LSIs.
  • RDS / Aurora secondary indexes (storage is smaller proportionally because row compression is tighter, but the same economic axis exists).
  • Cassandra secondary indexes + materialised views.
  • MongoDB secondary indexes.

The anti-pattern name is petabyte-scale-specific because that's where the sign flips in practice — but the underlying principle "externalise a secondary index when its per-byte storage cost exceeds the external-system operational cost" is general. Below the sign-flip threshold the in-database index wins; above it, the externalised changelog / search-index / materialised-view wins.

Seen in
