GSI cost anti-pattern at petabyte scale¶
A DynamoDB Global Secondary Index (GSI) stores a projection of the base table keyed on a different attribute set, and is billed separately for both storage and write capacity. On small tables this is ergonomic and nearly free; at petabyte base-table scale, the GSI's storage footprint is material enough that a naive "just add a GSI" for a new query surface can add hundreds of thousands of dollars per year of pure storage cost — enough to motivate moving the query surface out of the database entirely.
Canonical case: Segment's objects pipeline¶
Segment's objects pipeline stores ~958 billion items in a ~1 PB DynamoDB table at $0.25/GB/month — roughly $250,000/month ≈ $3M/year of base-table storage. The team needed a secondary query surface ("items modified since timestamp T" — see concepts/changelog-as-secondary-index) for warehouse integrations. Verbatim: "Ideally, we could have easily achieved this using DynamoDB's Global Secondary Index which would minimally contain: an ID field which uniquely identifies a DynamoDB Item; a TimeStamp field for sorting and filtering. But due to the very large size of our table, creating a GSI for the table is not cost-efficient." (Source: sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb)
A GSI projecting (id, timestamp) from a 958-billion-item table of ~900-byte items — even with a minimal projection — would add a material fraction of base storage cost in the GSI structure itself, plus ongoing write-capacity cost as every base-table write fans out to the GSI.
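A back-of-envelope check makes "material fraction" concrete. The item count and $/GB price are from the source; the per-index-item size is an assumption (key attribute values plus attribute names plus DynamoDB's per-index-item overhead), so treat the result as order-of-magnitude only:

```python
# Back-of-envelope GSI storage cost for Segment's table.
# ITEMS and PRICE are from the source; ASSUMED_INDEX_ITEM_BYTES is a guess
# covering id + timestamp values, attribute names, and index-item overhead.
ITEMS = 958e9                    # ~958 billion items (source)
PRICE_PER_GB_MONTH = 0.25        # DynamoDB standard storage (source)
ASSUMED_INDEX_ITEM_BYTES = 150   # assumption, minimal (id, timestamp) projection

gsi_gb = ITEMS * ASSUMED_INDEX_ITEM_BYTES / 1e9
monthly = gsi_gb * PRICE_PER_GB_MONTH
print(f"GSI storage ~{gsi_gb / 1e6:.2f} PB: "
      f"~${monthly:,.0f}/month, ~${monthly * 12:,.0f}/year")
```

Even under this minimal-projection assumption the GSI lands in the hundreds of thousands of dollars per year of pure storage — before counting the doubled write capacity.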
Why the anti-pattern crystallises at petabyte scale¶
Below petabyte scale, the absolute storage cost of a GSI is small enough to be a rounding error on the engineering-time cost of running a second system to materialise the same query surface. Above petabyte scale, the sign flips: the storage cost of the GSI is larger than the engineering-time cost of running an external changelog store, so the economically rational answer is to externalise the index.
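The sign-flip argument reduces to a one-line comparison. A minimal sketch (all figures are hypothetical annual USD costs, not from the source):

```python
# Sketch of the sign-flip decision rule described above.
def should_externalise(gsi_storage_cost: float,
                       gsi_write_fanout_cost: float,
                       external_store_cost: float) -> bool:
    """True when the GSI's recurring cost exceeds the all-in cost
    (infrastructure + engineering time) of an external changelog store."""
    return gsi_storage_cost + gsi_write_fanout_cost > external_store_cost

# Small table: the GSI is a rounding error next to running a second system.
print(should_externalise(2_000, 1_000, 150_000))      # False -> keep the GSI
# Petabyte table: GSI storage alone dominates; externalise the index.
print(should_externalise(430_000, 100_000, 150_000))  # True  -> externalise
```

The hard part in practice is not the comparison but estimating the right-hand side honestly: the external store's cost includes the engineering time to build and operate it, not just its infrastructure bill.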
Segment's actual answer — V1 externalised to Bigtable, V2 externalised to S3 — demonstrates both the generic shape (externalise the index) and the inner second-order trade-off (which externalised-index substrate to pick, driven by cross-cloud cost and access-pattern fit). See patterns/object-store-as-cdc-log-store.
Generalisation beyond DynamoDB GSI¶
The shape generalises to any per-byte-priced secondary index in any database:
- DynamoDB GSIs / LSIs.
- RDS / Aurora secondary indexes (proportionally smaller storage because row compression is tighter, but the same economic axis exists).
- Cassandra secondary indexes + materialised views.
- MongoDB secondary indexes.
The anti-pattern name is petabyte-scale-specific because that's where the sign flips in practice — but the underlying principle "externalise a secondary index when its per-byte storage cost exceeds the external-system operational cost" is general. Below the sign-flip threshold the in-database index wins; above it, the externalised changelog / search-index / materialised-view wins.
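The threshold can be solved for directly: the sign flips at the index size where per-byte storage cost equals the external system's operational cost. A sketch at DynamoDB's published storage price, with a hypothetical operational cost on the right-hand side:

```python
# Solve for the break-even index size: the point where per-byte index storage
# cost equals the external system's annual operational cost.
PRICE_PER_GB_YEAR = 0.25 * 12          # $3/GB-year (DynamoDB standard storage)
assumed_ops_cost_per_year = 150_000    # hypothetical all-in external-store cost

breakeven_gb = assumed_ops_cost_per_year / PRICE_PER_GB_YEAR
print(f"break-even index size: {breakeven_gb / 1e3:.0f} TB")  # 50 TB
```

Under these assumptions any per-byte-priced index past the tens-of-terabytes mark is a candidate for externalisation — which is why the anti-pattern only crystallises at very large table sizes.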
Seen in¶
- sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb — canonical disclosure, with numbers: ~1 PB base table, 958 B items, $0.25/GB/month, and an explicit verbatim rejection of a GSI on cost grounds.