PATTERN Cited by 2 sources
Bucketed event-time partitioning¶
Problem¶
Persisting high-throughput time-series events keyed by an identifier (counter, user, device, sensor) in Cassandra (or any partitioned wide-column store) naturally concentrates all of a given identifier's events in one partition. At high event rates that partition becomes a wide partition — slow reads, read-repair pain, compaction pressure, memory footprint blowup on scans.
Pattern¶
Split each identifier's event partition into explicit buckets
via schema columns derived from event time + a hash. Events for the
same identifier spread across
(identifier, time_bucket, event_bucket) partitions; reads issue
parallel range scans across the buckets they need.
Canonical schema shape (Netflix TimeSeries Abstraction):
PRIMARY KEY (
  (identifier, time_bucket, event_bucket),  -- partition key
  event_time,                               -- clustering columns
  event_id,
  event_item_key
)
time_bucket slots events into a coarse time window (e.g.
10 minutes). event_bucket hash-spreads within a time window to a
configurable number of sub-partitions per identifier.
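The bucketing arithmetic is simple to sketch. A minimal Python illustration of deriving the partition key — the function name and the choice of hash input (identifier + exact timestamp) are assumptions for illustration, not Netflix's actual implementation:

```python
import hashlib

def partition_key(identifier: str, event_time_s: int,
                  seconds_per_bucket: int = 600,
                  buckets_per_id: int = 4) -> tuple:
    """Derive the (identifier, time_bucket, event_bucket) partition key.

    time_bucket: event time floored to a coarse window index.
    event_bucket: stable hash spread over buckets_per_id sub-partitions,
    so a hot identifier's events fan out within each time window.
    """
    time_bucket = event_time_s // seconds_per_bucket
    digest = hashlib.md5(f"{identifier}:{event_time_s}".encode()).digest()
    event_bucket = int.from_bytes(digest[:4], "big") % buckets_per_id
    return (identifier, time_bucket, event_bucket)
```

The key property is determinism: the same event always maps to the same partition, so writers need no coordination to agree on placement.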
Control-plane tuning¶
Three dials, typically per-namespace:
- seconds_per_bucket — width of a time bucket. Smaller for low cardinality (keeps partitions small); larger for high cardinality (fewer partitions to scan on aggregation).
- buckets_per_id — number of event-bucket hash partitions per identifier per time bucket. Higher for high-throughput single identifiers; lower when throughput is spread across identifiers.
- seconds_per_slice — width of a time-slice table (e.g. 1 day). Slices govern table-level TTL / compaction and the boundary beyond which an older table can be dropped wholesale.
Netflix's low-cardinality example config: buckets_per_id: 4,
seconds_per_bucket: 600 (10 min), seconds_per_slice: 86400 (1
day).
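At that config, the read-side fan-out is easy to derive: a range read must touch every time bucket its range overlaps, times buckets_per_id event buckets. A sketch (the function name and signature are illustrative, not from the source):

```python
def buckets_to_scan(identifier: str, start_s: int, end_s: int,
                    seconds_per_bucket: int = 600,
                    buckets_per_id: int = 4) -> list[tuple]:
    """Enumerate every (identifier, time_bucket, event_bucket) partition
    a read over the half-open range [start_s, end_s) must touch.
    Each tuple corresponds to one parallel range scan."""
    first = start_s // seconds_per_bucket
    last = (end_s - 1) // seconds_per_bucket
    return [(identifier, tb, eb)
            for tb in range(first, last + 1)
            for eb in range(buckets_per_id)]
```

A one-hour read at the example config touches 6 time buckets × 4 event buckets = 24 partitions — the fan-out cost the Cons below refer to.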
Trade-offs¶
Pros:
- Prevents per-identifier wide partitions. Each partition stays small + efficiently compactable.
- Enables parallel range-scan aggregation — a rollup batch can query all buckets for an identifier concurrently.
- Per-namespace tuning lets hot + cold workloads coexist on the same physical cluster.
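The parallel-aggregation upside can be sketched with a thread pool issuing one range read per bucket partition — read_partition here is a hypothetical stand-in for the actual store read, not Netflix's code:

```python
from concurrent.futures import ThreadPoolExecutor

def rollup(partitions: list[tuple], read_partition) -> int:
    """Aggregate one identifier's events by scanning all of its
    (identifier, time_bucket, event_bucket) partitions concurrently.
    read_partition(partition) stands in for a store range read and is
    assumed to return a list of numeric event values."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = pool.map(read_partition, partitions)
    return sum(v for part in results for v in part)
```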
Cons:
- Reads must scan N partitions per identifier instead of 1. Good aggregation query planners amortise this via parallelism + adaptive batch sizing, but the worst-case concurrent partition read count can overwhelm the store under bursty load.
- Schema coupling — the right buckets_per_id depends on throughput + cardinality. Miscalibration shows up as either still-wide partitions (too few buckets) or unnecessary read fan-out (too many buckets).
- Range-query cost scales with bucket count for full-range scans (e.g. historical backfill).
Netflix's Counter-Rollup tier uses dynamic batching to match the bucket count the batch will scan to the cardinality of counters in the batch, preventing the underlying store from seeing a thundering-herd of parallel range reads.
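A minimal sketch of that dynamic-batching idea, assuming a fixed per-read time-bucket count and a hypothetical max_parallel_scans budget (the real Counter-Rollup tier adapts batch size to observed cardinality at runtime):

```python
def dynamic_batches(identifiers: list, buckets_per_id: int = 4,
                    time_buckets_per_read: int = 6,
                    max_parallel_scans: int = 48) -> list[list]:
    """Group identifiers into batches so that each batch issues at most
    max_parallel_scans concurrent partition range reads, capping the
    instantaneous read fan-out the store sees."""
    scans_per_id = buckets_per_id * time_buckets_per_read
    per_batch = max(1, max_parallel_scans // scans_per_id)
    return [identifiers[i:i + per_batch]
            for i in range(0, len(identifiers), per_batch)]
```

With the example numbers, each identifier costs 24 scans, so batches hold 2 identifiers each and the store never sees more than 48 in-flight range reads per batch.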
When to use¶
- High-throughput time-series events partitioned by a natural identifier key.
- Cassandra / wide-column / partitioned storage where per-key partition-size limits are a hard operational concern.
- Aggregation workloads that can query ranges in parallel.
When NOT to use¶
- Low-throughput per-identifier workloads — one partition per identifier is fine, the scheme adds needless complexity.
- Stores that already shard-and-compact horizontally inside a partition (some LSM stores do; pure Cassandra doesn't).
Seen in¶
- sources/2024-11-13-netflix-netflixs-distributed-counter-abstraction — canonical wiki instance. time_bucket + event_bucket columns in the TimeSeries schema; per-namespace control-plane tuning; Counter-Rollup uses parallel range scans across buckets; dynamic batching to prevent parallel-read thundering herd.
- sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer — KV DAL's transparent-chunking story is a sibling anti-wide-partition pattern, splitting large values into chunks rather than events-in-time into buckets.