PATTERN Cited by 2 sources
Bucketed event-time partitioning¶
Problem¶
Persisting high-throughput time-series events keyed by an identifier (counter, user, device, sensor) in Cassandra (or any partitioned wide-column store) naturally concentrates all of a given identifier's events in one partition. At high event rates that partition becomes a wide partition — slow reads, read-repair pain, compaction pressure, memory footprint blowup on scans.
Pattern¶
Split each identifier's event partition into explicit buckets
via schema columns derived from event time + a hash. Events for the
same identifier spread across
(identifier, time_bucket, event_bucket) partitions; reads issue
parallel range scans across the buckets they need.
Canonical schema shape (Netflix TimeSeries Abstraction):
PRIMARY KEY (
  (identifier, time_bucket, event_bucket),  -- partition key
  event_time,                               -- clustering columns
  event_id,
  event_item_key
)
time_bucket slots events into a coarse time window (e.g.
10 minutes). event_bucket hash-spreads within a time window to a
configurable number of sub-partitions per identifier.
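The bucketing arithmetic is simple to sketch. A minimal Python illustration of deriving the partition key — the function name and the choice of hash input (identifier + exact timestamp) are assumptions for illustration, not Netflix's actual implementation:

```python
import hashlib

def partition_key(identifier: str, event_time_s: int,
                  seconds_per_bucket: int = 600,
                  buckets_per_id: int = 4) -> tuple:
    """Derive the (identifier, time_bucket, event_bucket) partition key.

    time_bucket: event time floored to a coarse window index.
    event_bucket: stable hash spread over buckets_per_id sub-partitions,
    so a hot identifier's events fan out within each time window.
    """
    time_bucket = event_time_s // seconds_per_bucket
    digest = hashlib.md5(f"{identifier}:{event_time_s}".encode()).digest()
    event_bucket = int.from_bytes(digest[:4], "big") % buckets_per_id
    return (identifier, time_bucket, event_bucket)
```

The key property is determinism: the same event always maps to the same partition, so writers need no coordination to agree on placement.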
Control-plane tuning¶
Three dials, typically per-namespace:
- seconds_per_bucket — width of a time bucket. Smaller for low cardinality (keeps partitions small); larger for high cardinality (fewer partitions to scan on aggregation).
- buckets_per_id — number of event-bucket hash partitions per identifier per time bucket. Higher for high-throughput single identifiers; lower when throughput is spread across identifiers.
- seconds_per_slice — width of a time-slice table (e.g. 1 day). Slices govern table-level TTL / compaction and the boundary beyond which an older table can be dropped wholesale.
Netflix's low-cardinality example config: buckets_per_id: 4,
seconds_per_bucket: 600 (10 min), seconds_per_slice: 86400 (1
day).
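At that config, the read-side fan-out is easy to derive: a range read must touch every time bucket its range overlaps, times buckets_per_id event buckets. A sketch (the function name and signature are illustrative, not from the source):

```python
def buckets_to_scan(identifier: str, start_s: int, end_s: int,
                    seconds_per_bucket: int = 600,
                    buckets_per_id: int = 4) -> list[tuple]:
    """Enumerate every (identifier, time_bucket, event_bucket) partition
    a read over the half-open range [start_s, end_s) must touch.
    Each tuple corresponds to one parallel range scan."""
    first = start_s // seconds_per_bucket
    last = (end_s - 1) // seconds_per_bucket
    return [(identifier, tb, eb)
            for tb in range(first, last + 1)
            for eb in range(buckets_per_id)]
```

A one-hour read at the example config touches 6 time buckets × 4 event buckets = 24 partitions — the fan-out cost the Cons below refer to.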
Trade-offs¶
Pros:
- Prevents per-identifier wide partitions. Each partition stays small + efficiently compactable.
- Enables parallel range-scan aggregation — a rollup batch can query all buckets for an identifier concurrently.
- Per-namespace tuning lets hot + cold workloads coexist on the same physical cluster.
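The parallel-aggregation upside can be sketched with a thread pool issuing one range read per bucket partition — read_partition here is a hypothetical stand-in for the actual store read, not Netflix's code:

```python
from concurrent.futures import ThreadPoolExecutor

def rollup(partitions: list[tuple], read_partition) -> int:
    """Aggregate one identifier's events by scanning all of its
    (identifier, time_bucket, event_bucket) partitions concurrently.
    read_partition(partition) stands in for a store range read and is
    assumed to return a list of numeric event values."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = pool.map(read_partition, partitions)
    return sum(v for part in results for v in part)
```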
Cons:
- Reads must scan N partitions per identifier instead of 1. Good aggregation query planners amortise this via parallelism + adaptive batch sizing, but the worst-case concurrent partition read count can overwhelm the store under bursty load.
- Schema coupling — the right buckets_per_id depends on throughput + cardinality. Miscalibration shows up as either still-wide partitions (too few buckets) or unnecessary read fan-out (too many buckets).
- Range-query cost scales with bucket count for full-range scans (e.g. historical backfill).
Netflix's Counter-Rollup tier uses dynamic batching to match the bucket count the batch will scan to the cardinality of counters in the batch, preventing the underlying store from seeing a thundering-herd of parallel range reads.
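A minimal sketch of that dynamic-batching idea, assuming a fixed per-read time-bucket count and a hypothetical max_parallel_scans budget (the real Counter-Rollup tier adapts batch size to observed cardinality at runtime):

```python
def dynamic_batches(identifiers: list, buckets_per_id: int = 4,
                    time_buckets_per_read: int = 6,
                    max_parallel_scans: int = 48) -> list[list]:
    """Group identifiers into batches so that each batch issues at most
    max_parallel_scans concurrent partition range reads, capping the
    instantaneous read fan-out the store sees."""
    scans_per_id = buckets_per_id * time_buckets_per_read
    per_batch = max(1, max_parallel_scans // scans_per_id)
    return [identifiers[i:i + per_batch]
            for i in range(0, len(identifiers), per_batch)]
```

With the example numbers, each identifier costs 24 scans, so batches hold 2 identifiers each and the store never sees more than 48 in-flight range reads per batch.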
When to use¶
- High-throughput time-series events partitioned by a natural identifier key.
- Cassandra / wide-column / partitioned storage where per-key partition-size limits are a hard operational concern.
- Aggregation workloads that can query ranges in parallel.
When NOT to use¶
- Low-throughput per-identifier workloads — one partition per identifier is fine, the scheme adds needless complexity.
- Stores that already shard-and-compact horizontally inside a partition (some LSM stores do; pure Cassandra doesn't).
Seen in¶
- sources/2024-11-13-netflix-netflixs-distributed-counter-abstraction — canonical wiki instance. time_bucket + event_bucket columns in the TimeSeries schema; per-namespace control-plane tuning; Counter-Rollup uses parallel range scans across buckets; dynamic batching to prevent parallel-read thundering herd.
- sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer — KV DAL's transparent-chunking story is a sibling anti-wide-partition pattern, splitting large values into chunks rather than events-in-time into buckets.