
PATTERN

Sliding-window rollup aggregation

Problem

A service needs to serve an aggregate metric (count, sum, histogram) over an ever-growing event log with low read latency, high write throughput, and preserved event-level auditability. Scanning the full event log on every read is too slow; aggregating at write time loses auditability and the ability to recount, and concentrates load on hot keys; pre-aggregating via a stream processor complicates retention and reset semantics and brings partition-rebalancing pain.

Pattern

Continuously aggregate the event log in the background within a sliding window, checkpointing the result. Reads serve the checkpoint — accepting some seconds of staleness — and writes emit a lightweight rollup event telling the background aggregator which key needs re-aggregation.

Structure:

  1. Event store — each mutation is persisted as an individual event with a composite idempotency key + a time-partitioned schema (see patterns/bucketed-event-time-partitioning).
  2. Rollup store — a checkpoint per aggregated entity holding (lastRollupValue, lastRollupTs). Updated by the rollup pipeline.
  3. Rollup cache — optional fast-path for point-read latency; combines lastRollupValue + lastRollupTs in a single cached value to prevent mismatch.
  4. Rollup pipeline — a tier of worker instances, each with an in-memory queue. Rollup events from the write and read paths land in these queues (see patterns/fire-and-forget-rollup-trigger). Workers batch them, query the event store for events in the immutable window (lastRollupTs, now() - acceptLimit], aggregate, write the new checkpoint, and refresh the cache.
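The four components above can be sketched as a single worker batch pass. This is a minimal illustration, not Netflix's implementation; `queue`, `event_store`, `rollup_store`, and `cache` are hypothetical interfaces standing in for the components described, and `ACCEPT_LIMIT_S` is an assumed value:

```python
import time

ACCEPT_LIMIT_S = 5  # assumed safe margin behind now(); events newer than this are still mutable

def run_rollup_batch(queue, event_store, rollup_store, cache):
    """Drain one batch of rollup events and re-aggregate each touched key."""
    # Deduplicate: many rollup events for one key collapse into a single pass.
    keys = {ev["key"] for ev in queue.drain_batch()}
    window_end = time.time() - ACCEPT_LIMIT_S  # upper bound of the immutable window
    for key in keys:
        last_value, last_ts = rollup_store.get(key)          # (lastRollupValue, lastRollupTs)
        events = event_store.scan(key, after=last_ts, until=window_end)
        new_value = last_value + sum(e["delta"] for e in events)
        rollup_store.put(key, new_value, window_end)         # new checkpoint
        cache.set(key, (new_value, window_end))              # value + ts cached together
```

Caching the value and timestamp as one unit mirrors point 3: a reader can never observe a `lastRollupValue` paired with the wrong `lastRollupTs`.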

Immutable aggregation window is load-bearing

The pipeline aggregates events older than a safe margin behind now() so the window is by construction no longer receiving writes. This is the immutable aggregation window property. Consequences:

  • No distributed locking. Multiple rollup workers can aggregate the same key concurrently — they see the same events and compute the same value.
  • Safe horizontal scaling. Rebalance + rolling deploy of the rollup tier is safe; temporary duplication of work is correctness-preserving.
  • Bounded staleness. Reads lag real time by acceptLimit plus the time since the last rollup, typically seconds.
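The no-locking property follows from determinism over a frozen window: any two workers that compute the same window bounds see the same events and produce the same value. A minimal sketch, with hypothetical names and an assumed `ACCEPT_LIMIT_S`:

```python
import time

ACCEPT_LIMIT_S = 5  # assumed: events newer than this may still arrive, so exclude them

def immutable_window(last_rollup_ts, now=None):
    """Return the (exclusive, inclusive] time range that is safe to aggregate.

    Everything at or before now() - ACCEPT_LIMIT_S is, by construction, no
    longer receiving writes, so the range's contents are frozen.
    """
    now = time.time() if now is None else now
    return last_rollup_ts, now - ACCEPT_LIMIT_S

def aggregate(events, last_value, window):
    """Fold the frozen window onto the previous checkpoint value."""
    lo, hi = window
    return last_value + sum(e["delta"] for e in events if lo < e["ts"] <= hi)
```

Because `aggregate` is a pure function of a frozen input, concurrent duplicate runs are wasted work but never wrong answers.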

For sub-second accuracy, layer a real-time delta on top: current = lastRollupValue + aggregate(events from lastRollupTs to now). Effective when the key is actively accessed so the delta window stays narrow. Netflix flags this as experimental in the Counter post.
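The delta-layer read above (current = lastRollupValue + aggregate of the tail since lastRollupTs) can be sketched as follows; `rollup_store` and `event_store` are the same hypothetical interfaces, not a real API:

```python
import time

def read_current(key, rollup_store, event_store):
    """Real-time read: checkpoint plus a live delta over the un-rolled-up tail.

    The tail scan is cheap only while (now - lastRollupTs) stays small,
    i.e. while the key is actively accessed and in rollup circulation.
    """
    last_value, last_ts = rollup_store.get(key)          # (lastRollupValue, lastRollupTs)
    tail = event_store.scan(key, after=last_ts, until=time.time())
    return last_value + sum(e["delta"] for e in tail)
```

Note the tail may include mutable, still-arriving events, which is why this path trades some recount stability for freshness, consistent with it being flagged experimental.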

Drain-vs-circulate discipline

Keys with different access patterns need different rollup policies:

  • Low-cardinality + frequently-accessed — keep in constant rollup circulation so lastRollupTs stays close to now(). This prevents the next read from having to scan many time partitions.
  • High-cardinality + sparse-access — drain out of circulation once the last-write-timestamp stops advancing; otherwise the aggregator's in-memory queues grow without bound.

A last-write-timestamp column on the Rollup Store (written via Cassandra USING TIMESTAMP / LWW) discriminates between the two. Netflix's pipeline inspects it to decide whether to re-queue a counter.
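The re-queue decision reduces to comparing the last-write-timestamp against an idle threshold. A minimal sketch; `IDLE_DRAIN_S` is an assumed policy value, not one from the source:

```python
import time

IDLE_DRAIN_S = 300  # assumed: drain keys that have seen no writes for 5 minutes

def should_requeue(last_write_ts, now=None):
    """Decide whether to keep a key in rollup circulation.

    Keys whose last-write-timestamp keeps advancing stay circulating, so
    lastRollupTs tracks now(); idle keys drain, keeping in-memory queues bounded.
    """
    now = time.time() if now is None else now
    return (now - last_write_ts) < IDLE_DRAIN_S
```

A drained key is not lost: its next write or read emits a fresh rollup event, which puts it back into circulation.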

When to use

  • Metrics that need accuracy + auditability, not just approximate counts.
  • Write throughput too high for per-key CAS or locking.
  • Read latency needs a point-read, not a scan.
  • Bounded staleness (seconds) is acceptable.

When NOT to use

  • Real-time (sub-second) accuracy is a hard requirement and you don't want the complexity of a real-time delta layer.
  • Hard durability + consistency + real-time atomic semantics (e.g. financial ledgers with strict external ordering) — prefer an in-place counter with CAS + conditional writes.
  • Only approximate counts are needed — a simpler in-memory or cache-only best-effort counter is cheaper (see concepts/best-effort-vs-eventually-consistent-counter).

Seen in
