PATTERN Cited by 1 source
Sliding-window rollup aggregation¶
Problem¶
A service needs to serve an aggregate metric (count, sum, histogram) over an ever-growing event log with low read latency, high write throughput, and preserved event-level auditability. Scanning the full event log on every read is too slow; aggregating at write time loses auditability and the ability to recount, and concentrates pressure on hot keys; pre-aggregating via a stream processor sacrifices retention/reset semantics and brings partition-rebalancing pain.
Pattern¶
Continuously aggregate the event log in the background within a sliding window, checkpointing the result. Reads serve the checkpoint — accepting some seconds of staleness — and writes emit a lightweight rollup event telling the background aggregator which key needs re-aggregation.
Structure:
- Event store — each mutation is persisted as an individual event with a composite idempotency key and a time-partitioned schema (see patterns/bucketed-event-time-partitioning).
- Rollup store — a checkpoint per aggregated entity holding `(lastRollupValue, lastRollupTs)`. Updated by the rollup pipeline.
- Rollup cache — optional fast path for point-read latency; combines `lastRollupValue` + `lastRollupTs` in a single cached value to prevent mismatch.
- Rollup pipeline — a tier of worker instances running in-memory per-instance queues. Rollup events from the write and read paths land in the queues (see patterns/fire-and-forget-rollup-trigger). Workers batch, query the event store for events in the immutable window `(lastRollupTs, now() - acceptLimit)`, aggregate, and write the new checkpoint plus refresh the cache.
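The worker loop above can be sketched as follows. This is an illustrative outline, not Netflix's implementation: `event_store`, `rollup_store`, `cache`, their method names, and the `ACCEPT_LIMIT_S` margin are all hypothetical.

```python
import time

ACCEPT_LIMIT_S = 5.0  # hypothetical safe margin; events newer than this may still arrive


def rollup_once(key, event_store, rollup_store, cache):
    """One rollup pass for a single key (sketch with hypothetical store APIs)."""
    last_value, last_ts = rollup_store.get(key)  # current checkpoint
    window_end = time.time() - ACCEPT_LIMIT_S    # immutable window boundary
    if window_end <= last_ts:
        return last_value                        # nothing new to fold in yet
    # Only events in the immutable window (last_ts, window_end] are aggregated.
    events = event_store.query(key, after=last_ts, until=window_end)
    new_value = last_value + sum(e.delta for e in events)
    rollup_store.put(key, (new_value, window_end))  # write new checkpoint
    cache.set(key, (new_value, window_end))         # refresh value+ts together
    return new_value
```

Checkpoint and cache are written with the value and timestamp as one unit, matching the mismatch-prevention note above.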
Immutable aggregation window is load-bearing¶
The pipeline aggregates only events older than a safe margin behind `now()`, so the window is by construction no longer receiving writes. This is the immutable aggregation window property. Consequences:
- No distributed locking. Multiple rollup workers can aggregate the same key concurrently — they see the same events and compute the same value.
- Safe horizontal scaling. Rebalance and rolling deploys of the rollup tier are safe; temporary duplication of work is correctness-preserving.
- Bounded staleness. Reads lag real time by `acceptLimit + margin`, typically seconds.
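The lock-freedom consequence follows from determinism: aggregation over a fixed, no-longer-written window is a pure function of its inputs, so concurrent workers cannot disagree. A minimal sketch (hypothetical event tuples of `(delta, ts)`):

```python
def aggregate_window(events, last_value, last_ts, window_end):
    """Fold the immutable window (last_ts, window_end] into the checkpoint.

    Deterministic: any worker given the same inputs computes the same output,
    so concurrent rollups of the same key need no distributed lock.
    """
    in_window = [e for e in events if last_ts < e[1] <= window_end]  # (delta, ts)
    return last_value + sum(delta for delta, _ in in_window), window_end


events = [(1, 10.0), (4, 12.0), (2, 30.0)]  # ts=30.0 lies beyond the window
worker_a = aggregate_window(events, 7, 9.0, 20.0)
worker_b = aggregate_window(events, 7, 9.0, 20.0)
# Both workers write the identical checkpoint; duplicated work is harmless.
```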
For sub-second accuracy, layer a real-time delta on top: `current = lastRollupValue + aggregate(events from lastRollupTs to now)`. This is effective when the key is actively accessed, so the delta window stays narrow. Netflix flags this as experimental in the Counter post.
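The delta-layered read path might look like the sketch below, assuming the same hypothetical `rollup_store` and `event_store` interfaces as elsewhere on this page:

```python
import time


def read_with_delta(key, rollup_store, event_store):
    """Point read with a real-time delta layered on the checkpoint.

    Sketch only; Netflix flags the approach as experimental. Cheap when the
    key is hot, because lastRollupTs then tracks close to now and the scanned
    delta window stays narrow.
    """
    last_value, last_ts = rollup_store.get(key)
    # Scan only the events newer than the checkpoint.
    delta_events = event_store.query(key, after=last_ts, until=time.time())
    return last_value + sum(e.delta for e in delta_events)
```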
Drain-vs-circulate discipline¶
Keys with different access patterns need different rollup policies:
- Low-cardinality + frequently accessed — keep in constant rollup circulation so `lastRollupTs` stays close to `now`. This prevents the next-access read from having to scan many time partitions.
- High-cardinality + sparse access — drain out of circulation once `last-write-timestamp` stops advancing. Otherwise the aggregator's in-memory queues grow without bound.
A `last-write-timestamp` column on the Rollup Store (written via Cassandra `USING TIMESTAMP` / LWW) discriminates between the two. Netflix's pipeline inspects it to decide whether to re-queue a counter.
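The re-queue decision reduces to a staleness check on that column. A minimal sketch, assuming a hypothetical `last_write_ts` accessor and an idle threshold of my own choosing:

```python
import time

IDLE_DRAIN_S = 600.0  # hypothetical threshold: drain keys idle this long


def should_requeue(key, rollup_store):
    """Re-queue hot keys for continued circulation; drain idle ones.

    Draining keys whose last-write-timestamp has stopped advancing keeps
    the aggregator's in-memory queues bounded.
    """
    last_write_ts = rollup_store.last_write_ts(key)  # hypothetical LWW-column read
    return (time.time() - last_write_ts) < IDLE_DRAIN_S
```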
When to use¶
- Metrics that need accuracy + auditability, not just approximate counts.
- Write throughput too high for per-key CAS or locking.
- Read latency needs a point-read, not a scan.
- Bounded staleness (seconds) is acceptable.
When NOT to use¶
- Real-time (sub-second) accuracy is a hard requirement and you don't want the complexity of a real-time delta layer.
- Hard durability + consistency + real-time atomic semantics (e.g. financial ledgers with strict external ordering) — prefer an in-place counter with CAS + conditional writes.
- Only approximate counts needed — a simpler in-memory / cache-only Best-Effort counter is cheaper (see concepts/best-effort-vs-eventually-consistent-counter).
Seen in¶
- sources/2024-11-13-netflix-netflixs-distributed-counter-abstraction — canonical wiki instance. TimeSeries as event store; Cassandra-backed Rollup Store keyed by namespace; EVCache rollup cache; Counter-Rollup server tier with in-memory queues; XXHash routing; Set-based coalescing; adaptive back-pressure; last-write-timestamp as drain-vs-circulate signal.