
PATTERN

Sliding-window rollup aggregation

Problem

A service needs to serve an aggregate metric (count, sum, histogram) over an ever-growing event log with low read latency, high write throughput, and preserved event-level auditability. Scanning the full event log on every read is too slow; aggregating at write time loses auditability and the ability to recount, and concentrates load on hot keys; pre-aggregating via a stream processor complicates retention and reset semantics and brings partition-rebalancing pain.

Pattern

Continuously aggregate the event log in the background within a sliding window, checkpointing the result. Reads serve the checkpoint — accepting some seconds of staleness — and writes emit a lightweight rollup event telling the background aggregator which key needs re-aggregation.

Structure:

  1. Event store — each mutation is persisted as an individual event with a composite idempotency key + a time-partitioned schema (see patterns/bucketed-event-time-partitioning).
  2. Rollup store — a checkpoint per aggregated entity holding (lastRollupValue, lastRollupTs). Updated by the rollup pipeline.
  3. Rollup cache — optional fast-path for point-read latency; combines lastRollupValue + lastRollupTs in a single cached value to prevent mismatch.
  4. Rollup pipeline — a tier of worker instances, each with an in-memory queue. Rollup events from the write and read paths land in these queues (see patterns/fire-and-forget-rollup-trigger). Workers batch them, query the event store for events in the immutable window (lastRollupTs, now() - acceptLimit], aggregate, write the new checkpoint, and refresh the cache.
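The four components above can be sketched as a single worker batch pass. This is a minimal illustration, not Netflix's implementation; `queue`, `event_store`, `rollup_store`, and `cache` are hypothetical interfaces standing in for the components described, and `ACCEPT_LIMIT_S` is an assumed value:

```python
import time

ACCEPT_LIMIT_S = 5  # assumed safe margin behind now(); events newer than this are still mutable

def run_rollup_batch(queue, event_store, rollup_store, cache):
    """Drain one batch of rollup events and re-aggregate each touched key."""
    # Deduplicate: many rollup events for one key collapse into a single pass.
    keys = {ev["key"] for ev in queue.drain_batch()}
    window_end = time.time() - ACCEPT_LIMIT_S  # upper bound of the immutable window
    for key in keys:
        last_value, last_ts = rollup_store.get(key)          # (lastRollupValue, lastRollupTs)
        events = event_store.scan(key, after=last_ts, until=window_end)
        new_value = last_value + sum(e["delta"] for e in events)
        rollup_store.put(key, new_value, window_end)         # new checkpoint
        cache.set(key, (new_value, window_end))              # value + ts cached together
```

Caching the value and timestamp as one unit mirrors point 3: a reader can never observe a `lastRollupValue` paired with the wrong `lastRollupTs`.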

Immutable aggregation window is load-bearing

The pipeline aggregates events older than a safe margin behind now() so the window is by construction no longer receiving writes. This is the immutable aggregation window property. Consequences:

  • No distributed locking. Multiple rollup workers can aggregate the same key concurrently — they see the same events and compute the same value.
  • Safe horizontal scaling. Rebalance + rolling deploy of the rollup tier is safe; temporary duplication of work is correctness-preserving.
  • Bounded staleness. Reads lag real time by acceptLimit plus the time since the last rollup, typically seconds.
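The no-locking property follows from determinism over a frozen window: any two workers that compute the same window bounds see the same events and produce the same value. A minimal sketch, with hypothetical names and an assumed `ACCEPT_LIMIT_S`:

```python
import time

ACCEPT_LIMIT_S = 5  # assumed: events newer than this may still arrive, so exclude them

def immutable_window(last_rollup_ts, now=None):
    """Return the (exclusive, inclusive] time range that is safe to aggregate.

    Everything at or before now() - ACCEPT_LIMIT_S is, by construction, no
    longer receiving writes, so the range's contents are frozen.
    """
    now = time.time() if now is None else now
    return last_rollup_ts, now - ACCEPT_LIMIT_S

def aggregate(events, last_value, window):
    """Fold the frozen window onto the previous checkpoint value."""
    lo, hi = window
    return last_value + sum(e["delta"] for e in events if lo < e["ts"] <= hi)
```

Because `aggregate` is a pure function of a frozen input, concurrent duplicate runs are wasted work but never wrong answers.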

For sub-second accuracy, layer a real-time delta on top: current = lastRollupValue + aggregate(events from lastRollupTs to now). Effective when the key is actively accessed so the delta window stays narrow. Netflix flags this as experimental in the Counter post.
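The delta-layer read above (current = lastRollupValue + aggregate of the tail since lastRollupTs) can be sketched as follows; `rollup_store` and `event_store` are the same hypothetical interfaces, not a real API:

```python
import time

def read_current(key, rollup_store, event_store):
    """Real-time read: checkpoint plus a live delta over the un-rolled-up tail.

    The tail scan is cheap only while (now - lastRollupTs) stays small,
    i.e. while the key is actively accessed and in rollup circulation.
    """
    last_value, last_ts = rollup_store.get(key)          # (lastRollupValue, lastRollupTs)
    tail = event_store.scan(key, after=last_ts, until=time.time())
    return last_value + sum(e["delta"] for e in tail)
```

Note the tail may include mutable, still-arriving events, which is why this path trades some recount stability for freshness, consistent with it being flagged experimental.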

Drain-vs-circulate discipline

Keys with different access patterns need different rollup policies:

  • Low-cardinality + frequently-accessed — keep in constant rollup circulation so lastRollupTs stays close to now(). This prevents the next read from having to scan many time partitions.
  • High-cardinality + sparse-access — drain out of circulation once the last-write-timestamp stops advancing; otherwise the aggregator's in-memory queues grow without bound.

A last-write-timestamp column on the Rollup Store (written via Cassandra USING TIMESTAMP / LWW) discriminates between the two. Netflix's pipeline inspects it to decide whether to re-queue a counter.
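The re-queue decision reduces to comparing the last-write-timestamp against an idle threshold. A minimal sketch; `IDLE_DRAIN_S` is an assumed policy value, not one from the source:

```python
import time

IDLE_DRAIN_S = 300  # assumed: drain keys that have seen no writes for 5 minutes

def should_requeue(last_write_ts, now=None):
    """Decide whether to keep a key in rollup circulation.

    Keys whose last-write-timestamp keeps advancing stay circulating, so
    lastRollupTs tracks now(); idle keys drain, keeping in-memory queues bounded.
    """
    now = time.time() if now is None else now
    return (now - last_write_ts) < IDLE_DRAIN_S
```

A drained key is not lost: its next write or read emits a fresh rollup event, which puts it back into circulation.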

When to use

  • Metrics that need accuracy + auditability, not just approximate counts.
  • Write throughput too high for per-key CAS or locking.
  • Read latency needs a point-read, not a scan.
  • Bounded staleness (seconds) is acceptable.

When NOT to use

  • Real-time (sub-second) accuracy is a hard requirement and you don't want the complexity of a real-time delta layer.
  • Hard durability + consistency + real-time atomic semantics (e.g. financial ledgers with strict external ordering) — prefer an in-place counter with CAS + conditional writes.
  • Only approximate counts are needed — a simpler in-memory or cache-only best-effort counter is cheaper (see concepts/best-effort-vs-eventually-consistent-counter).

Seen in
