PATTERN Cited by 1 source

Fire-and-forget rollup trigger

Problem

Writes to an event-log-based aggregation system need to schedule a background rollup for the affected key, but must not block on it: the write path is latency-critical and the rollup tier runs asynchronously. Losing the trigger must also not lose data, which holds because the underlying event store is already durable.

Pattern

After durably persisting the event, send a lightweight rollup trigger to the rollup tier, fire-and-forget. The write completes the moment the event is durable in the event store; the rollup trigger is a best-effort signal saying "this key has changed, please re-aggregate".

Canonical shape (Netflix Distributed Counter):

1. Client calls AddCount(namespace, counter, delta, token).
2. Service writes the event to TimeSeries durably.
3. Service updates last-write-timestamp on the Rollup Store (Cassandra USING TIMESTAMP = the event's event_time); this write is last-write-wins (LWW) and also durable.
4. Service sends {namespace, counter} to the Rollup tier fire-and-forget (no ACK, no retry).
5. Service returns success to the client.
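
The write path above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Netflix's implementation: `CounterService`, `add_count`, the in-memory list and dict standing in for TimeSeries and the Rollup Store, and the queue bound of 1024 are all hypothetical names and values chosen for the example.

```python
import queue
import time
from dataclasses import dataclass, field


@dataclass
class CounterService:
    """Hypothetical in-memory stand-ins for the three pieces the write path touches."""
    events: list = field(default_factory=list)          # stand-in for the durable event store
    last_write_ts: dict = field(default_factory=dict)   # stand-in for the Rollup Store (LWW)
    rollup_queue: queue.Queue = field(
        default_factory=lambda: queue.Queue(maxsize=1024)  # assumed bound, not from the source
    )

    def add_count(self, namespace, counter, delta, token, event_time=None):
        event_time = time.time() if event_time is None else event_time
        key = (namespace, counter)

        # Durable write: in a real system this call returns only once the
        # event is persisted. Here it is just a list append.
        self.events.append((key, delta, token, event_time))

        # Last-write-wins update: keep the largest event_time seen for the key,
        # mimicking a write issued with an explicit timestamp.
        if event_time > self.last_write_ts.get(key, float("-inf")):
            self.last_write_ts[key] = event_time

        # Fire-and-forget trigger: never block, never retry, ignore failure.
        try:
            self.rollup_queue.put_nowait(key)
        except queue.Full:
            pass  # trigger dropped; the event itself is already recorded

        return True
```

Note that a full queue changes nothing about the result returned to the caller: the event and the last-write-timestamp are recorded either way, and only the re-aggregation hint is lost.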

The rollup trigger is handled asynchronously by the Counter-Rollup server tier — see patterns/sliding-window-rollup-aggregation.

Why fire-and-forget works

Three properties make the trigger safe to drop:

The event is already durable in the event store — the primary source of truth isn't the trigger, it's the event log.
Reads emit triggers too. Netflix's GetCount also fires a rollup event, so an infrequently-accessed counter whose write-path trigger was lost self-heals on the next read.
last-write-timestamp drives rollup circulation. Counters whose pending events haven't been aggregated stay in circulation until they catch up, giving the rollup tier an independent signal besides the in-memory trigger queue.

Drop rate in steady state is low because the trigger is delivered in-process or in-cluster, and even when an instance crashes the only cost is delayed aggregation: no data loss, no double-counting.
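
The self-healing behaviour can be illustrated with a toy rollup tier. This is a sketch under simplifying assumptions, not the Counter-Rollup server: `RollupTier`, `run_once`, and the `up_to` parameter are hypothetical, and the rollup simply re-sums the whole event log for a key (which is idempotent, so repeated triggers cannot double-count) rather than aggregating incrementally.

```python
from dataclasses import dataclass, field


@dataclass
class RollupTier:
    """Toy rollup tier; all names here are illustrative assumptions."""
    events: list                 # shared event log of (key, delta, event_time)
    last_write_ts: dict          # key -> latest event_time (durable in a real system)
    rollup_value: dict = field(default_factory=dict)
    last_rollup_ts: dict = field(default_factory=dict)
    pending: set = field(default_factory=set)   # in-memory triggers, lost on crash

    def trigger(self, key):
        self.pending.add(key)

    def get_count(self, key):
        # Reads fire a trigger too, so a counter whose write-path
        # trigger was dropped is picked up on the next read.
        self.trigger(key)
        return self.rollup_value.get(key, 0)

    def run_once(self, up_to):
        """Aggregate every pending key over events with event_time <= up_to."""
        for key in list(self.pending):
            self.rollup_value[key] = sum(
                delta for k, delta, ts in self.events if k == key and ts <= up_to
            )
            self.last_rollup_ts[key] = up_to
            # Leave circulation only once the rollup has caught up with
            # the last write; otherwise stay pending for the next pass.
            if up_to >= self.last_write_ts.get(key, float("-inf")):
                self.pending.discard(key)
```

For example, if an event is in the log but its trigger never arrived, `run_once` does nothing for that key; the first `get_count` returns the stale value and schedules the key, and the next `run_once` brings the aggregate up to date.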

Where it can bite

Netflix's post names three caveats:

In-memory queues lose triggers on instance crash. The first version of Counter uses simple in-memory queues "to reduce provisioning complexity, save on infrastructure costs, and make rebalancing fairly straightforward." Durable queues and rollup handoffs are named as future work.
Infrequently-accessed counters can stay stale longer, because there's no read to self-heal.
No observability of dropped triggers — fire-and-forget by definition has no ACK.
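
One partial mitigation for the observability gap, not described in the Netflix post and offered here only as an assumption, is to count the drops the sender itself can see, such as a full local queue. The sketch below uses a hypothetical `CountingTriggerSender`; it stays fire-and-forget, and it cannot see triggers lost after enqueue (for example on instance crash).

```python
import queue


class CountingTriggerSender:
    """Hypothetical sender: still fire-and-forget, but counts sender-side drops."""

    def __init__(self, maxsize):
        self._queue = queue.Queue(maxsize=maxsize)
        # Plain ints for illustration; a real service would export these
        # through its metrics library and guard them against races.
        self.sent = 0
        self.dropped = 0

    def send(self, key):
        try:
            self._queue.put_nowait(key)   # never blocks the write path
            self.sent += 1
        except queue.Full:
            self.dropped += 1             # visible drop: local queue was full
```

With a queue bound of 2 and five sends and no consumer draining the queue, `sent` ends at 2 and `dropped` at 3.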

When to use

Event-log-based aggregation where the event store is already the durable record.
Rollup tier is sized for bursts and doesn't need strong ordering of triggers.
Bounded-staleness reads are acceptable.

When NOT to use

Rollup must run before the write is acknowledged (e.g. a contract requires the aggregate to reflect the new value immediately) — prefer synchronous aggregation or an in-place counter with CAS.
No independent signal (read-triggered, last-write-timestamp) exists for self-healing — you'd risk permanent staleness on dropped triggers.

Seen in

sources/2024-11-13-netflix-netflixs-distributed-counter-abstraction — canonical wiki instance. Post-durability rollup trigger on writes + read-triggered rollups + last-write-timestamp as independent signal.