

Jittered flush for write smoothing

Pattern. When many hosts run identical background flush loops on a fixed interval, a randomised per-host delay on the first flush breaks cohort synchronisation and converts a spiky ingestion-backend write pattern into a uniform one. No other change to the flush schedule is needed; subsequent flushes inherit the per-host phase offset automatically.

Canonical wiki instance from Meta's 2024-12-02 cryptographic monitoring post: "To normalize our write throughput, we distribute these spikes across time by applying a randomized delay on a per-host basis before logs are flushed for the first time. This leads to a more uniform flushing cadence, allowing for a more consistent load on Scribe."

Problem

A fleet of N hosts each runs a periodic flush thread with interval T. If any event synchronises the hosts' start times — "certain clients who were running jobs that restarted large sets of machines at around the same time" — then all N hosts flush at the same phase. The ingestion backend sees spikes of ~N writes at intervals of T, followed by long troughs.
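A toy simulation (fleet size and interval are illustrative, not from the post) makes the spike shape concrete: with a synchronised start, every second is either a burst of N writes or empty.

```python
import collections

N_HOSTS, T, HORIZON = 1000, 60, 600   # illustrative: 1000 hosts, 60 s interval, 10 min window

# Every host starts at phase 0, so all flushes land on the same seconds.
arrivals = collections.Counter()
for _ in range(N_HOSTS):
    t = 0
    while t < HORIZON:
        arrivals[t] += 1
        t += T

peak = max(arrivals.values())
mean = sum(arrivals.values()) / HORIZON
print(peak, round(mean, 1))  # 1000 16.7 — the peak is 60x the per-second mean
```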

Spikes are bad:

  • Capacity planning must absorb the peak, which is much higher than the mean — wasteful.
  • Spike writes trigger admission control or queue exhaustion at the ingestion layer, amplifying into latency or drops.
  • Downstream storage tiers (warm store, cold store) see the same spiky pattern — Meta notes that this same dynamic "occasionally put an increased load on Scuba".

Fix

Each host, on startup, computes a random delay D ∈ [0, T) and delays its first flush by D. After the first flush, all subsequent flushes occur every T, so each host keeps its random phase relative to T. For N hosts with uniformly distributed D, the ingestion backend sees an expected arrival rate of N/T writes per second instead of N writes every T seconds.
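A minimal per-host sketch (the interval value and function name are illustrative): the random draw happens once, before the first flush, and every later flush inherits that phase.

```python
import random

def flush_times(interval_s, horizon_s, rng=random):
    """Flush timestamps for one host: one random first-flush delay, then a fixed cadence."""
    d = rng.uniform(0.0, interval_s)   # D in [0, T), drawn once at startup
    t, times = d, []
    while t < horizon_s:
        times.append(t)
        t += interval_s                # subsequent flushes keep the phase D
    return times

ts = flush_times(60.0, 300.0, random.Random(7))
gaps = [b - a for a, b in zip(ts, ts[1:])]
print(ts[0] < 60.0, all(abs(g - 60.0) < 1e-9 for g in gaps))  # True True
```

In a real daemon the same schedule falls out of a background thread that sleeps D once, then sleeps T in a loop.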

Why this is sufficient

The fixed interval T acts as a drift-free clock; the random initial offset sets each host's phase once, and that phase persists indefinitely. No continuous clock-sync or coordination is needed after startup. If a host restarts, it picks a new D — which if anything adds more entropy to the fleet's phase distribution.
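Repeating the earlier toy simulation with a uniform per-host D (same illustrative sizes, fixed seed for repeatability) shows the spike collapsing toward the N/T mean:

```python
import collections
import random

N_HOSTS, T, HORIZON = 1000, 60, 600
rng = random.Random(42)                # fixed seed so the run is repeatable

arrivals = collections.Counter()
for _ in range(N_HOSTS):
    t = rng.uniform(0.0, T)            # per-host phase D in [0, T)
    while t < HORIZON:
        arrivals[int(t)] += 1          # bucket arrivals by second
        t += T

peak = max(arrivals.values())
print(peak)  # close to the mean N/T ~ 17, nowhere near the synchronised peak of 1000
```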

Distinct from

  • Exponential-backoff retry jitter (AWS SDK, HTTP clients): jitters per retry to defeat re-synchronised retry storms. Different failure mode, different scope (per-request, not per-host).
  • Token-bucket rate limiting: smooths one client's burst, not a cohort's aligned arrivals.
  • Batching + coalescing: smooths arrival rate by widening batch intervals; doesn't address cohort synchronisation.
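For contrast, per-retry jitter redraws a random delay on every retry rather than fixing a persistent phase — a hedged sketch of AWS-style "full jitter" backoff (parameter names are illustrative):

```python
import random

def backoff_full_jitter(attempt, base_s=0.1, cap_s=10.0, rng=random):
    # Fresh random draw on every retry: the scope is one request's retry
    # schedule, not a host's steady flush phase.
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))

delays = [backoff_full_jitter(a, rng=random.Random(1)) for a in range(4)]
print(all(d <= 0.8 for d in delays))  # True: each draw is capped by base * 2^attempt
```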

Jittered flush for write smoothing is specifically about defeating cohort-synchronised cadence on shared ingestion — a failure mode that only shows up in fleet-wide telemetry architectures.

When to apply

  • A fleet of similar-role hosts runs the same background flush cadence with the same interval T.
  • The hosts start at approximately the same time due to cluster-wide job restarts, deployments, or region-wide events.
  • The downstream ingestion backend is sensitive to spikes — either it has admission control, or its storage layer degrades under bursty writes.

This describes essentially every fleet-wide buffered-logger deployment — Meta surfaces it explicitly in its FBCrypto telemetry post, but the same dynamic applies to any periodic flush loop (metrics exporters, buffered trace reporters, audit-log shippers).

Operational notes

  • Delay D should be uniform over [0, T). A non-uniform distribution biases load toward some phases; a fixed D collapses back to the synchronised case.
  • Per-process independent random seed. Using a deterministic seed (e.g. hostname hash) partially defeats the jitter — co-located processes would pick the same D.
  • Interval T itself is a separate tuning knob — shortening T reduces flush-to-ingestion lag at the cost of higher per-flush overhead (map-clear + background-thread work); this pattern is orthogonal to the T choice.
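The seeding point from the list above, sketched (hostname and interval are illustrative): an OS-entropy source keeps co-located processes independent, while a deterministic hostname-derived seed re-aligns them.

```python
import random
import zlib

INTERVAL_S = 60.0

# Good: per-process OS entropy, so co-located processes draw independent phases.
d_good = random.SystemRandom().uniform(0.0, INTERVAL_S)

# Bad (illustrative): a stable hostname-derived seed means every process on
# the host computes the identical D, partially re-synchronising the cohort.
seed = zlib.crc32(b"web-1234")          # stand-in for hashing socket.gethostname()
d_a = random.Random(seed).uniform(0.0, INTERVAL_S)
d_b = random.Random(seed).uniform(0.0, INTERVAL_S)
print(d_a == d_b)  # True: same seed, same phase
```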

Seen in

  • sources/2024-12-02-meta-built-large-scale-cryptographic-monitoring — canonical wiki instance. Meta applies per-host first-flush randomisation inside FBCrypto's buffered logger specifically because "certain clients who were running jobs that restarted large sets of machines at around the same time would have those machines' logs get flushed at about the same time. This would result in 'spiky' writes to the logging platform."