Telemetry buffer-and-flush¶
Buffer-and-flush is the telemetry pattern in which events are aggregated in-process on each emitter host, and the aggregated state is periodically flushed to the ingestion backend. It is the standard alternative to sampling when the raw event rate is too high for per-event logging but full-population fidelity is required.
The canonical framing comes from Meta's 2024-12-02 cryptographic monitoring post, where buffer-and-flush is the architectural choice that defeats the per-event cost of logging "millions of cryptographic operations per day" per host without losing the ability to see every key / algorithm combination in the fleet inventory.
The sampling vs buffer-and-flush choice¶
Two cost-reduction approaches when event rate exceeds what the ingestion backend can absorb:
| Property | Sampling | Buffer-and-flush |
|---|---|---|
| Per-event overhead | Random-decision gate | Map lookup + counter increment |
| Data volume at ingestion | 1/N of raw | Cardinality of aggregation key per flush interval |
| Full-population inventory visible? | No — rare keys may never be sampled | Yes — every unique event tuple appears |
| Per-event latency visibility | Preserved for sampled events | Lost — aggregation collapses timing |
| Per-event context preserved? | Yes (for sampled events) | No (only tuple + count) |
| Cost hinge | Fixed sampling ratio | Aggregation key cardinality vs raw event rate |
The Meta post's reasoning for preferring buffer-and-flush: "We felt strongly about not introducing any sampling since doing so would result in most logs being omitted, giving us a less clear picture of the library's usage."
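The per-event paths in the table can be contrasted in a few lines. This is an illustrative Python sketch with invented event tuples, not Meta's C++ implementation:

```python
import random
from collections import Counter

# Hypothetical stream: one (key name, algorithm, method) tuple firing
# a million times, as at a hot cryptographic call site.
events = [("master_key", "AES-256-GCM", "encrypt")] * 1_000_000

# Sampling gate: a random decision per event keeps ~1/N of the raw
# stream; each survivor retains full per-event context.
N = 10_000
sampled = [e for e in events if random.randrange(N) == 0]

# Buffer-and-flush: a map lookup + counter increment per event; what
# reaches ingestion is the key cardinality, not the event rate.
counts = Counter()
for e in events:
    counts[e] += 1

# sampled holds roughly 100 events; counts holds a single row whose
# count is exactly 1,000,000.
```

Note how the buffered path loses nothing at the population level (the count is exact) while the sampled path loses most events but keeps per-event context for the survivors.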
When buffer-and-flush wins. The aggregation key has low cardinality relative to event rate — e.g. (key name, algorithm, method) for FBCrypto, where the same tuple fires millions of times per day per host. The per-flush row count is dramatically smaller than the per-operation count; compression ratios can reach 6+ orders of magnitude.
When sampling wins. The aggregation key has high cardinality (e.g. per-request-ID events, per-user events where user is the key) — no meaningful compression. Or per-event context (timing, payload) matters more than population counts, so lossy-but-per-event-preserved is preferred over lossless-but-count-only.
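The cardinality hinge can be made concrete with a toy calculation. The figures below are illustrative only; Meta does not publish these numbers:

```python
# Illustrative figures only, not Meta's.
events_per_day = 200_000_000   # raw operations per host per day
flushes_per_day = 24           # hourly flush

# Low-cardinality key, e.g. (key name, algorithm, method): a handful
# of distinct tuples, so the row count is independent of event rate.
distinct_tuples = 8
rows_per_day = distinct_tuples * flushes_per_day
print(events_per_day // rows_per_day)   # ~1,000,000x compression

# High-cardinality key, e.g. per-request-ID: every event is its own
# tuple, the map grows with the stream, and the pattern degrades to
# per-event logging with zero compression.
```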
Mechanics¶
- In-process accumulator. A data structure keyed on the aggregation tuple, with a counter (or aggregate statistics) per entry. At Meta this is folly::ConcurrentHashMap, a concurrent map suited to write-heavy multithreaded access.
- On-event update. The emitting code path performs its work and does a map-lookup + counter-increment instead of an ingestion-framework call. Overhead shrinks from a network / serialisation call to a single write-locked hash-slot increment.
- Background flush loop. A background thread on a configurable period: reads the map, emits one ingestion event per entry including the count, clears the map.
- First-flush jitter. A randomised delay for the first flush per host prevents all hosts in a cohort from flushing at the same phase — see patterns/jittered-flush-for-write-smoothing. Canonicalised by Meta: "we distribute these spikes across time by applying a randomized delay on a per-host basis before logs are flushed for the first time."
- Shutdown-flush. A synchronous final flush on process exit drains remaining counts before termination. Non-trivial because the shutdown environment has constrained access to the ingestion framework's dependencies — see systems/folly-singleton.
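The four mechanics above can be sketched together. This is a minimal Python sketch under stated assumptions, not Meta's implementation (theirs is C++ using folly::ConcurrentHashMap flushing through Scribe; the class name and sink callable here are invented):

```python
import atexit
import random
import threading
from collections import Counter

class BufferingLogger:
    """Sketch of a buffer-and-flush accumulator: a locked dict stands
    in for folly::ConcurrentHashMap, a callable for the ingestion sink."""

    def __init__(self, sink, flush_interval_s=60.0, max_jitter_s=30.0):
        self._sink = sink                # callable taking (tuple, count)
        self._interval = flush_interval_s
        self._lock = threading.Lock()
        self._counts = Counter()
        # First-flush jitter: randomize the first flush per process so
        # a cohort of hosts does not hit the backend in phase.
        first_delay = random.uniform(0, max_jitter_s)
        self._timer = threading.Timer(first_delay + self._interval,
                                      self._flush_loop)
        self._timer.daemon = True
        self._timer.start()
        atexit.register(self.flush)      # shutdown-flush: drain on exit

    def log(self, event_tuple):
        # On-event update: map lookup + counter increment, no network call.
        with self._lock:
            self._counts[event_tuple] += 1

    def flush(self):
        # Swap the map out under the lock, then emit one row per entry.
        with self._lock:
            snapshot, self._counts = self._counts, Counter()
        for event_tuple, count in snapshot.items():
            self._sink(event_tuple, count)

    def _flush_loop(self):
        # Background flush loop: flush, then re-arm the timer.
        self.flush()
        self._timer = threading.Timer(self._interval, self._flush_loop)
        self._timer.daemon = True
        self._timer.start()
```

A sink that appends to a list is enough to exercise it: counts accumulate under log() and arrive as one row per tuple on flush().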
Trade-offs¶
- Freshness lag = flush interval. Downstream dashboards see counts only up to the most recent flush; real-time alerting must tolerate this lag or trigger on the most recent flushed window.
- Lost events on crash between flushes. Unlike a synchronous-log architecture, a crash loses the in-memory counts. For cryptographic monitoring this is acceptable — trend-level accuracy is what matters, not per-event durability.
- Aggregation-key drift. If the aggregation key is too narrow, cardinality explodes and the pattern degrades to per-event logging. Meta's derived-key aggregation is the canonical example of cardinality discipline: aggregate child-key events under the parent key to prevent KDF-heavy features from inflating the key space.
- Schema rigidity. Event-tuple schema becomes central to the ingestion pipeline + downstream queries; schema evolution requires coordination across emitters + flushers + consumers.
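The derived-key cardinality discipline can be sketched as a key-normalisation step in front of the accumulator. The "/derived/" naming convention below is a hypothetical stand-in; the source does not specify how child keys are identified:

```python
from collections import Counter

def aggregation_key(key_name, algorithm, method):
    # Collapse KDF-derived child keys onto their parent so per-child
    # tuples cannot inflate the map. The "parent/derived/<id>" scheme
    # is invented for illustration, not Meta's actual convention.
    parent = key_name.split("/derived/", 1)[0]
    return (parent, algorithm, method)

counts = Counter()
for child_id in range(100_000):     # 100k distinct derived keys...
    event = (f"master/derived/{child_id}", "HKDF-SHA256", "derive")
    counts[aggregation_key(*event)] += 1

# ...collapse to a single (master, HKDF-SHA256, derive) row with a
# count of 100,000, instead of 100,000 one-count rows.
```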
Relation to sampling-based observability¶
Sampling-based observability pipelines (trace sampling, log sampling, APM-style) and buffer-and-flush pipelines coexist in modern telemetry stacks: sampling is applied to high-cardinality per-event-context data (traces with latency percentiles); buffer-and-flush is applied to low-cardinality high-frequency counter data (fleet-wide call-site inventories). They are not competitors — they are orthogonal axes of the observability-cost problem.
Seen in¶
- sources/2024-12-02-meta-built-large-scale-cryptographic-monitoring — canonical wiki disclosure. Meta's FBCrypto uses an in-process folly::ConcurrentHashMap keyed on event tuples, with a background flush thread through Scribe. The choice is framed in explicit contrast to sampling: "instead, the logging uses a 'buffering and flushing' strategy."
Related¶
- concepts/cryptographic-monitoring — the canonical consumer of buffer-and-flush at Meta.
- concepts/observability — the broader telemetry framing.
- concepts/unified-library-leverage — why library-resident buffer-and-flush is cheap at hyperscale.
- patterns/aggregating-buffered-logger — the concrete architectural pattern.
- patterns/jittered-flush-for-write-smoothing — the write-spike-smoothing discipline paired with this pattern.
- systems/fbcrypto, systems/scribe-meta, systems/folly-concurrenthashmap — Meta's concrete stack.