
META 2024-12-02


Meta — How Meta built large-scale cryptographic monitoring

Summary

A 2024-12-02 Meta Engineering post (authored by the CryptoEng team — Grace Wu, Ilya Maykov, Srinivas Murri, Isaac Elbaz acknowledged) describing the telemetry system underneath FBCrypto, Meta's managed cryptographic library. The monitoring system logs every cryptographic operation across Meta's fleet — with no sampling — so that Meta can (a) detect key overuse and rotate proactively, (b) build a living inventory of which call-sites use which algorithms (essential for deprecating weakened primitives + staging post-quantum migration), and (c) use call-volume / success-rate as a proxy for client health during large-scale migrations. Naive per-event logging through Scribe would be cost-prohibitive — Meta discloses that "roughly 0.05% of CPU cycles at Meta are spent on X25519 key exchange" alone — so FBCrypto uses an in-process buffered-and-flushed aggregating logger: each cryptographic event increments a count keyed on (key-name, method, algorithm, …) in a folly::ConcurrentHashMap, and a background thread periodically flushes the counts to Scribe, which persists them to Scuba (warm) and Meta's Hive warehouse (cold). Three supporting optimisations are disclosed: partially randomised first-flush delay per host to break synchronised spikes across fleets that restart together; derived-key aggregation (count against the parent keyset rather than emit one row per KDF-derived child keyset) to cut logging volume for features that generate millions of keys; and reliance on folly::Singleton + synchronous shutdown flush to drain remaining buffered counts when a job exits. The post is also a clean canonical statement of the "unified offerings are key" pattern — Scribe + FBCrypto being fleet-wide means one implementation benefits the whole company.
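
The aggregating-logger primitive described above can be sketched as follows. This is a Python stand-in for what the post describes as a C++ implementation on folly::ConcurrentHashMap; the class, the lock-around-a-Counter concurrency scheme, and the `flush_fn` callback are all illustrative, not FBCrypto's actual API:

```python
import threading
from collections import Counter

class AggregatingLogger:
    """In-process buffer: one counter per unique event tuple,
    flushed periodically instead of one log row per operation."""

    def __init__(self, flush_fn, interval_s=60.0):
        # (key_name, method, algorithm) -> count; the post notes the
        # real event tuple has more fields than these three.
        self._counts = Counter()
        self._lock = threading.Lock()   # crude stand-in for folly::ConcurrentHashMap
        self._flush_fn = flush_fn       # e.g. a Scribe write; hypothetical here
        self._interval_s = interval_s   # no real interval is disclosed

    def log(self, key_name, method, algorithm):
        # Hot path: increment an in-memory count, no I/O.
        with self._lock:
            self._counts[(key_name, method, algorithm)] += 1

    def flush(self):
        # Swap the buffer out under the lock, then emit outside it.
        with self._lock:
            snapshot, self._counts = self._counts, Counter()
        for event, count in snapshot.items():
            self._flush_fn(event, count)

rows = []
logger = AggregatingLogger(lambda event, count: rows.append((event, count)))
for _ in range(5):
    logger.log("myKeyName", "encrypt", "AES-GCM-SIV")
logger.flush()
# the five identical operations collapse into a single row with count=5
```

The payoff is that flush volume scales with the number of unique event tuples per interval, not with operation count — the compression the post relies on given "millions of cryptographic operations per day" per machine.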

Key takeaways

  1. Why cryptographic monitoring matters enough to log everything. Four named use cases: proactive detection / removal of weak algorithms, change-safety for library rollouts, key-overuse → key-rotation (symmetric keys have a finite data-volume budget before security degrades), and building an inventory of call-sites so that emergency algorithm migrations (e.g. a primitive is broken) and planned migrations (e.g. post-quantum readiness in asymmetric use cases) can be scoped and tracked. Meta explicitly ties this to its earlier post-quantum TLS post: the monitoring dataset "improves our decision-making process while prioritizing quantum-vulnerable use cases." (Source).
  2. Sampling was rejected. The obvious cost mitigation — log 1 in X operations — was explicitly refused: "we felt strongly about not introducing any sampling since doing so would result in most logs being omitted, giving us a less clear picture of the library's usage." Canonical wiki instance of the buffer-and-flush alternative to sampling when the event rate is too high for per-event logging but the full distribution is required (Source).
  3. Buffer-and-flush aggregation is the primitive. Every crypto operation increments a count keyed on the event's identifying fields (key name, method, algorithm — "in practice we log more fields than just key name, method, and algorithm"). On a configurable interval a background thread serialises the accumulated counts to Scribe and clears the map. One row per unique event-tuple per flush, with a count, instead of one row per operation. Meta's illustrative example: the same AES-GCM-SIV encryption under key name myKeyName happening five times aggregates to a single row with count=5. "Machines often compute millions of cryptographic operations per day" — the compression ratio is large (Source; patterns/aggregating-buffered-logger).
  4. Scale context: ~0.05% of Meta CPU is X25519. Meta discloses an operational number that reframes what "widely-distributed library" means in a cryptography context: "we recently disclosed that roughly 0.05% of CPU cycles at Meta are spent on X25519 key exchange." A library this hot cannot afford a Scribe call per operation — the aggregation is architectural necessity, not optimisation (Source; concepts/cryptographic-monitoring).
  5. Aggregation lives client-side, inside FBCrypto. The buffered logger sits on each client host; the FBCrypto public API is unchanged ("the logging does not change the interface of FBCrypto, so all of this is transparent to the clients of the library"). Multithreaded clients share a single buffer per process; a background thread owns the periodic Scribe flush. Correctness under concurrent writes requires the right data structure — Meta uses folly::ConcurrentHashMap from folly, "built to be performant under heavy writes in multithreaded environments, while still guaranteeing atomic accesses" (Source; concepts/unified-library-leverage).
  6. Partially randomised first-flush per host prevents thundering-herd writes. A naive fixed-interval flush produces spiky writes when a large number of hosts start together — the whole cohort flushes at the same phase. Meta applies a randomised delay on a per-host basis before the first flush, which spreads the spike into a uniform arrival rate at Scribe: "this leads to a more uniform flushing cadence, allowing for a more consistent load on Scribe." Canonical wiki instance of patterns/jittered-flush-for-write-smoothing applied to library-level telemetry rather than network-level retries (Source).
  7. Derived-key aggregation — logging discipline for KDF-heavy features. FBCrypto's derived crypto feature lets clients derive "child" keysets from a "parent" via a KDF + salt, used by features "that need to generate millions of keys." Initial logging emitted one row per child keyset, exploding buffer size + downstream storage. Fix: aggregate child-key operations under the parent key's name. The aggregation is pessimistic for child-key overuse detection (the parent counter is an upper bound on any child's count), so key-overuse alarms are conservative rather than missing (Source; concepts/derived-key-aggregation).
  8. Storage tiering: Scribe → Scuba (warm) + Hive (cold). Scribe is Meta's canonical log-ingestion framework (cited to Meta's 2019 Scribe post). From Scribe, flushed data lands in Scuba ("optimized to be performant for real-time data (i.e., warm storage) and can be inefficient if used for larger datasets") for interactive analysis, and in Hive tables — cited to Meta's 2010 Hive paper — for longer-term / cold storage. Meta has "occasionally" had to co-manage capacity with the Scribe team as crypto usage growth outruns expected rates (Source).
  9. Flush-on-shutdown is surprisingly tricky. Beyond the periodic timer flush, jobs do one final synchronous flush on shutdown so in-memory counts are not lost. Meta calls out "the nuances of folly::Singleton" — Meta's canonical singleton-lifecycle primitive — because accessing Scribe + its dependencies during shutdown is non-trivial. For Java clients the analogous constraint is keeping shutdown hooks to "synchronous I/O code and operating quickly." Useful canonical wiki note on the shutdown-environment constraint for any telemetry system that accumulates state in-memory (Source).
  10. Why unified offerings matter here. FBCrypto + Scribe being company-wide means "solutions only have to be implemented once in order for the entire company to benefit." The crypto monitoring system piggybacks on Scribe for ingestion ("most machines in Meta's fleet can log to Scribe") and exploits FBCrypto's wide adoption for coverage ("the wide adoption of FBCrypto gives us insights into cryptographic operations without needing clients to migrate to a new library/API"). Canonical wiki instance of patterns/unified-library-for-fleet-telemetry: the logging-side payoff of a monoculture crypto library is fleet-wide observability for free — "this helps us avoid fragmentation that might require multiple custom solutions to be implemented" (Source).
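
The jittered first-flush in takeaway 6 amounts to a one-line per-host randomisation. A minimal sketch (Python; the 60 s interval is illustrative — the post discloses no flush-cadence numbers):

```python
import random

FLUSH_INTERVAL_S = 60.0  # illustrative; no real interval is disclosed

def first_flush_delay(interval_s=FLUSH_INTERVAL_S, rng=random):
    """Per-host random offset applied before the first flush only.
    Subsequent flushes run at the fixed interval, so hosts that
    restarted together stay permanently de-phased."""
    return rng.uniform(0.0, interval_s)

# 10,000 hosts restarting at t=0 land spread across one full interval
# instead of all flushing at the same phase.
delays = [first_flush_delay() for _ in range(10_000)]
assert all(0.0 <= d <= FLUSH_INTERVAL_S for d in delays)
```

Because every later flush fires at `first_delay + k * interval`, randomising only the first one is enough to turn a synchronised cohort spike into a roughly uniform arrival rate at Scribe.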
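
Derived-key aggregation (takeaway 7) is likewise a small change to the aggregation key: resolve a child keyset to its parent before counting. A sketch under an assumed naming scheme — the `"/derived/"` convention and `parent_of` helper are hypothetical, since the post does not disclose how FBCrypto names derived keysets:

```python
from collections import Counter

def parent_of(keyset_name):
    # Hypothetical naming scheme: a KDF-derived child keyset carries a
    # "<parent>/derived/<salt>" style name. Real FBCrypto naming undisclosed.
    return keyset_name.split("/derived/", 1)[0]

counts = Counter()

def log_operation(keyset_name, method, algorithm):
    # Count against the parent keyset, so millions of KDF-derived
    # children collapse into a single aggregation key.
    counts[(parent_of(keyset_name), method, algorithm)] += 1

for salt in range(1000):
    log_operation(f"featureKey/derived/{salt}", "encrypt", "AES-GCM")
# one buffer entry with count=1000 instead of 1000 entries with count=1;
# the parent total is an upper bound on any single child's count
```

This is the pessimistic trade the caveats section notes: overuse alarms keyed on the parent can only over-fire, never miss, but the per-child distribution is lost.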

Architectural numbers + operational notes (from source)

  • CPU cost: "roughly 0.05% of CPU cycles at Meta are spent on X25519 key exchange" — a single asymmetric primitive already measurable at fleet level.
  • Per-host event rate: "machines often compute millions of cryptographic operations per day" — the baseline justifying buffer-and-flush over per-event logging.
  • Logging fidelity: no sampling. This is stated as a first-class design property, not merely a result.
  • Concurrency primitive: folly::ConcurrentHashMap — heavy-write multithreaded environment, atomic accesses.
  • Singleton primitive: folly::Singleton — lifecycle-managed singletons; material on shutdown.
  • Named impact avenues: proactive vulnerability mitigation (pre-PQC inventory + migration prioritisation); infrastructure reliability (success-rate + call-volume as client-health proxy during migrations); library-versioning visibility (real-time view of what binary version each fleet host is running).
  • Named challenges: capacity on Scribe + Scuba (occasional spikes worked through with the Scribe team); shutdown-environment nuances on Meta's canonical singleton library.
  • Future work: further Scribe throughput + Scuba utilisation optimisation; continuing PQC-inventory work; end-to-end crypto latency understanding; unifying the remaining non-FBCrypto cryptographic use-cases to reach full coverage.
  • No absolute flush-interval numbers, no jitter-window numbers, no buffer-size numbers, no row-per-host-per-flush numbers, no Scribe bytes/s numbers disclosed.
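
The shutdown-flush constraint noted above can be illustrated with a process-exit hook. This is a Python `atexit` stand-in for the folly::Singleton shutdown path the post describes; the class and its fields are illustrative:

```python
import atexit

class BufferedTelemetry:
    """Accumulates counts in memory; must drain them before exit."""

    def __init__(self):
        self.pending = {}   # event tuple -> count
        self.flushed = []   # stand-in for rows written to Scribe
        # Register a final synchronous flush at process exit; stand-in
        # for the folly::Singleton shutdown hook described in the post.
        atexit.register(self.flush)

    def log(self, event):
        self.pending[event] = self.pending.get(event, 0) + 1

    def flush(self):
        # Must be synchronous and quick: background threads and some
        # dependencies may already be torn down during shutdown.
        for event, count in self.pending.items():
            self.flushed.append((event, count))
        self.pending.clear()

t = BufferedTelemetry()
t.log(("myKeyName", "encrypt", "AES-GCM-SIV"))
t.flush()  # normally triggered by the periodic timer or the exit hook
```

The hard part the post gestures at is not the hook itself but that the flush target (Scribe and its dependencies) must still be reachable in the shutdown environment — the reason "the nuances of folly::Singleton" get a call-out.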

Systems / hardware extracted

New wiki pages:

  • systems/fbcrypto — Meta's managed cryptographic library; the telemetry source in this architecture.
  • systems/scribe-meta — Meta's standard logging framework (stub, cited by name); the ingestion substrate.
  • systems/scuba-meta — Meta's warm data store for real-time analytics; short-term storage.
  • systems/meta-hive — Meta's Hive-based data warehouse (cited to the 2010 Hive paper); cold storage.
  • systems/folly-concurrenthashmap — the concurrent-map primitive the buffer is implemented on top of.
  • systems/folly-singleton — Meta's canonical singleton primitive; referenced because shutdown correctness depends on it.

Existing pages reinforced:

  • systems/apache-hive — extended with Meta's Hive usage citation (distinct from the Facebook-origin framing via the 2010 paper). Meta's Hive here is the canonical Meta internal data-warehouse layer.

Concepts + patterns extracted

New concept pages:

  • concepts/cryptographic-monitoring — the umbrella primitive: persistent logs of every cryptographic operation fleet-wide, driving inventory + overuse-detection + migration prioritisation.
  • concepts/telemetry-buffer-and-flush — aggregate-in-memory + periodic-flush as a volume-reduction technique that preserves full-population fidelity (no sampling).
  • concepts/derived-key-aggregation — count KDF-derived child-key operations against the parent-keyset name to cut cardinality with conservative overuse-detection semantics.
  • concepts/key-overuse-detection — the operational primitive the full monitoring stack exists to feed: observe cumulative operations-per-key, rotate before the per-key data-volume budget is exhausted.
  • concepts/unified-library-leverage — the strategic payoff of a monoculture core library: a one-time implementation change (here: adding telemetry) provides coverage + consistency across the whole company without per-client migration.
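
The key-overuse-detection concept reduces to comparing a key's cumulative operation count against a per-key budget. A minimal sketch — the 2^32 budget and 80% headroom are hypothetical placeholders (real limits depend on the algorithm and mode, e.g. NIST bounds on AES-GCM invocations per key; the post discloses no thresholds):

```python
# Hypothetical per-key budget; real limits are algorithm/mode dependent.
KEY_OPERATION_BUDGET = 2**32

def keys_needing_rotation(cumulative_counts, budget=KEY_OPERATION_BUDGET,
                          headroom=0.8):
    """Flag keys whose lifetime operation count (summed from flushed
    telemetry rows) is approaching the budget, so rotation happens
    before the per-key data-volume budget is exhausted."""
    return [key for key, total in cumulative_counts.items()
            if total >= headroom * budget]

usage = {"keyA": 4_000_000_000, "keyB": 10_000}
assert keys_needing_rotation(usage) == ["keyA"]
```

With derived-key aggregation feeding this check, `cumulative_counts` holds parent-keyset totals, so the alarm is conservative: it fires on the upper bound of any child's usage.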

New pattern pages:

  • patterns/aggregating-buffered-logger — in-memory count aggregation keyed on the event tuple, flushed on an interval; replaces one log row per operation with one row per unique tuple per flush.
  • patterns/jittered-flush-for-write-smoothing — per-host randomised first-flush delay to de-phase cohorts of hosts that restart together.
  • patterns/unified-library-for-fleet-telemetry — instrument the one company-wide library once and gain fleet-wide observability without per-client migration.

Existing concept extended:

  • concepts/post-quantum-cryptography — extended with Meta's 2024 cryptographic-monitoring post as a fleet-inventory producer feeding PQC migration prioritisation. This is a distinct framing from Cloudflare's deployment-roadmap framing and GitHub's per-protocol rollout framing: the prerequisite for migration is knowing where classical asymmetric primitives are in use — answered by the Meta monitoring dataset at industrial scale.

Caveats

  • Announcement-voice post, not an academic paper or SIGCOMM-style retrospective. High architectural substance (aggregation strategy, data-structure choice, jittered flush, derived-key aggregation, shutdown constraint) but no numeric flush-cadence / buffer-size / volume-reduction-ratio figures.
  • No performance overhead number disclosed. "Negligible impact on compute performance across Meta's fleet" is claimed but not quantified per-operation or per-host.
  • No list of what exactly is in the event tuple. "In practice we log more fields than just key name, method, and algorithm" — the full schema is not disclosed.
  • No Scribe ingest rate number, no Scuba-row-count number, no Hive-retention-window number.
  • Unification is incomplete. "There are other cryptographic use cases across Meta that use a different set of tools for telemetry" — FBCrypto coverage is not 100% of Meta's cryptographic footprint; full unification is explicitly named as future work.
  • Child-key visibility is pessimistic. The parent-keyset aggregation is a safe upper bound for per-child overuse but hides the per-child rate distribution. Whether this gives up too much detection precision in some threat models isn't discussed.

Cross-wiki context

  • Meta axis. This is the ninth first-party Meta Engineering ingest on the wiki (after Presto 2023-07-16 via HS, LLM-training 2024-06-12, MLow 2024-06-13, AI-capacity-maintenance 2024-06-16, DCPerf + RoCE SIGCOMM 2024-08-05, RCA 2024-08-23, PAI 2024-08-31). It opens a security-infrastructure axis on the Meta page — privacy-enforcement (2024-08-31 PAI) and crypto-observability (this post) are now the two wiki-canonical Meta-security framings.
  • Cryptography axis. Canonical wiki complement to the PQC-deployment posts (GitHub + Cloudflare + Google): Meta's post is the inventory prerequisite that makes prioritisation of those rollouts tractable. No direct contradiction — the PQC threat-model pages describe what must migrate; this post describes how you know where to migrate at hyperscale.
  • Observability axis. Complements concepts/observability's log/metric/trace framing with a specific cardinality-management primitive (aggregating buffered logger) that the log vs metric debate's "too-high cardinality for metrics, too-high volume for logs" middle-ground case lands in.
  • Fleet-telemetry axis. Complements the Figma Response Sampling (sources/2026-04-21-figma-visibility-at-scale-sensitive-data-exposure) post — Figma samples HTTP responses for PII detection because full-population logging of bodies is infeasible; Meta aggregates instead of samples because the aggregation key has low cardinality relative to event volume. Two different cost/fidelity resolutions for the same conceptual problem.
  • Unified-library axis. Canonical Meta-wiki instance of the "build primitives once at the company level and let the whole fleet benefit" pattern — patterns/unified-library-for-fleet-telemetry is the telemetry-side crystallisation, in the same shape as Meta's Scribe-for-all / Tupperware-for-all / FBCrypto-for-all strategic posture across the Meta corpus.
