
PATTERN

Unified library for fleet telemetry

Pattern. Add telemetry inside a company-wide monoculture library rather than expecting every service to instrument itself. One instrumentation change yields fleet-wide observability: no client-side migration, no per-team coordination, no polyculture variance contaminating the dataset.

Canonical instance, from Meta's 2024-12-02 cryptographic monitoring post: "Most machines in Meta's fleet can log to Scribe, giving us easy log ingestion support. Furthermore, the wide adoption of FBCrypto gives us insights into cryptographic operations without needing clients to migrate to a new library/API."

Why library-resident telemetry

Three approaches exist for fleet-wide telemetry of a specific capability (here: cryptography):

| Approach | Per-client cost | Coverage | Schema consistency | Polyculture risk |
| --- | --- | --- | --- | --- |
| Each service instruments itself | High (per-service integration work) | Partial (long-tail services never integrate) | Weak (each team picks fields) | High (schema drift per team) |
| Centralised interceptor / sidecar | Medium (deploy sidecar) | Medium (sidecar must see the traffic) | Medium | Medium |
| Library-resident telemetry (this pattern) | Zero (transparent to callers) | Full (everyone who links the library is covered) | Strong (library team owns schema) | Low (one implementation) |
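The third row can be sketched in a few lines. This is a minimal illustration, not Meta's implementation: the function names (`encrypt`, `snapshot`) and the counter scheme are hypothetical, and a reversed byte string stands in for a real cipher. The point is structural: the counter lives inside the library, behind an unchanged public signature, so every caller is covered without doing any work.

```python
from collections import Counter

# Library-resident usage counters. Callers never see these; the library
# team owns the schema, so every linking service reports identically.
_usage = Counter()

def encrypt(data: bytes, *, algorithm: str = "aes-256-gcm") -> bytes:
    """Public API surface, unchanged for callers."""
    _usage[f"encrypt.{algorithm}"] += 1      # one counter bump per call
    return _do_encrypt(data, algorithm)

def _do_encrypt(data: bytes, algorithm: str) -> bytes:
    # Stand-in for the real primitive; irrelevant to the pattern.
    return data[::-1]

def snapshot() -> dict:
    """What the library would periodically flush to the ingestion substrate."""
    return dict(_usage)
```

A caller just calls `encrypt(payload)`; telemetry accrues as a side effect of linking the library, which is the whole point of the "zero per-client cost" column.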

Prerequisites

  • A monoculture library — the capability in question has a single dominant implementation across the fleet. Meta's post is explicit that FBCrypto is "used by the majority of our core infrastructure services" and that the team has invested in keeping it so.
  • A unified ingestion substrate — the library's telemetry has somewhere to go that every host can reach. Meta: Scribe. Without this, library-resident telemetry fragments again at the ingestion layer.
  • Stable-API library — library-side changes (adding telemetry, changing algorithms, emitting version info) happen beneath the same public surface, so callers never need to migrate.
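The second prerequisite can be made concrete with a small sketch. Everything here is hypothetical (`emit`, `drain`, the in-memory list standing in for a Scribe category): the prerequisite is only that the library writes structured records to one sink every host can reach, so telemetry does not re-fragment at the ingestion layer.

```python
import json
import time

# Stand-in for a fleet-wide log category (Scribe at Meta). In production
# this would be a durable, host-reachable ingestion endpoint, not a list.
_SINK: list = []

def emit(event: str, **fields) -> None:
    """Serialise one library telemetry record and write it to the shared sink."""
    record = {"ts": time.time(), "event": event, **fields}
    _SINK.append(json.dumps(record))

def drain() -> list:
    """What the downstream analysis pipeline would read back."""
    return [json.loads(line) for line in _SINK]
```

Because the library owns both the record shape and the sink choice, every host emits the same schema to the same place; without the shared sink, each team would pick its own destination and the fragmentation problem returns one layer down.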

What goes in the library vs what doesn't

Library-resident telemetry is right for:

  • Capability-specific usage telemetry — per-call counters for the primitives the library implements.
  • Library-version reporting — every binary reports which library version it's running, enabling real-time rollout tracking.
  • Security-relevant invariants — e.g. counters for deprecated algorithm usage, driving migration campaigns.
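The last two bullets combine naturally in one sketch. This is an illustrative stand-in, not FBCrypto's API: `hash_bytes`, `version_report`, and the version string are invented, and the deprecated set is an example. What it shows is the mechanism: usage of deprecated algorithms is counted inside the library, and every linking binary can report the library version it runs.

```python
import hashlib
import warnings
from collections import Counter

LIB_VERSION = "2.14.0"                  # hypothetical; stamped at build time
DEPRECATED = {"sha1", "md5"}
_deprecated_calls = Counter()

def hash_bytes(data: bytes, algorithm: str = "sha256") -> str:
    """Same public surface; deprecated-algorithm usage is counted in-library."""
    if algorithm in DEPRECATED:
        _deprecated_calls[algorithm] += 1        # data that drives migration campaigns
        warnings.warn(f"{algorithm} is deprecated", DeprecationWarning)
    return hashlib.new(algorithm, data).hexdigest()

def version_report() -> dict:
    """Every linking binary reports its library version: rollout tracking for free."""
    return {"version": LIB_VERSION,
            "deprecated_calls": dict(_deprecated_calls)}
```

A migration campaign then becomes a query over these counters rather than a survey of teams: services still calling `sha1` identify themselves by emitting.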

Library-resident telemetry is wrong for:

  • Business-domain events (user signups, orders) — these belong to the service, not the library.
  • Per-request context — the library doesn't know the application's request structure; request-scoped context belongs to the caller, to be joined with library telemetry downstream if needed.
  • Cross-service correlation / tracing — the library may be called from many request contexts; use trace-propagation substrates, not library-side emission.
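The boundary between the two lists can be sketched directly. All names here are hypothetical: `lib_sign` stands in for a library primitive, and `handle_request` for a service handler. The library emits only capability-level facts; anything involving `request_id` or business semantics is emitted by the service, from its own code.

```python
# Two event streams, two owners.
library_events: list = []   # emitted by the library: capability usage only
service_events: list = []   # emitted by the service: request/business context

def lib_sign(payload: bytes) -> bytes:
    # The library knows nothing about the caller's request structure.
    library_events.append({"op": "sign", "payload_bytes": len(payload)})
    return payload[::-1]     # stand-in for a real signature

def handle_request(request_id: str, payload: bytes) -> None:
    sig = lib_sign(payload)
    # Request-scoped context is the service's to emit, not the library's.
    service_events.append({"request_id": request_id, "op": "signed",
                           "sig_bytes": len(sig)})
```

Keeping `request_id` out of the library's stream is deliberate: it keeps the library schema stable across every caller, which is exactly what makes the fleet-wide dataset analysable.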

Coverage is not 100%

Real fleets include legacy systems, non-standard stacks, and third-party code that don't link the unified library. Meta's post acknowledges this: "there are other cryptographic use cases across Meta that use a different set of tools for telemetry and data collection. More non-trivial work is needed to achieve full unification with all use cases." The pattern delivers most of its value at high (90%+) coverage; at 50% coverage the dataset is fragmented enough that polyculture-like analysis problems return.

This is why companies investing in unified-library-based observability simultaneously invest in retiring alternatives — the value scales with the monoculture share.

The pattern as strategic posture

This pattern is a consequence of the broader unified-library leverage strategic posture that Meta also applies to logging (Scribe), containers (Tupperware), time (PTP), and so on. Fleet telemetry is one highly visible payoff of that posture; deprecation-campaign feasibility and fleet-wide security-fix shipping are the sibling payoffs.
