Strobelight (Meta)

Strobelight is Meta's fleet-wide profiling orchestrator — not a single profiler but a scheduler + coordinator + symbolization frontend over 42 different profilers (at the time of the 2025-01-21 Meta Engineering post), many of them built on eBPF. It runs on every production host at Meta, provides a CLI + web UI for on-demand profiling, and accepts continuous / triggered profile configurations via Configerator. Partially open-sourced at github.com/facebookincubator/strobelight.

Canonical system shape

  • Orchestrator, not a profiler. Strobelight connects resource usage to source code; it schedules and coordinates profilers rather than being a profiler itself. Canonical wiki instance of the profiler-orchestrator pattern.
  • 42 profilers (and growing), covering:
    • Memory profilers powered by systems/jemalloc.
    • Function call-count profilers.
    • Event-based profilers (both native and non-native: Python, Java, Erlang).
    • AI / GPU profilers.
    • Off-CPU-time profilers.
    • Service request-latency profilers.
  • Three execution modes:
    • On-demand — engineers invoke via CLI or web UI; data visible in Scuba within seconds.
    • Continuous — default curated profilers run automatically on every host at tuned intervals/rates.
    • Triggered — profilers kick in on defined conditions.
  • Ad-hoc profilers via bpftrace scripts — engineers can ship a new profiler in hours rather than weeks, by committing a bpftrace script and telling Strobelight to run it like any other profiler. Canonical patterns/ad-hoc-bpftrace-profiler instance.
  • Dynamic sampling rate tuning — config specifies desired samples/hour per service (example: 40,000); Strobelight tunes per-service run probability daily to hit the target. Each sample's weight is recorded so aggregation across hosts + across services is mathematically valid.
  • Default continuous profiling — flight-recorder posture: always-on curated profilers so data is already there when an incident or efficiency question opens.
  • Safety + concurrency rules:
    • PMU counter coordination — only one CPU-cycles profiler at a time per host.
    • Profiler queue to serialise work.
    • DB-write rate controls protect the retention budget of downstream stores.
    • Operators can override these limits to deliberately hammer a machine during heavy debugging.
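The dynamic sampling-rate tuning described above can be sketched as follows. The function and variable names are assumptions; only the samples/hour target (e.g. 40,000) and the weight-per-sample bookkeeping come from the notes.

```python
# Hypothetical sketch of Strobelight-style dynamic sampling tuning.
# A config specifies a target number of samples/hour for a service; the
# orchestrator tunes a per-service run probability, and every recorded
# sample carries weight = 1/p so that sums remain unbiased when
# aggregated across hosts and across services.

def run_probability(target_samples_per_hour: float,
                    samples_per_hour_at_full_rate: float) -> float:
    """Run probability that hits the target sample count in expectation."""
    if samples_per_hour_at_full_rate <= 0:
        return 1.0
    return min(1.0, target_samples_per_hour / samples_per_hour_at_full_rate)

def sample_weight(p: float) -> float:
    """Horvitz-Thompson weight: a sample kept with probability p counts as 1/p."""
    return 1.0 / p

# Example: the config asks for 40,000 samples/hour, but profiling every run
# would yield 400,000/hour, so run with p = 0.1 and weigh each sample by 10.
p = run_probability(40_000, 400_000)
w = sample_weight(p)  # each recorded sample stands in for 10 events
```

Recording the weight alongside each sample is what makes cross-host and cross-service aggregation mathematically valid even though different services run at different probabilities.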

Load-bearing outputs

Default profilers worth calling out

  • LBR profiler — samples Intel Last Branch Records. Data is not visualised directly; it feeds Meta's FDO pipeline. FDO profiles drive compile-time (CSSPGO) and post-compile-time (BOLT) binary optimisations. Meta's top 200 largest services all have continuous-LBR-fed FDO profiles. Some see "up to 20% reduction in CPU cycles", translating into 10-20% fewer servers needed to run those services.
  • Event profiler — Strobelight's version of the Linux perf tool. Collects user + kernel stack traces on multiple perf events (CPU cycles, L3 misses, instructions, …). Output drives both interactive flame-graph review and automated regression-detection (pre-prod).
  • Crochet profiler — combines request spans + CPU-cycles stacks + off-CPU data on a single timeline; consumed in the Tracery UI.
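The step from the event profiler's stack samples to a flame graph is conventionally a "folded stacks" aggregation (Brendan Gregg's format, which flame-graph tooling consumes). The sketch below is an assumption about the shape of that step, not Strobelight's actual pipeline: identical symbolized stacks collapse into one line and their sample weights are summed.

```python
from collections import defaultdict

# Hypothetical sketch: collapse symbolized, weighted samples into
# folded-stack entries ("frame;frame;frame -> total weight"), the input
# form that standard flame-graph tooling expects.

def fold(samples):
    """samples: iterable of (frames leaf-last, sample weight)."""
    totals = defaultdict(float)
    for frames, weight in samples:
        totals[";".join(frames)] += weight
    return dict(totals)

samples = [
    (["main", "serve", "encode"], 10.0),
    (["main", "serve", "encode"], 10.0),
    (["main", "serve", "compress"], 10.0),
]
folded = fold(samples)
# folded["main;serve;encode"] == 20.0
```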

Stack enrichment

  • Stack Schemas — DSL (inspired by Microsoft's stack tags) that adds tags to whole stacks or individual frames and regex-strips frames the viewer doesn't care about. Any number of schemas apply per service or per profile.
  • Strobemeta — thread-local-storage mechanism to attach runtime metadata (request IDs, endpoint names, latency buckets, …) to call stacks at sample time via eBPF. Makes request-context-aware profiling possible — e.g. "stacks for p99 latency requests only" — without post-hoc join-to-other-telemetry.
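A minimal sketch of both enrichment ideas, with an invented schema representation (the real Stack Schemas DSL is not spelled out in these notes): regex-driven frame stripping and stack tagging, plus filtering samples by Strobemeta-like metadata that was captured at sample time.

```python
import re

# Hypothetical sketch of Stack Schemas-style enrichment (schema format
# invented for illustration): strip frames the viewer doesn't care about,
# tag whole stacks, and filter samples by metadata attached at sample time.

STRIP = [re.compile(r"^folly::"), re.compile(r"^__libc")]
TAGS = {"ads_pipeline": re.compile(r"^ads::")}

def apply_schema(frames):
    kept = [f for f in frames if not any(p.search(f) for p in STRIP)]
    tags = {t for t, p in TAGS.items() if any(p.search(f) for f in frames)}
    return kept, tags

def p99_only(samples):
    # Strobemeta turns "stacks for p99-latency requests only" into a plain
    # filter on the sample stream -- no post-hoc join to other telemetry.
    return [s for s in samples if s["meta"].get("latency_bucket") == "p99"]

frames, tags = apply_schema(
    ["__libc_start_main", "main", "ads::rank", "folly::detail::x"])
# frames == ["main", "ads::rank"], tags == {"ads_pipeline"}
```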

Output surfaces

  • Scuba — the primary data + UI surface; flame graphs, pie charts, time-series, distributions, free-form query.
  • Tracery — trace-timeline tool; client-side columnar DB in JavaScript for responsive zoom + filter on large samples; renders the Crochet profiler's output, among others.
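The columnar layout behind Tracery's responsiveness can be illustrated with a toy example (in Python rather than JavaScript, and not Tracery's actual implementation): each field is one flat array, a filter scans a single column and yields row indices, and other columns are touched only for the surviving rows.

```python
# Toy columnar table: rows exist only as indices into flat per-field lists.
# Filtering scans one column; materialising results touches other columns
# only at the surviving indices -- the layout idea behind fast zoom+filter.

class ColumnarTable:
    def __init__(self, **columns):
        self.columns = columns

    def where(self, column, predicate):
        """Return row indices whose value in `column` satisfies `predicate`."""
        return [i for i, v in enumerate(self.columns[column]) if predicate(v)]

    def take(self, column, indices):
        """Gather `column` values at the given row indices."""
        col = self.columns[column]
        return [col[i] for i in indices]

spans = ColumnarTable(
    name=["rank", "encode", "compress"],
    duration_us=[120, 950, 40],
)
slow = spans.where("duration_us", lambda d: d > 100)
# spans.take("name", slow) == ["rank", "encode"]
```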

Symbolization service

Operational numbers

  • 42 profilers orchestrated.
  • Top 200 services served continuous LBR → FDO → binary-optimisation.
  • Up to 20% CPU-cycles reduction per optimised service.
  • ~15,000 servers/year saved by a one-character fix ("The Biggest Ampersand"): adding a single `&` to avoid a hot-path std::vector copy in an ads service. An instance of Scuba-query-driven performance triage enabled by the symbolized file-and-line data Strobelight captures.

Open-source status

"We're currently working on open-sourcing Strobelight's profilers and libraries." Incubator org at github.com/facebookincubator/strobelight. Several supporting libraries are already open-source: systems/bpftrace, systems/blazesym, systems/jemalloc, and BOLT.
