Strobelight (Meta)¶
Strobelight is Meta's fleet-wide profiling orchestrator — not a single profiler but a scheduler + coordinator + symbolization frontend over 42 different profilers (at the time of the 2025-01-21 Meta Engineering post), many of them built on eBPF. It runs on every production host at Meta, provides a CLI + web UI for on-demand profiling, and accepts continuous / triggered profile configurations via Configerator. Partially open-sourced at github.com/facebookincubator/strobelight.
Canonical system shape¶
- Orchestrator, not a profiler. Strobelight connects resource usage to source code; it schedules and coordinates profilers rather than being a profiler itself. Canonical wiki instance of the profiler-orchestrator pattern.
- 42 profilers (and growing), covering:
- Memory profilers powered by systems/jemalloc.
- Function call-count profilers.
- Event-based profilers (both native and non-native: Python, Java, Erlang).
- AI / GPU profilers.
- Off-CPU-time profilers.
- Service request-latency profilers.
- Three execution modes:
- On-demand — engineers invoke via CLI or web UI; data visible in Scuba within seconds.
- Continuous — default curated profilers run automatically on every host at tuned intervals/rates.
- Triggered — profilers kick in on defined conditions.
- Ad-hoc profilers via bpftrace scripts — engineers can ship a new profiler in hours rather than weeks, by committing a bpftrace script and telling Strobelight to run it like any other profiler. Canonical patterns/ad-hoc-bpftrace-profiler instance.
- Dynamic sampling rate tuning — config specifies desired samples/hour per service (example: 40,000); Strobelight tunes per-service run probability daily to hit the target. Each sample's weight is recorded so aggregation across hosts + across services is mathematically valid.
- Default continuous profiling — flight-recorder posture: always-on curated profilers so data is already there when an incident or efficiency question opens.
- Safety + concurrency rules:
- PMU counter coordination — only one CPU-cycles profiler at a time per host.
- Profiler queue to serialise work.
- DB-write rate controls protect the retention budget of downstream stores.
- Operators can still override these limits and load a machine heavily when a debugging session needs a large volume of profile data.
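The dynamic sampling rate tuning described above can be sketched roughly as follows. This is a minimal illustration, not Strobelight's implementation: the function names, the proportional update rule, and the choice of weight = 1 / run probability (inverse-probability weighting) are all assumptions made for the example. The source only states that a per-service run probability is tuned daily toward a samples/hour target and that each sample's weight is recorded.

```python
def tune_run_probability(current_probability, observed_samples_per_hour,
                         target_samples_per_hour):
    """Hypothetical daily update: scale the run probability so the observed
    rate moves toward the target, then clamp the result to [0, 1]."""
    if observed_samples_per_hour <= 0:
        # Assumption: with no samples observed, always run until data arrives.
        return 1.0
    scaled = current_probability * target_samples_per_hour / observed_samples_per_hour
    return max(0.0, min(1.0, scaled))


def sample_weight(run_probability):
    """Assumed weighting: a sample taken with probability p stands in for
    1 / p samples that would have been seen with always-on profiling."""
    if run_probability <= 0:
        raise ValueError("run probability must be positive to weight a sample")
    return 1.0 / run_probability


def weighted_total(samples):
    """Sum the weights of (stack, weight) pairs so that services sampled at
    different probabilities can be aggregated on a common scale."""
    return sum(weight for _stack, weight in samples)


# A service producing 80,000 samples/hour at p = 0.5 with a 40,000 target
# is tuned down to p = 0.25, so each of its samples then carries weight 4.
new_p = tune_run_probability(0.5, 80_000, 40_000)
service_a = [("main;parse", sample_weight(new_p))] * 3
service_b = [("main;serve", sample_weight(1.0))] * 2
combined = weighted_total(service_a + service_b)
```

Recording the weight alongside each sample is what lets a query sum stacks from a heavily sampled service and a lightly sampled one without the former dominating the result.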
Load-bearing outputs¶
Default profilers worth calling out¶
- LBR profiler — samples Intel Last Branch Records. Data is not visualised directly; it feeds Meta's FDO pipeline. FDO profiles drive compile-time (CSSPGO) and post-compile-time (BOLT) binary optimisations. Meta's top 200 largest services all have continuous-LBR-fed FDO profiles. Some of these services see "up to 20% reduction in CPU cycles", which translates to 10-20% fewer servers needed to run them.
- Event profiler — Strobelight's version of the Linux perf tool. Collects user + kernel stack traces on multiple perf events (CPU cycles, L3 misses, instructions, …). Output drives both interactive flame-graph review and automated regression-detection (pre-prod).
- Crochet profiler — combines request spans + CPU-cycles stacks + off-CPU data on a single timeline; consumed in the Tracery UI.
Stack enrichment¶
- Stack Schemas — DSL (inspired by Microsoft's stack tags) that adds tags to whole stacks or individual frames and regex-strips frames the viewer doesn't care about. Any number of schemas apply per service or per profile.
- Strobemeta — thread-local-storage mechanism to attach runtime metadata (request IDs, endpoint names, latency buckets, …) to call stacks at sample time via eBPF. Makes request-context-aware profiling possible — e.g. "stacks for p99 latency requests only" — without post-hoc join-to-other-telemetry.
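As a rough illustration of what a Stack Schema does to a sampled stack, the sketch below strips frames matching a set of regexes and derives tags from the frames that remain. The real Stack Schemas DSL is not shown in the source, so the `StackSchema` class, its fields, and the sample frame names are hypothetical stand-ins chosen only for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StackSchema:
    """Hypothetical schema: regexes for frames to drop, plus regex -> tag rules."""
    strip_patterns: list = field(default_factory=list)
    frame_tags: dict = field(default_factory=dict)


def apply_schema(stack, schema):
    """Return (kept_frames, tags): frames surviving the strip rules, and the
    sorted set of tags whose pattern matched at least one kept frame."""
    kept = [
        frame for frame in stack
        if not any(re.search(pattern, frame) for pattern in schema.strip_patterns)
    ]
    tags = set()
    for frame in kept:
        for pattern, tag in schema.frame_tags.items():
            if re.search(pattern, frame):
                tags.add(tag)
    return kept, sorted(tags)


schema = StackSchema(
    strip_patterns=[r"^_start$", r"^__libc_"],   # runtime startup noise
    frame_tags={r"^zstd_": "compression", r"^folly::": "folly"},
)
stack = ["_start", "__libc_start_main", "main",
         "folly::Function::call", "zstd_compress"]
kept_frames, stack_tags = apply_schema(stack, schema)
```

Because several schemas may apply to one service or profile, a caller could run `apply_schema` once per schema and union the resulting tags; that composition rule is likewise an assumption here.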
Output surfaces¶
- Scuba — the primary data + UI surface; flame graphs, pie charts, time-series, distributions, free-form query.
- Tracery — trace-timeline tool; client-side columnar DB in JavaScript for responsive zoom + filter on large samples; consumed for the Crochet profiler among others.
Symbolization service¶
- Delayed symbolization service — raw addresses + frame-pointer-unwound stacks are sent to a central service; DWARF / ELF / gsym / blazesym pre-indexed over all Meta production binaries; returns function + file + line + type info (including inlines).
- Frame pointers enabled on all Meta user-space binaries — the platform precondition that makes stack-walk cheap at fleet scale.
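The delayed-symbolization idea, where hosts ship raw addresses and a central service resolves them against a pre-built index, can be sketched as below. This is a simplified model under stated assumptions: the `SymbolIndex` class, the `(build_id, offset)` frame representation, and the in-memory sorted table are hypothetical. The actual service resolves against DWARF / ELF / gsym data and also returns file, line, and inline information, which this sketch omits.

```python
import bisect


class SymbolIndex:
    """Hypothetical pre-built index for one binary: sorted (start, size, name)."""

    def __init__(self, symbols):
        self._symbols = sorted(symbols)
        self._starts = [start for start, _size, _name in self._symbols]

    def lookup(self, address):
        """Return the name of the function whose range covers address, or None."""
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        start, size, name = self._symbols[i]
        return name if address < start + size else None


def symbolize(raw_stack, indexes):
    """Resolve a list of (build_id, offset) frames using per-binary indexes.
    Frames that cannot be resolved are kept as readable placeholders."""
    resolved = []
    for build_id, offset in raw_stack:
        index = indexes.get(build_id)
        name = index.lookup(offset) if index is not None else None
        resolved.append(name or f"[unknown {build_id}+{offset:#x}]")
    return resolved


index = SymbolIndex([(0x2000, 0x80, "parse"), (0x1000, 0x100, "main")])
frames = symbolize([("abc", 0x1010), ("abc", 0x2000), ("zzz", 0x10)],
                   {"abc": index})
```

Keeping the lookup off the profiled host means the host only pays for the frame-pointer walk, while the comparatively expensive debug-info parsing happens once per binary in the central service.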
Operational numbers¶
- 42 profilers orchestrated.
- Top 200 largest services covered by the continuous LBR → FDO → binary-optimisation pipeline.
- Up to 20% CPU-cycles reduction per optimised service.
- ~15,000 servers/year saved by a one-character & fix ("The Biggest Ampersand") on a hot-path std::vector copy in an ads service — an instance of Scuba-query-driven performance triage enabled by the symbolized-file-and-line data Strobelight captures.
Open-source status¶
"We're currently working on open-sourcing Strobelight's profilers and libraries." Incubator org at github.com/facebookincubator/strobelight. Several supporting libraries are already open-source: systems/bpftrace, systems/blazesym, systems/jemalloc, and BOLT.
Seen in¶
- sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology — the canonical Meta Engineering introduction to Strobelight (2025-01-21; Production Engineering).
Related¶
- systems/ebpf — the load-bearing kernel primitive.
- systems/bpftrace — the ad-hoc-profiler substrate.
- systems/jemalloc — memory-profiler backend.
- systems/meta-bolt-binary-optimizer — post-compile FDO consumer.
- systems/tracery-meta — secondary visualisation surface.
- systems/meta-configerator — config substrate for continuous / triggered profiling.
- systems/blazesym, systems/gsym — symbolization libraries / format.
- systems/scuba-meta — primary output store + UI.
- patterns/profiler-orchestrator — the canonical pattern Strobelight instantiates.
- patterns/feedback-directed-optimization-fleet-pipeline — the economic engine.
- patterns/ad-hoc-bpftrace-profiler — the velocity multiplier.
- patterns/delayed-symbolization-service — the scale-out symbolization architecture.
- patterns/default-continuous-profiling — the flight-recorder posture.
- concepts/ebpf-profiling, concepts/dynamic-sampling-rate-tuning, concepts/delayed-symbolization, concepts/frame-pointer-unwinding, concepts/ad-hoc-profiler, concepts/stack-tag-enrichment, concepts/runtime-metadata-attach
- companies/meta