META 2025-03-07 Tier 1

Meta — Strobelight: A profiling service built on open source technology

Summary

A 2025-01-21 Meta Engineering post (Production Engineering) describing Strobelight, Meta's fleet-wide profiling orchestrator — not a single profiler but a coordinator of 42 different profilers, many built on eBPF, running on every production host at Meta. The post makes four load-bearing architectural claims: (1) profiling is orchestrated, not monolithic — Strobelight is a scheduler + coordinator that runs curated default profilers continuously (the "flight recorder" model) while also accepting on-demand and triggered profiles from engineers via a CLI / web UI and a continuous / triggered config committed to Configerator; (2) engineers can ship new profilers in hours via bpftrace scripts, rather than the multi-week code-change + review cycle a new in-tree profiler would require — canonical ad-hoc-bpftrace-profiler instance; (3) dynamic sampling rate tuning keeps per-service sample counts at a configured target (e.g. 40,000 CPU-cycles samples/hour) by adjusting run probability daily per service, and sample weights are recorded so counts normalise across hosts + services when aggregated — this is what makes fleet-level cross-service efficiency analysis tractable; (4) symbolization is a service, not a per-host step — raw addresses + frame-pointer-unwound stacks are sent to a central symbolization service using DWARF + ELF + gsym + blazesym internally, pre-populated from all Meta production binaries. Capacity impact: up to 20% CPU-cycle reduction (10-20% fewer servers) on Meta's top 200 services from a single continuous profiler feeding the FDO pipeline (BOLT post-compile + CSSPGO compile-time); a one-character C++ "ampersand" fix (auto → auto&), spotted by filtering std::vector call-sites in Scuba, saved an estimated 15,000 servers per year on a single ads service. Open-source status: Meta is "currently working on open-sourcing Strobelight's profilers and libraries" at github.com/facebookincubator/strobelight.

Key takeaways

  1. Strobelight is a profiling orchestrator, not a profiler. "Strobelight... is also not a single profiler but an orchestrator of many different profilers (even ad-hoc ones) that runs on all production hosts at Meta." As of post time, 42 different profilers are coordinated including memory (jemalloc-backed), function call count, event-based (perf-events for C++/Python/Java/Erlang), AI/GPU, off-CPU, service-request-latency. This framing makes Strobelight the canonical wiki instance of profiler-orchestrator — a scheduler + coordinator + queuer + symbolization-frontend that sits above the per-profiler tools. (Source; systems/strobelight)

  2. eBPF is the load-bearing kernel primitive. "Strobelight's profilers are often, but not exclusively, built using eBPF... eBPF allows the safe injection of custom code into the kernel, which enables very low overhead collection of different types of data and unlocks so many possibilities in the observability space that it's hard to imagine how Strobelight would work without it." Canonical Meta instance of eBPF-based profiling — different in framing from the eBPF-for-security (Datadog Workload Protection) and eBPF-for-networking (Fly.io Sprites, Cloudflare DDoS) instances already on the wiki. (Source; systems/ebpf)

  3. Ad-hoc profilers via bpftrace — hours, not weeks. "Adding a new profiler from scratch to Strobelight involves several code changes and could take several weeks to get reviewed and rolled out. However, engineers can write a single bpftrace script... and tell Strobelight to run it like it would any other profiler. An engineer that really cares about the latency of a particular C++ function, for example, could write up a little bpftrace script, commit it, and have Strobelight run it on any number of hosts throughout Meta's fleet – all within a matter of hours, if needed." Canonical wiki instance of ad-hoc-bpftrace-profiler — a small DSL becomes the escape hatch that makes a centralised orchestrator feel uncentralised. (Source; systems/bpftrace)
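The post doesn't include such a script; a minimal bpftrace sketch of the kind of function-latency probe it describes could look like the following. The binary path and function name are hypothetical, and a real C++ target would typically need the mangled symbol name:

```bpftrace
// Histogram of a single function's latency. /usr/local/bin/my_service and
// MyHotFunction are illustrative names, not from the post.
uprobe:/usr/local/bin/my_service:MyHotFunction
{
    @start[tid] = nsecs;   // entry timestamp, keyed by thread id
}

uretprobe:/usr/local/bin/my_service:MyHotFunction
/@start[tid]/
{
    @latency_ns = hist(nsecs - @start[tid]);   // power-of-2 latency buckets
    delete(@start[tid]);
}
```

Committed through Strobelight's config, a script of this shape would be scheduled, rate-limited, and collected like any built-in profiler, which is the whole point of the escape hatch.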

  4. Default continuous profiling — the "flight recorder" model. "One of Strobelight's core principles has been to provide automatic, regularly-collected profiling data for all of Meta's services. It's like a flight recorder – something that doesn't have to be thought about until it's needed. What's worse than waking up to an alert that a service is unhealthy and there is no data as to why?" A handful of curated profilers run automatically on every Meta host — not continuously, but at run intervals + sampling rates tuned to the workload. Canonical wiki instance of default-continuous-profiling — different in posture from on-demand profiling cultures: Meta pays the upfront cost so the data is always there. (Source)

  5. Dynamic sampling rate tuning — feedback control on run probability. Config expresses desired samples per hour (e.g. 40,000); Strobelight knows how many hosts the service runs on but not how CPU-intensive it is, so it starts conservative and tunes run probability daily per service to hit the target. Sample weights are recorded on emission so data can be aggregated / compared across hosts (where sampling rates differ) and across services (ditto). "Even if Strobelight is profiling Soft Server less often on one host than on another, the samples can be accurately compared and grouped. This also works for comparing two different services." Canonical wiki instance of dynamic sampling rate tuning + weight-based normalisation — the mechanism that makes cross-fleet "horizontal wins" in shared libraries analytically feasible. (Source)
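The post gives only the target (samples/hour) and says run probability is re-tuned daily with "some very simple math"; the formula and bounds below are assumptions, not Meta's. A minimal Python sketch of the feedback step and the 1/p weight normalisation:

```python
# Sketch (assumptions): Meta describes the goal, hit a per-service
# samples/hour target by re-tuning run probability daily, and record
# per-sample weights so counts normalise across hosts and services,
# but not the formula. The math and bounds here are invented.

def retune(run_probability, observed_per_hour, target_per_hour,
           floor=0.001, ceiling=1.0):
    """Daily feedback step: scale run probability toward the sample target."""
    if observed_per_hour == 0:
        return min(ceiling, run_probability * 2)  # ramp up cautiously
    adjusted = run_probability * (target_per_hour / observed_per_hour)
    return max(floor, min(ceiling, adjusted))

def weighted_count(sample_probabilities):
    """Each sample counts as 1/p, so hosts sampled at different rates agree."""
    return sum(1.0 / p for p in sample_probabilities)

# Overshooting the 40,000/hour target halves the probability; undershooting
# raises it (capped at always-run).
p = retune(0.5, observed_per_hour=80_000, target_per_hour=40_000)  # -> 0.25
# 50 samples at p=0.5 and 10 samples at p=0.1 both estimate ~100 real events,
# which is why "Soft Server" samples compare across differently-rated hosts.
host_a = weighted_count([0.5] * 50)
host_b = weighted_count([0.1] * 10)
```

The weight recorded at emission time is what makes the later aggregation host- and service-agnostic; without it, a host profiled at p=0.1 would look 5x cheaper than one profiled at p=0.5.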

  6. LBR profiler → FDO → BOLT + CSSPGO — 10-20% server reduction at top-200 scale. The Last Branch Record profiler samples Intel LBRs and feeds Meta's feedback-directed optimization (FDO) pipeline. FDO profiles are consumed at compile time (CSSPGO) and post-compile time (BOLT, Meta's open-source binary optimizer). "Meta's top 200 largest services all have FDO profiles from the LBR data gathered continuously across the fleet. Some of these services see up to 20% reduction in CPU cycles, which equates to a 10-20% reduction in the number of servers needed to run these services at Meta." Canonical wiki instance of feedback-directed-optimization-fleet-pipeline: raw-profile-data → FDO profiles → compile/post-compile binary optimisation → production deploy → closed loop with more profiling. The economic case for Strobelight is carried by this one pipeline. (Source; systems/meta-bolt-binary-optimizer)

  7. Event profiler — Meta's internal perf. "This is Strobelight's version of the Linux perf tool. Its primary job is to collect user and kernel stack traces from multiple performance (perf) events e.g., CPU-cycles, L3 cache misses, instructions, etc." Data is both user-facing (flame-graphs for hottest functions + call paths) and fed into monitoring and testing tools to identify regressions ideally before they hit production. Shift-left on performance regressions. (Source)

  8. Stack Schemas — per-stack DSL for enrichment + filtering. Inspired by Microsoft's stack tags. A small DSL operating on call stacks that can add tags to entire call stacks or individual frames, and remove functions users don't care about via regex. Any number of schemas can be applied per-service or per-profile to customise visualisations. Canonical wiki instance of stack-tag enrichment. Used for dashboards that surface expensive copying, use of inefficient or inappropriate C++ containers, overuse of smart pointers, etc. (Source)
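The real Stack Schemas grammar isn't shown in the post; as a Python analogue under that caveat, the two named operations (remove frames by regex, tag stacks or frames) reduce to something like:

```python
import re

# Sketch: the post describes Stack Schemas only as a DSL that tags whole
# stacks or individual frames and removes frames by regex; the grammar is
# not published, so this function and its data shapes are invented.

def apply_schema(stack, remove_patterns=(), frame_tags=()):
    """Drop frames matching any removal regex, then tag surviving frames."""
    kept = [dict(f) for f in stack
            if not any(re.search(p, f["name"]) for p in remove_patterns)]
    for frame in kept:
        for pattern, tag in frame_tags:
            if re.search(pattern, frame["name"]):
                frame.setdefault("tags", []).append(tag)
    return kept

stack = [
    {"name": "__libc_start_main"},        # noise users don't care about
    {"name": "AdServer::handle"},
    {"name": "std::vector<Ad>::vector"},  # copy construction on a hot path
]
cleaned = apply_schema(
    stack,
    remove_patterns=[r"^__libc"],
    frame_tags=[(r"std::vector<.*>::vector", "expensive-copy")],
)
```

Dashboards like the expensive-copies one are then just aggregations grouped by such tags rather than by raw frame names.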

  9. Strobemeta — runtime metadata attached to stacks via TLS. "Strobemeta... utilizes thread local storage, to attach bits of dynamic metadata at runtime to call stacks that we gather in the event profiler." Meta notes this is "one of the biggest advantages of building profilers using eBPF: complex and customized actions taken at sample time." Collected Strobemeta is used to attribute call stacks to specific service endpoints, request latency metrics, or request identifiers. Canonical wiki instance of runtime-metadata-attach — the step past call-graph-only profiling into request-context-aware profiling. (Source)
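In user-space terms, the flow is: the service writes request context into thread-local storage, and the profiler reads it at sample time. A Python analogue of that data flow (field names and API invented; the real mechanism is an eBPF read of TLS at sample time, not a Python call):

```python
import threading

# Sketch (assumptions): Strobemeta's actual schema and wire format are not
# published; this only shows the shape of TLS-carried request context being
# attached to a sampled stack.
_request_ctx = threading.local()

def set_request_context(endpoint, request_id):
    """Called by the service's request handler before doing work."""
    _request_ctx.endpoint = endpoint
    _request_ctx.request_id = request_id

def take_sample(stack):
    """The 'profiler' attaches whatever metadata the sampled thread has set."""
    return {
        "stack": stack,
        "endpoint": getattr(_request_ctx, "endpoint", None),
        "request_id": getattr(_request_ctx, "request_id", None),
    }

set_request_context("/ads/rank", "req-123")
sample = take_sample(["main", "AdServer::handle", "AdRanker::score"])
# The sample now carries enough context to group stacks by service endpoint.
```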

  10. Delayed symbolization via a central service. Symbolization = turn instruction virtual addresses into function name + file + line + type info. DWARF debug info "can be many megabytes (or even gigabytes)" and parsing on-host "is far too computationally expensive. Even with optimal caching strategies it can cause memory issues for the host's workloads." Solution: (a) delay symbolization until after profile collection, store raw addresses on disk to keep the eBPF producer decoupled from the user-space consumer (dropped samples if consumer can't keep up → bad); (b) run a central symbolization service using blazesym + gsym + DWARF + ELF, pre-populated from all Meta production binaries, answering symbolization requests from every profiler instance fleet-wide. Canonical wiki instance of delayed symbolization service + delayed-symbolization. (Source)
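The post describes the shape of the flow (raw addresses stored at collection time, resolved later against a central pre-built index) but not the index layout or service API; both are invented in this minimal Python sketch of the lookup side:

```python
import bisect

# Sketch (assumptions): index layout, class names, and all addresses below
# are illustrative; the real service resolves via DWARF/ELF/gsym/blazesym.

class SymbolIndex:
    """Maps instruction addresses to (function, file, line) via sorted ranges."""
    def __init__(self, symbols):
        # symbols: iterable of (start_addr, end_addr, function, file, line)
        self._symbols = sorted(symbols)
        self._starts = [s[0] for s in self._symbols]

    def symbolize(self, addr):
        i = bisect.bisect_right(self._starts, addr) - 1
        if i >= 0:
            start, end, func, path, line = self._symbols[i]
            if start <= addr < end:
                return func, path, line
        return None  # address not covered by any known binary

# Server-side, pre-populated from production binaries' debug info.
index = SymbolIndex([
    (0x1000, 0x1080, "std::vector<int>::operator=", "vector.h", 712),
    (0x1080, 0x1200, "AdRanker::score", "ranker.cpp", 88),
])
raw_stack = [0x10A4, 0x1010]  # frame-pointer-unwound addresses from a host
symbolized = [index.symbolize(a) for a in raw_stack]
```

Keeping only raw addresses on the host is what decouples the eBPF producer from the user-space consumer: the hot path writes small integers to disk, and the heavyweight DWARF work happens once, centrally, per binary.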

  11. Frame pointers everywhere. "All of this is made possible with the inclusion of frame pointers in all of Meta's user space binaries, otherwise we couldn't walk the stack to get all these addresses (or we'd have to do some other complicated/expensive thing which wouldn't be as efficient)." Canonical wiki echo of Brendan Gregg's "The Return of the Frame Pointers" — Meta pays the 1-2% register-pressure tax on every binary to make fleet-wide profiling feasible. frame-pointer-unwinding. (Source)
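Why frame pointers make the walk cheap: on x86-64 with frame pointers enabled, each frame stores the caller's saved frame pointer at [rbp] and the return address at [rbp+8], so unwinding is a simple pointer chase. A simulation over an invented memory image (all addresses hypothetical):

```python
# Sketch: simulated 64-bit stack memory; every address below is invented.
# With frame pointers, each frame holds the caller's saved rbp at [rbp]
# and the return address at [rbp+8].
memory = {
    0x7F00: 0x7F40, 0x7F08: 0x4011AA,  # innermost frame
    0x7F40: 0x7F80, 0x7F48: 0x401090,  # its caller
    0x7F80: 0x0,    0x7F88: 0x401020,  # outermost: saved rbp of 0 terminates
}

def unwind(rbp, max_frames=128):
    """Collect return addresses by chasing saved frame pointers."""
    stack = []
    while rbp and len(stack) < max_frames:
        stack.append(memory[rbp + 8])  # return address sits above saved rbp
        rbp = memory[rbp]              # hop to the caller's frame
    return stack

addresses = unwind(0x7F00)  # innermost-to-outermost raw addresses
```

Without the saved-rbp chain, the unwinder would need DWARF CFI or a shadow-stack scheme on the hot sampling path, the "complicated/expensive thing" the post alludes to.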

  12. Scuba is the primary output substrate; Tracery is the secondary timeline UI. "The primary tool Strobelight customers use is Scuba – a query language (like SQL), database, and UI." On-demand profile data is visualisable in Scuba "a few seconds" after collection with flame graphs, pie charts, time series, distributions. Tracery is the trace-timeline UI for the Crochet profiler which combines service request spans + CPU-cycles stacks + off-CPU data into one timeline. Powered by a client-side columnar database written in JavaScript for fast zoom + filter. (Source)

  13. The Biggest Ampersand — one-character fix, ~15,000 servers/year. "A seasoned performance engineer was looking through Strobelight data and discovered that by filtering on a particular std::vector function call (using the symbolized file and line number) he could identify computationally expensive array copies that happen unintentionally with the 'auto' keyword in C++." Spotted one such copy on a hot path in a large Meta ads service; added & after auto to make it a reference; "one-character commit... equated to an estimated 15,000 servers in capacity savings per year!" Canonical operational example of the horizontal win dynamic enabled by per-file-line symbolization + fleet-wide profiling + normalised sample weights. (Source)
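Scuba queries aren't shown in the post; a plain-Python reproduction of the workflow (sample data invented), filtering symbolized samples to std::vector copy-construction frames and ranking call-sites by weighted cost:

```python
import re
from collections import Counter

# Sketch (assumptions): invented samples standing in for Scuba rows of
# (function, symbolized file:line, sample weight = 1/run-probability).
samples = [
    ("std::vector<Ad>::vector", "ad_ranker.cpp:88", 20.0),
    ("std::vector<Ad>::vector", "ad_ranker.cpp:88", 20.0),
    ("std::vector<int>::vector", "feed.cpp:310", 2.0),
    ("AdRanker::score", "ad_ranker.cpp:91", 20.0),
]

# Copy-constructor frames are the signature of an unintended `auto` copy.
copy_ctor = re.compile(r"^std::vector<.*>::vector$")
cost = Counter()
for func, callsite, weight in samples:
    if copy_ctor.match(func):
        cost[callsite] += weight  # weighted cycles attributed to copying

hottest = cost.most_common(1)[0]  # the call-site worth an `auto&` fix
```

The per-file-line symbolization supplies the group-by key, and the recorded weights make the ranking meaningful across hosts, which is exactly the combination the "ampersand" win depended on.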

  14. Safety is engineered-in. Profilers can "cause performance degradation for the targeted workloads and retention issues for the databases Strobelight writes to." Named safeguards: PMU counter coordination (only one CPU-cycles profiler at a time), profiler queuing system + concurrency rules, rate controls on DB writes. Owners can still "really hammer their machines if they want to extract a lot of data to debug." Trust-the-operator escape hatch. (Source)
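The post names the safeguards (PMU-counter coordination, a profiler queue, concurrency rules) but not their mechanics; this scheduler shape is invented to show the idea of mutually exclusive hardware-counter users queuing while non-PMU profilers run concurrently:

```python
from collections import deque

# Sketch (assumptions): class structure and rules are illustrative only.
class ProfilerScheduler:
    def __init__(self):
        self.pmu_holder = None  # only one CPU-cycles profiler at a time
        self.queue = deque()
        self.running = set()

    def request(self, name, needs_pmu):
        """Start a profiler now, or queue it if the PMU is taken."""
        if needs_pmu and self.pmu_holder is not None:
            self.queue.append((name, needs_pmu))
            return False
        if needs_pmu:
            self.pmu_holder = name
        self.running.add(name)
        return True

    def finish(self, name):
        self.running.discard(name)
        if self.pmu_holder == name:
            self.pmu_holder = None
            if self.queue:  # hand the PMU to the next waiter
                self.request(*self.queue.popleft())

sched = ProfilerScheduler()
sched.request("event-cpu-cycles", needs_pmu=True)
started = sched.request("lbr", needs_pmu=True)  # queued: PMU busy
sched.request("offcpu", needs_pmu=False)        # runs concurrently
sched.finish("event-cpu-cycles")                # lbr now starts
```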

Architectural numbers + operational notes (from source)

  • 42 different profilers orchestrated at post time.
  • Top 200 largest Meta services have continuous-profile-backed FDO profiles.
  • Up to 20% CPU-cycles reduction (10-20% fewer servers) on top-200 services from LBR + FDO + BOLT + CSSPGO.
  • ~15,000 servers/year saved by the one-character ampersand (&) fix on a large ads service.
  • Sampling rates are per-service configurable (e.g. 40,000 samples/hour); Strobelight daily-tunes run probability to hit the target.
  • Config example (from the post):
    add_continuous_override_for_offcpu_data(
        "my_awesome_team",
        Type.SERVICE_ID,
        "my_awesome_service",
        30_000,  // desired samples per hour
    )
    
  • Frame pointers enabled on all Meta user-space binaries — Meta's buy-in to Brendan Gregg's 2024 framing.
  • Open-sourcing effort underway: github.com/facebookincubator/strobelight.
  • No total sample-rate-across-fleet number disclosed.
  • No symbolization-service QPS / latency disclosed.
  • No Scuba row-count / retention numbers.
  • No breakdown of the 42 profilers by category.

Systems / hardware extracted

New wiki pages:

  • systems/strobelight — Meta's profiling orchestrator. The canonical system page for the post.
  • systems/bpftrace — open-source eBPF scripting language; Strobelight's ad-hoc-profiler substrate.
  • systems/jemalloc — Meta-originated malloc implementation; the memory-profiler backend in Strobelight.
  • systems/meta-bolt-binary-optimizer — Meta's open-source binary optimiser; post-compile FDO consumer.
  • systems/tracery-meta — Meta's internal trace-timeline visualisation tool (stub).
  • systems/meta-configerator — Meta's configuration management system (stub); Strobelight uses it for continuous/triggered profile config.
  • systems/blazesym — Meta's open-source multi-language symbolization library.
  • systems/gsym — compact DWARF-derived symbolization format Meta uses in the symbolization service.

Existing pages reinforced:

  • systems/ebpf — extended with the Strobelight Seen-in citation. eBPF-for-profiling becomes the third fleet-wide-production use-case on the wiki alongside eBPF-for-security (Datadog Workload Protection FIM) and eBPF-for-networking (Cloudflare DDoS / Fly.io Sprites).
  • systems/scuba-meta — extended with Strobelight as a producer. Now the canonical warm-tier substrate for both cryptographic monitoring (FBCrypto) and profiling (Strobelight) at Meta — two different upstream shapes feeding the same warm-query surface.

Concepts + patterns extracted

New concept pages: none.

Existing concept extended:

  • concepts/flamegraph-profiling — existing page previously cited Fly.io rust-proxy case; extended with Meta/Strobelight as the canonical hyperscale producer of flamegraph-ready data (tens of petabytes of raw samples per day implied but not disclosed).
  • concepts/stack-trace-sampling-profiling — existing page cited Cloudflare/Pingora; extended with Meta/Strobelight as the fleet-orchestrated instantiation, including the dynamic-sampling-rate + weight-normalisation machinery that makes cross-service comparison valid.

New pattern pages:

  • patterns/profiler-orchestrator — centralised scheduler + queuer + symbolization frontend over many profilers; first-class config, concurrency rules, rate controls, on-demand + continuous + triggered modes.
  • patterns/ad-hoc-bpftrace-profiler — let engineers ship small eBPF scripts through the orchestrator's pipeline instead of requiring full code-change reviews.
  • patterns/delayed-symbolization-service — raw samples to disk first, central service resolves addresses → functions + file + line + type info via pre-indexed DWARF + ELF + gsym + blazesym.
  • patterns/default-continuous-profiling — run curated profilers on every host continuously-but-sparsely, tuned to not perturb workloads, so data is always there when an incident opens.
  • patterns/feedback-directed-optimization-fleet-pipeline — fleet profiling → FDO profiles → compile-time + post-compile-time binary optimisation → production re-profiling. Closed-loop. Pays for the whole profiling platform at top-200 scale.

Caveats

  • Announcement-voice / feature-overview post, not a SIGCOMM retrospective. High architectural breadth but shallow depth — the 42 profilers are summarised categorically, not enumerated; concurrency-rule mechanics aren't specified; symbolization-service architecture is sketched not diagrammed.
  • Capacity-savings numbers are for outcomes, not costs. "Up to 20% CPU cycles reduction" and "15,000 servers/year" are headline upside; Strobelight's own CPU / memory / Scribe / Scuba overhead is not disclosed.
  • Dynamic-sampling math is described qualitatively. "Some very simple math" — the exact adjustment formula + stability properties of the daily re-tune are not specified.
  • Symbolization-service scale is not quantified: "all production binaries" are indexed, but no QPS / latency / cache hit-rate / DB size numbers.
  • Stack Schemas + Strobemeta are introduced but not deeply specified — no grammar excerpt for Stack Schemas, no schema / wire-format for Strobemeta.
  • The 42 profilers category breakdown is impressionistic. Memory / function-call-count / event-based (language-specific) / AI-GPU / off-CPU / latency are named — but the distribution, CPU cost, per-service default set, and coverage gaps aren't disclosed.
  • Open-sourcing status is forward-looking. The GitHub org exists; which components are already public vs pending isn't enumerated in the post.
  • No direct failure / incident stories. Unlike the 2024-12-02 crypto-monitoring post (which named Scribe capacity headaches + shutdown-environment nuance as challenges), this post is marketing-positive.

Cross-wiki context

  • Meta axis. Thirteenth first-party Meta Engineering ingest on the wiki. Opens a production-engineering axis on the Meta page distinct from the prior Meta axes (GenAI training, hardware, source control, data warehouse, incident RCA, privacy-aware infrastructure, cryptographic monitoring, hyperscale benchmarking, fleet maintenance, storage-tiering, RTC audio, anti-abuse rule engine). Combined with FBCrypto monitoring, the wiki now has two canonical Meta fleet-telemetry architectures: cryptographic-operation aggregation (volume-reduction via event-tuple counting) and profile-sample collection (volume-reduction via statistical sampling + weighted aggregation). Same substrate (Scribe → Scuba → Hive); different upstream shapes.
  • eBPF axis. The third canonical fleet-wide-production eBPF use-case on the wiki — profiling joins security (Datadog Workload Protection / FIM) and networking (Cloudflare DDoS, Fly.io Sprites NAT, AWS Lambda Geneve NAT). All three leverage the same value proposition: kernel-context visibility without kernel modules, verifier-gated safety, ring-buffer-to-userspace plumbing.
  • Observability axis. Canonical statement of profiling as a first-class observability pillar alongside logs / metrics / traces — a framing concepts/observability previously lacked. Strobelight's "flight recorder" model — always-on, query-when-needed — is the profiling sibling of the existing metrics-always-on / logs-always-on defaults, a regime that was previously only partially represented on the wiki through the Cloudflare/Pingora stack-trace-sampling page.
  • Efficiency / cost axis. Canonical "profiling pays for itself at hyperscale" datum. The LBR → FDO → BOLT + CSSPGO → 10-20% server reduction pipeline is the kind of closed-loop whole-fleet optimisation that the wiki previously only gestured at via Cloudflare's one-function-at-a-time Pingora rewrites. Meta's version is systemic rather than case-by-case.
  • Symbolization axis. First canonical treatment on the wiki of fleet-wide symbolization as an architectural concern. The trade-off — DWARF is rich but heavy; gsym is light but DWARF-derived; blazesym is the open-source runtime — is a small ecosystem the wiki now captures. Complements Fly.io's case-study pattern (patterns/flamegraph-to-upstream-fix) by showing the platform-level investment that makes the case-study feasible at scale.
  • No existing-claim contradictions — strictly additive.
