PATTERN Cited by 2 sources
Feedback-directed optimization fleet pipeline¶
Context¶
A hyperscale operator runs large C++ (or other compiled-language) services across thousands or millions of hosts. Static compiler heuristics leave measurable capacity efficiency on the table — in frontend-bound workloads, code-layout and inlining decisions don't match actual execution frequencies. Manually tuning each service is not feasible; the optimisation pipeline must be continuous, automated, and fleet-wide.
Problem¶
- Static compiler optimisation is frequency-blind — it doesn't know which branches, callsites, or functions dominate production.
- Per-service manual profile collection is too expensive at thousands-of-services scale.
- Profile data goes stale between releases; a one-off profile pass has diminishing returns as the codebase evolves.
- The win available from FDO is substantial (10-20% CPU at Meta scale) but only if the pipeline runs continuously.
Solution¶
Build an end-to-end closed loop that:
- Continuously samples execution profiles from production binaries.
- Aggregates profiles fleet-wide so the highest-traffic code paths are weighted by real traffic.
- Feeds profiles to both compile-time (PGO / CSSPGO) and post-link (BOLT) optimisers.
- Ships optimised binaries back to the fleet.
- Re-measures to close the loop.
Architecture¶
```
Production fleet (thousands of hosts)
        │
        │ (continuous LBR / perf sampling)
        ▼
┌─────────────────────────┐
│     Fleet profiler      │  e.g. Strobelight
│   (Linux perf + LBR)    │
└──────────┬──────────────┘
           │
           │ (aggregated .profdata / .fdata)
           ▼
┌─────────────────────────┐       ┌────────────────────────┐
│   Build / release       │──────▶│   Compile-time FDO     │
│   pipeline              │       │   (clang CSSPGO)       │
└──────────┬──────────────┘       └──────────┬─────────────┘
           │                                 │
           │                                 ▼
           │                      ┌────────────────────────┐
           │                      │     Linked binary      │
           │                      └──────────┬─────────────┘
           │                                 │
           │                                 ▼
           │                      ┌────────────────────────┐
           │                      │    Post-link FDO       │
           │                      │       (BOLT)           │
           │                      └──────────┬─────────────┘
           │                                 │
           ▼                                 ▼
┌────────────────────────────────────────────────────┐
│        Optimised binary deployed to fleet          │
└────────────────────────────────────────────────────┘
                  (close the loop)
```
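At the tool level, one iteration of this loop can be sketched with stock perf/LLVM commands. This is a hedged sketch: the binary name `svc`, the 30-second sampling window, and the source layout are illustrative, and exact flag spellings vary by LLVM version.

```sh
# 1. Sample a running production process with LBR (last branch records).
perf record -e cycles:u -j any,u -o perf.data -p "$(pgrep -o svc)" -- sleep 30

# 2a. Convert the samples into a compile-time sample profile for CSSPGO.
llvm-profgen --binary=./svc --perfdata=perf.data --output=svc.sampleprof

# 2b. Convert the same samples into BOLT's post-link format.
perf2bolt -p perf.data -o perf.fdata ./svc

# 3. Rebuild with sample-based PGO. -fpseudo-probe-for-profiling enables
#    the CSSPGO probes; --emit-relocs keeps the binary BOLT-able.
clang++ -O2 -fpseudo-probe-for-profiling \
        -fprofile-sample-use=svc.sampleprof \
        -fuse-ld=lld -Wl,--emit-relocs svc.cpp -o svc

# 4. Post-link optimise code layout with BOLT.
llvm-bolt ./svc -o ./svc.bolt -data=perf.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold

# 5. Deploy ./svc.bolt; the next sampling pass closes the loop.
```

In the fleet pipeline, step 1 runs continuously via the fleet profiler and step 2's aggregation spans many hosts; the single-host commands above show the shape of each stage, not the fleet orchestration.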
Canonical exemplar: Meta¶
Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology.
| Stage | Component |
|---|---|
| Fleet profiler | Strobelight (continuous LBR sampling, open-sourced) |
| Compile-time FDO | CSSPGO (Context-Sensitive Sample-based PGO) in clang / LLVM |
| Post-link FDO | BOLT |
| Deployment | Meta's internal build + release tooling |
Measured win: "up to 20% reduction in CPU cycles, which equates to a 10-20% reduction in the number of servers needed to run these services at Meta" on the top-200 largest services. This is the economic datum that pays for Strobelight as a platform — profiling is not a cost centre when the savings are directly measurable in server count.
When this pattern fits¶
Requirements for the fleet-scale FDO pipeline to be worth building:
- Fleet size — hundreds+ of servers per service. Below this, per-service manual PGO is cheaper.
- Codebase size — millions of LoC C++ / Rust / Go / Swift where static heuristics are measurably suboptimal.
- Continuous-profiling infrastructure — either already in place (observability team) or worth building (see Strobelight's path from ad-hoc to always-on).
- LLVM expertise — the post-link BOLT step carries brittleness risk (see Redpanda 2026-04-02 case); staffing to debug binary-layout regressions is needed.
When PGO-only suffices¶
If fleet-scale continuous profiling isn't feasible, the pattern degrades to per-service PGO with a staging-workload training phase:
- Build pipeline: two-phase clang PGO compile.
- Training: representative benchmark in staging.
- Profile cadence: regenerate at each major release.
This is Redpanda's 26.1 approach (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization) — one binary, staged workload, instrumented-mode PGO. It yields a ~10-15% CPU reduction and a 47% p999 latency improvement on the small-batch benchmark, without requiring a Strobelight-scale continuous-profiling platform.
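The two-phase instrumented build can be sketched with stock clang tooling. A hedged sketch: the file name `svc.cpp` and the `--replay staging-workload` training invocation are illustrative stand-ins, not Redpanda's actual build wiring.

```sh
# Phase 1: instrumented build, then a training run against the staging
# workload. LLVM_PROFILE_FILE names the raw counter files (%p = pid).
clang++ -O2 -fprofile-instr-generate svc.cpp -o svc_instr
LLVM_PROFILE_FILE=svc-%p.profraw ./svc_instr --replay staging-workload

# Merge the raw counters into one consumable profile.
llvm-profdata merge -output=svc.profdata svc-*.profraw

# Phase 2: optimised rebuild consuming the profile.
clang++ -O2 -fprofile-instr-use=svc.profdata svc.cpp -o svc
```

The "profile cadence" bullet above corresponds to re-running both phases at each major release, so the profile never drifts more than one release behind the code.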
For the per-binary version without the fleet loop, see patterns/pgo-for-frontend-bound-application.
Composability¶
The pattern composes with:
- Capacity efficiency at Meta (concepts/capacity-efficiency) — FDO is the canonical offense-side lever (concepts/offense-defense-performance-engineering).
- AI-driven optimisation suggestions — the 2026-04-16 Capacity Efficiency Platform frames FDO as one of many offensive automated loops.
- TMA-guided target selection (patterns/tma-guided-optimization-target-selection) — TMA identifies which services are frontend-bound enough to be worth the FDO investment.
Anti-patterns¶
- One-shot FDO — collect profile once, rebuild once, done. Profiles go stale; wins erode with each code change.
- Training workload mismatch — running the PGO training in staging against a non-representative workload. The compiler optimises the wrong paths.
- Applying BOLT without LLVM expertise — the binary-modification brittleness can bite at the worst time. Redpanda's llvm-project#169899 encounter is the canonical caution.
- Ignoring build-time cost — 2× compile on a multi-hour build is material. Fleet-scale FDO needs build-infrastructure investment.
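For the one-shot / stale-profile anti-pattern, clang can at least surface profile drift at build time. A sketch for instrumented-mode profiles (the warning flags below are clang's; treating staleness as a build error is a local policy choice, not something the source prescribes):

```sh
# Warn when a function changed since the profile was collected
# (-Wprofile-instr-out-of-date) or was never executed during training
# (-Wprofile-instr-unprofiled); escalate staleness to a build failure.
clang++ -O2 -fprofile-instr-use=svc.profdata \
        -Wprofile-instr-out-of-date -Wprofile-instr-unprofiled \
        -Werror=profile-instr-out-of-date \
        svc.cpp -o svc
```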
Seen in¶
- sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology — canonical wiki pattern instance. Meta's Strobelight → CSSPGO + BOLT pipeline; 10-20% CPU reduction on top-200 services.
- sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization — per-binary degraded variant; PGO-only (no BOLT, no fleet-wide continuous profiling) at Tier-3 vendor scale. The non-fleet shape of the same underlying pattern.
Related¶
- patterns/pgo-for-frontend-bound-application — the per-binary apply pattern.
- patterns/tma-guided-optimization-target-selection — the diagnostic-before-apply methodology.
- patterns/measurement-driven-micro-optimization — the runtime-language altitude sibling.
- concepts/feedback-directed-optimization — the umbrella family.
- concepts/profile-guided-optimization — the compile-time consumer.
- concepts/llvm-bolt-post-link-optimizer — the post-link consumer.
- concepts/instrumented-vs-sampling-profile — the profile-collection shapes (sampling is canonical for fleet-scale).
- concepts/capacity-efficiency — the economic framing.
- concepts/offense-defense-performance-engineering — the two-sided performance-engineering framing.
- concepts/hot-cold-code-splitting / concepts/instruction-cache-locality — the transformations.
- concepts/frontend-bound-vs-backend-bound-cpu-stall — the TMA axis FDO targets.
- systems/strobelight / systems/meta-bolt-binary-optimizer / systems/llvm-bolt / systems/clang / systems/linux-perf — the tooling.
- systems/redpanda — Tier-3 degraded-variant adopter.
- companies/meta — canonical exemplar.