
PATTERN Cited by 2 sources

Feedback-directed optimization fleet pipeline

Context

A hyperscale operator runs large C++ (or other compiled-language) services across thousands or millions of hosts. Static compiler heuristics leave measurable capacity efficiency on the table — frontend-bound workloads exhibit code-layout and inlining decisions that don't match actual execution frequencies. Manually tuning each service is not feasible; the optimisation pipeline must be continuous, automated, and fleet-wide.

Problem

  • Static compiler optimisation is frequency-blind — it doesn't know which branches, callsites, or functions dominate production.
  • Per-service manual profile collection is too expensive at thousands-of-services scale.
  • Profile data goes stale between releases; a one-off profile pass has diminishing returns as the codebase evolves.
  • The win available from FDO is substantial (a 10-20% CPU reduction at Meta scale), but only if the pipeline runs continuously.

Solution

Build an end-to-end closed loop that:

  1. Continuously samples execution profiles from production binaries.
  2. Aggregates profiles fleet-wide so the highest-traffic code paths are weighted by real traffic.
  3. Feeds profiles to both compile-time (PGO / CSSPGO) and post-link (BOLT) optimisers.
  4. Ships optimised binaries back to the fleet.
  5. Re-measures to close the loop.
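Steps 1-2 reduce to LBR-based sampling with Linux perf. A minimal sketch of what a fleet profiler runs per host (paths, sampling duration, and the service binary are illustrative, not Strobelight's actual internals):

```shell
# Sample user-space cycles with last-branch records (LBR) across all CPUs.
# A fleet profiler like Strobelight wraps something equivalent to this.
perf record -e cycles:u -j any,u -a -o /tmp/perf.data -- sleep 30

# Convert the raw samples into a BOLT-consumable profile for one binary.
# perf2bolt ships with LLVM's BOLT; the service path is hypothetical.
perf2bolt -p /tmp/perf.data -o /tmp/service.fdata /opt/service/bin/service
```

Aggregation (step 2) then merges the per-host profiles, so the hottest code paths are weighted by real fleet traffic rather than by any single host.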

Architecture

  Production fleet (thousands of hosts)
         │  (continuous LBR / perf sampling)
  ┌─────────────────────────┐
  │    Fleet profiler       │  e.g. Strobelight
  │  (Linux perf + LBR)     │
  └──────────┬──────────────┘
             │  (aggregated .profdata / .fdata)
  ┌─────────────────────────┐       ┌────────────────────────┐
  │   Build / release       │──────▶│   Compile-time FDO     │
  │   pipeline              │       │   (clang CSSPGO)       │
  └──────────┬──────────────┘       └──────────┬─────────────┘
             │                                 │
             │                                 ▼
             │                     ┌────────────────────────┐
             │                     │   Linked binary        │
             │                     └──────────┬─────────────┘
             │                                │
             │                                ▼
             │                     ┌────────────────────────┐
             │                     │   Post-link FDO        │
             │                     │   (BOLT)               │
             │                     └──────────┬─────────────┘
             │                                │
             ▼                                ▼
  ┌────────────────────────────────────────────────────┐
  │          Optimised binary deployed to fleet        │
  └────────────────────────────────────────────────────┘
                      (close the loop)
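In clang/LLVM terms, the two optimiser stages in the diagram correspond roughly to the following invocations. Flag sets vary by LLVM/BOLT version, and all paths here are illustrative:

```shell
# Compile-time FDO: feed the aggregated sample profile to clang.
# CSSPGO needs pseudo-probes at build time; --emit-relocs keeps
# relocations in the linked binary so BOLT can rewrite it later.
clang++ -O2 -fpseudo-probe-for-profiling \
        -fprofile-sample-use=/profiles/service.prof \
        -Wl,--emit-relocs \
        -o service service.cpp

# Post-link FDO: BOLT re-lays-out the linked binary using the samples.
llvm-bolt service -o service.bolt \
    -data=/profiles/service.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions
```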

Canonical exemplar: Meta

Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology.

  Stage              Component
  ────────────────────────────────────────────────────────────────────
  Fleet profiler     Strobelight (continuous LBR sampling, open-sourced)
  Compile-time FDO   CSSPGO (Context-Sensitive Sample-based PGO) in clang / LLVM
  Post-link FDO      BOLT
  Deployment         Meta's internal build + release tooling

Measured win: "up to 20% reduction in CPU cycles, which equates to a 10-20% reduction in the number of servers needed to run these services at Meta" on the top-200 largest services. This is the economic datum that pays for Strobelight as a platform — profiling is not a cost centre when the savings are directly measurable in server count.
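The context-sensitive sample profile that clang consumes is generated from the raw perf data with llvm-profgen, which ships with LLVM. A sketch with illustrative paths; the binary must have been built with -fpseudo-probe-for-profiling for full context sensitivity:

```shell
# Turn raw LBR samples into a CSSPGO sample profile suitable for
# clang's -fprofile-sample-use flag.
llvm-profgen --perfdata=/tmp/perf.data \
             --binary=/opt/service/bin/service \
             --output=/profiles/service.prof
```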

When this pattern fits

Requirements for the fleet-scale FDO pipeline to be worth building:

  • Fleet size — hundreds of servers or more per service. Below that scale, per-service manual PGO is cheaper.
  • Codebase size — millions of LoC C++ / Rust / Go / Swift where static heuristics are measurably suboptimal.
  • Continuous-profiling infrastructure — either already in place (observability team) or worth building (see Strobelight's path from ad-hoc to always-on).
  • LLVM expertise — the post-link BOLT step carries brittleness risk (see Redpanda 2026-04-02 case); staffing to debug binary-layout regressions is needed.

When PGO-only suffices

If fleet-scale continuous profiling isn't feasible, the pattern degrades to per-service PGO with a staging-workload training phase:

  • Build pipeline: two-phase clang PGO compile.
  • Training: representative benchmark in staging.
  • Profile cadence: regenerate at each major release.

This is Redpanda's 26.1 approach (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): one binary, a staged workload, instrumented-mode PGO. It yields roughly a 10-15% CPU reduction and a 47% p999 latency improvement on the small-batch benchmark, and it doesn't require a Strobelight-scale continuous-profiling platform.
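The two-phase instrumented-mode pipeline reduces to three standard clang/LLVM steps. A sketch with illustrative paths and a hypothetical benchmark flag:

```shell
# Phase 1: build instrumented, then run the representative staging workload.
clang++ -O2 -fprofile-generate=/tmp/pgo service.cpp -o service-instr
./service-instr --staging-benchmark    # hypothetical training run

# Merge the raw counter files into one profile.
llvm-profdata merge -output=/tmp/default.profdata /tmp/pgo/*.profraw

# Phase 2: rebuild with the merged profile applied.
clang++ -O2 -fprofile-use=/tmp/default.profdata service.cpp -o service
```

Regenerating the profile at each major release is what keeps this degraded mode from sliding into the one-shot anti-pattern below.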

For the per-binary version without the fleet loop, see patterns/pgo-for-frontend-bound-application.

Composability

The pattern composes with:

Anti-patterns

  • One-shot FDO — collect profile once, rebuild once, done. Profiles go stale; wins erode with each code change.
  • Training workload mismatch — running the PGO training in staging against a non-representative workload. The compiler optimises the wrong paths.
  • Applying BOLT without LLVM expertise — the binary-modification brittleness can bite at the worst time. Redpanda's llvm-project#169899 encounter is the canonical caution.
  • Ignoring build-time cost — 2× compile on a multi-hour build is material. Fleet-scale FDO needs build-infrastructure investment.

Seen in
