
PATTERN Cited by 2 sources

Feedback-directed optimization fleet pipeline

Context

A hyperscale operator runs large C++ (or other compiled-language) services across thousands or millions of hosts. Static compiler heuristics leave measurable capacity efficiency on the table — frontend-bound workloads exhibit code-layout and inlining decisions that don't match actual execution frequencies. Manually tuning each service is not feasible; the optimisation pipeline must be continuous, automated, and fleet-wide.

Problem

  • Static compiler optimisation is frequency-blind — it doesn't know which branches, callsites, or functions dominate production.
  • Per-service manual profile collection is too expensive at thousands-of-services scale.
  • Profile data goes stale between releases; a one-off profile pass has diminishing returns as the codebase evolves.
  • The win available from FDO is substantial (a 10-20% CPU reduction at Meta scale), but only if the pipeline runs continuously.

Solution

Build an end-to-end closed loop that:

  1. Continuously samples execution profiles from production binaries.
  2. Aggregates profiles fleet-wide so the highest-traffic code paths are weighted by real traffic.
  3. Feeds profiles to both compile-time (PGO / CSSPGO) and post-link (BOLT) optimisers.
  4. Ships optimised binaries back to the fleet.
  5. Re-measures to close the loop.
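Steps 1-2 reduce to LBR-based sampling with Linux perf. A minimal sketch of what a fleet profiler runs per host (paths, sampling duration, and the service binary are illustrative, not Strobelight's actual internals):

```shell
# Sample user-space cycles with last-branch records (LBR) across all CPUs.
# A fleet profiler like Strobelight wraps something equivalent to this.
perf record -e cycles:u -j any,u -a -o /tmp/perf.data -- sleep 30

# Convert the raw samples into a BOLT-consumable profile for one binary.
# perf2bolt ships with LLVM's BOLT; the service path is hypothetical.
perf2bolt -p /tmp/perf.data -o /tmp/service.fdata /opt/service/bin/service
```

Aggregation (step 2) then merges the per-host profiles, so the hottest code paths are weighted by real fleet traffic rather than by any single host.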

Architecture

  Production fleet (thousands of hosts)
         │  (continuous LBR / perf sampling)
  ┌─────────────────────────┐
  │    Fleet profiler       │  e.g. Strobelight
  │  (Linux perf + LBR)     │
  └──────────┬──────────────┘
             │  (aggregated .profdata / .fdata)
  ┌─────────────────────────┐       ┌────────────────────────┐
  │   Build / release       │──────▶│   Compile-time FDO     │
  │   pipeline              │       │   (clang CSSPGO)       │
  └──────────┬──────────────┘       └──────────┬─────────────┘
             │                                 │
             │                                 ▼
             │                     ┌────────────────────────┐
             │                     │   Linked binary        │
             │                     └──────────┬─────────────┘
             │                                │
             │                                ▼
             │                     ┌────────────────────────┐
             │                     │   Post-link FDO        │
             │                     │   (BOLT)               │
             │                     └──────────┬─────────────┘
             │                                │
             ▼                                ▼
  ┌────────────────────────────────────────────────────┐
  │          Optimised binary deployed to fleet        │
  └────────────────────────────────────────────────────┘
                      (close the loop)
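In clang/LLVM terms, the two optimiser stages in the diagram correspond roughly to the following invocations. Flag sets vary by LLVM/BOLT version, and all paths here are illustrative:

```shell
# Compile-time FDO: feed the aggregated sample profile to clang.
# CSSPGO needs pseudo-probes at build time; --emit-relocs keeps
# relocations in the linked binary so BOLT can rewrite it later.
clang++ -O2 -fpseudo-probe-for-profiling \
        -fprofile-sample-use=/profiles/service.prof \
        -Wl,--emit-relocs \
        -o service service.cpp

# Post-link FDO: BOLT re-lays-out the linked binary using the samples.
llvm-bolt service -o service.bolt \
    -data=/profiles/service.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions
```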

Canonical exemplar: Meta

Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology.

  Stage              Component
  ────────────────────────────────────────────────────────────────────
  Fleet profiler     Strobelight (continuous LBR sampling, open-sourced)
  Compile-time FDO   CSSPGO (Context-Sensitive Sample-based PGO) in clang / LLVM
  Post-link FDO      BOLT
  Deployment         Meta's internal build + release tooling

Measured win: "up to 20% reduction in CPU cycles, which equates to a 10-20% reduction in the number of servers needed to run these services at Meta" on the top-200 largest services. This is the economic datum that pays for Strobelight as a platform — profiling is not a cost centre when the savings are directly measurable in server count.
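The context-sensitive sample profile that clang consumes is generated from the raw perf data with llvm-profgen, which ships with LLVM. A sketch with illustrative paths; the binary must have been built with -fpseudo-probe-for-profiling for full context sensitivity:

```shell
# Turn raw LBR samples into a CSSPGO sample profile suitable for
# clang's -fprofile-sample-use flag.
llvm-profgen --perfdata=/tmp/perf.data \
             --binary=/opt/service/bin/service \
             --output=/profiles/service.prof
```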

When this pattern fits

Requirements for the fleet-scale FDO pipeline to be worth building:

  • Fleet size — hundreds of servers or more per service. Below that scale, per-service manual PGO is cheaper.
  • Codebase size — millions of LoC C++ / Rust / Go / Swift where static heuristics are measurably suboptimal.
  • Continuous-profiling infrastructure — either already in place (observability team) or worth building (see Strobelight's path from ad-hoc to always-on).
  • LLVM expertise — the post-link BOLT step carries brittleness risk (see Redpanda 2026-04-02 case); staffing to debug binary-layout regressions is needed.

When PGO-only suffices

If fleet-scale continuous profiling isn't feasible, the pattern degrades to per-service PGO with a staging-workload training phase:

  • Build pipeline: two-phase clang PGO compile.
  • Training: representative benchmark in staging.
  • Profile cadence: regenerate at each major release.

This is Redpanda's 26.1 approach (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): one binary, a staged workload, instrumented-mode PGO. It yields roughly a 10-15% CPU reduction and a 47% p999 latency improvement on the small-batch benchmark, and it doesn't require a Strobelight-scale continuous-profiling platform.
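The two-phase instrumented-mode pipeline reduces to three standard clang/LLVM steps. A sketch with illustrative paths and a hypothetical benchmark flag:

```shell
# Phase 1: build instrumented, then run the representative staging workload.
clang++ -O2 -fprofile-generate=/tmp/pgo service.cpp -o service-instr
./service-instr --staging-benchmark    # hypothetical training run

# Merge the raw counter files into one profile.
llvm-profdata merge -output=/tmp/default.profdata /tmp/pgo/*.profraw

# Phase 2: rebuild with the merged profile applied.
clang++ -O2 -fprofile-use=/tmp/default.profdata service.cpp -o service
```

Regenerating the profile at each major release is what keeps this degraded mode from sliding into the one-shot anti-pattern below.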

For the per-binary version without the fleet loop, see patterns/pgo-for-frontend-bound-application.

Composability

The pattern composes with:

Anti-patterns

  • One-shot FDO — collect profile once, rebuild once, done. Profiles go stale; wins erode with each code change.
  • Training workload mismatch — running the PGO training in staging against a non-representative workload. The compiler optimises the wrong paths.
  • Applying BOLT without LLVM expertise — the binary-modification brittleness can bite at the worst time. Redpanda's llvm-project#169899 encounter is the canonical caution.
  • Ignoring build-time cost — 2× compile on a multi-hour build is material. Fleet-scale FDO needs build-infrastructure investment.

Seen in
