CONCEPT Cited by 2 sources
Feedback-directed optimization¶
Definition¶
Feedback-directed optimization (FDO) is the umbrella family of compiler and binary-optimisation techniques in which actual runtime execution data is fed back into the compilation, linking, or post-link pipeline to drive optimisation decisions that would otherwise rely on static heuristics.
The FDO family includes:
- Profile-guided optimization (PGO) — compile-time FDO; profile feeds the compiler.
- BOLT / post-link binary optimisers — post-link FDO; profile feeds a standalone tool that rewrites the linked binary.
- AutoFDO — sampling-based PGO variant; profile comes from Linux `perf` on unmodified production binaries.
- CSSPGO — Context-Sensitive Sample-based PGO, Meta's canonical fleet-scale variant.
- LBR-based FDO — uses the Last Branch Record CPU feature for very-low-overhead branch-frequency data.
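The sampling-based legs of the family (AutoFDO, LBR-based FDO) can be sketched as a short pipeline. This is a minimal sketch, not a prescribed workflow: `my-service` and the file names are placeholders, while the flags are standard Linux `perf`, AutoFDO (`create_llvm_prof`), and Clang usage.

```shell
# 1. Sample branch records (LBR) from the unmodified production binary.
#    -b records branch stacks; -p attaches to the running process.
perf record -b -e cycles:u -p "$(pidof my-service)" -- sleep 30

# 2. Convert the perf samples into an LLVM sample profile.
#    create_llvm_prof ships with the AutoFDO tools.
create_llvm_prof --binary=./my-service --profile=perf.data \
    --out=my-service.afdo

# 3. Recompile with the sample profile; no instrumented build is needed.
clang++ -O2 -fprofile-sample-use=my-service.afdo -o my-service main.cpp
```

The key property this illustrates is that collection happens on the binary you already ship, which is what makes fleet-wide continuous profiling viable.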
FDO is distinguished from traditional optimisation by its information source: measurement, not assumption.
The canonical FDO pipeline¶
A mature FDO deployment has four stages:
- Profile collection — either instrumented or sampling mode. Fleet-wide continuous sampling is the scale-preferred shape (Meta's Strobelight); instrumented runs against a staging workload are the setup-preferred shape (Redpanda's 26.1 approach).
- Profile aggregation / validation — merge profiles from many hosts; validate coverage; age-out stale data.
- Optimisation pass — consume the profile at compile time (PGO / CSSPGO) or post-link time (BOLT).
- Deployment — ship the optimised binary; measure the win; close the loop with fresh profile collection.
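For the instrumented compile-time case, the four stages above map onto a concrete Clang workflow. A minimal sketch, assuming a single-host training run; binary and file names are placeholders, flags are standard Clang / `llvm-profdata` usage.

```shell
# 1. Profile collection: build instrumented, run the training workload.
clang++ -O2 -fprofile-generate=./profraw -o app-instr main.cpp
./app-instr < training-workload.input     # emits ./profraw/*.profraw

# 2. Profile aggregation: merge raw profiles (from one or many hosts).
llvm-profdata merge -output=app.profdata ./profraw/*.profraw

# 3. Optimisation pass: recompile consuming the merged profile.
clang++ -O2 -fprofile-use=app.profdata -o app main.cpp

# 4. Deployment: ship ./app, measure the win, then collect fresh
#    profiles from the new binary to close the loop.
```

At fleet scale, stage 2 is where aggregation, coverage validation, and staleness ageing live; the single `llvm-profdata merge` here stands in for that whole service.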
For the fleet-scale composition of these stages, see patterns/feedback-directed-optimization-fleet-pipeline.
The pattern of wins¶
FDO's measured wins across different deployments (rough order of magnitude):
| Deployment | Measured improvement |
|---|---|
| Redpanda Streaming 26.1 (C++, PGO, small-batch) | 47% p999 latency, 15% CPU reactor util, 10-15% overall efficiency (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization) |
| Meta fleet (CSSPGO + BOLT, top-200 services) | Up to 20% CPU cycles, 10-20% server reduction (Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology) |
| Generic frontend-bound C++ service | 5-15% typical |
Wins concentrate on frontend-bound workloads where the hot path has many functions, deep inlining choices, and complex control flow — where static heuristics are weakest and profile data is most valuable.
Why FDO pays for itself¶
FDO's engineering investment (build-pipeline changes, profile storage, cadence management) is offset by fleet-scale capacity savings:
- At Meta scale, 10-20% server reduction on the top-200 services is "the economic datum that pays for Strobelight as a platform" (from systems/strobelight overview).
- At Redpanda-Cloud scale, 15% CPU reactor utilisation improvement directly reduces the number of vCPU-hours billed per cluster — material to Redpanda's cell-based cost model.
FDO fits the offensive performance engineering framing: rather than defending against a specific regression, FDO makes the hot binary systematically faster by extracting information the compiler doesn't have access to by default.
Trade-offs vs traditional optimisation¶
| Axis | Static optimisation | FDO |
|---|---|---|
| Input | Source + heuristics | Source + heuristics + runtime profile |
| Build-time cost | Baseline | 2× (PGO) or baseline + post-link pass (BOLT) |
| Infra cost | None | Profile collection + storage |
| Stability | Deterministic from source | Profile-dependent |
| Maintenance | None | Profile freshness cadence |
| Typical win | 0 (you already run this) | 5-20% on hot paths |
| Coverage | Every binary | Only profiled binaries |
Getting started¶
A pragmatic FDO adoption path for a C++ codebase:
- Pick a single hot-path binary — the one where capacity savings matter most.
- Add TMA measurement — Linux `perf` or equivalent. Confirm the workload is frontend-bound enough to reward FDO. See patterns/tma-guided-optimization-target-selection.
- Choose PGO or BOLT — PGO for stability; BOLT for build-time economy and when LLVM expertise is available.
- Set up a training workload — a representative production-like benchmark; this is the profile-collection input.
- Validate end-to-end — measure the same TMA categories before and after; look for the frontend-bound percentage to drop.
- Automate the build — ship the profile-collection → recompile cycle behind a CI flag that can be toggled per-release.
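If the BOLT branch is chosen at step 3, the measure → optimise → re-measure loop looks roughly like this. A hedged sketch: `app` and the workload file are placeholders, and exact `llvm-bolt` flag names vary somewhat between LLVM releases; the invocations shown are standard `perf`, `perf2bolt`, and BOLT usage.

```shell
# Step 2: confirm the workload is frontend-bound via Top-down
# Microarchitecture Analysis (level-1 metric group).
perf stat -M TopdownL1 -- ./app < training-workload.input

# Step 4: collect LBR samples from the linked binary under the
# training workload (-j any,u records user-space branch stacks).
perf record -e cycles:u -j any,u -- ./app < training-workload.input

# Step 3 (post-link): convert samples to BOLT's profile format,
# then rewrite the binary with layout optimisations.
perf2bolt -p perf.data -o app.fdata ./app
llvm-bolt ./app -o app.bolt -data=app.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold

# Step 5: re-run the same TMA categories on the optimised binary;
# the frontend-bound percentage should drop.
perf stat -M TopdownL1 -- ./app.bolt < training-workload.input
```

Note that BOLT needs the binary to be relinked with relocations preserved (`--emit-relocs`) for full function reordering; that linker flag belongs in the CI toggle from step 6.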
Seen in¶
- sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization — Redpanda 26.1 PGO rollout with TMA measurements and explicit PGO-vs-BOLT trade-off analysis.
- sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology — Meta's fleet-scale FDO via Strobelight → CSSPGO + BOLT.
Related¶
- concepts/profile-guided-optimization — the compile-time subfamily.
- concepts/llvm-bolt-post-link-optimizer — the post-link subfamily.
- concepts/hot-cold-code-splitting / concepts/instruction-cache-locality — the mechanisms FDO exploits.
- concepts/instrumented-vs-sampling-profile — the profile-collection shapes.
- concepts/offense-defense-performance-engineering — the broader performance-engineering framing.
- concepts/capacity-efficiency — the economic payoff.
- systems/clang / systems/llvm-bolt / systems/meta-bolt-binary-optimizer / systems/strobelight — the tooling.
- systems/redpanda — Tier-3 canonical example.
- patterns/feedback-directed-optimization-fleet-pipeline — the Meta-scale composition.
- patterns/pgo-for-frontend-bound-application — the diagnose-then-apply pattern.