

Feedback-directed optimization

Definition

Feedback-directed optimization (FDO) is the umbrella term for compiler and binary-optimization techniques that feed actual runtime execution data back into the compilation, linking, or post-link pipeline to drive optimization decisions that would otherwise rely on static heuristics.

The FDO family includes:

  • Profile-guided optimization (PGO) — compile-time FDO; the profile feeds the compiler.
  • BOLT / post-link binary optimizers — post-link FDO; the profile feeds a standalone tool that rewrites the linked binary.
  • AutoFDO — sampling-based PGO variant; the profile comes from Linux perf on unmodified production binaries.
  • CSSPGO — Context-Sensitive Sample-based PGO, Meta's canonical fleet-scale variant.
  • LBR-based FDO — uses the Last Branch Record CPU feature for low-overhead branch-frequency data.

FDO is distinguished from traditional optimization by its information source: measurement, not assumption.

The canonical FDO pipeline

A mature FDO deployment has four stages:

  1. Profile collection — either instrumented or sampling mode. Fleet-wide continuous sampling scales best (Meta's Strobelight); an instrumented run against a staging workload is easiest to set up (Redpanda's 26.1 approach).
  2. Profile aggregation / validation — merge profiles from many hosts, validate coverage, and age out stale data.
  3. Optimization pass — consume the profile at compile time (PGO / CSSPGO) or post-link time (BOLT).
  4. Deployment — ship the optimised binary; measure the win; close the loop with fresh profile collection.
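Stage 2 above is the least visible but easiest to get wrong. A minimal sketch of aggregation with coverage validation and age-out — the record shape, field names, and thresholds here are illustrative assumptions, not any real profile format:

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """One host's profile. Fields are illustrative, not a real format."""
    age_days: int                                # how old this host's data is
    counts: dict = field(default_factory=dict)   # function name -> sample count

def aggregate(profiles, max_age_days=14, min_hosts=2):
    """Merge per-host profiles, dropping stale ones and checking coverage."""
    fresh = [p for p in profiles if p.age_days <= max_age_days]
    if len(fresh) < min_hosts:
        raise ValueError("too few fresh profiles to trust the merge")
    merged = {}
    for p in fresh:
        for fn, n in p.counts.items():
            merged[fn] = merged.get(fn, 0) + n
    return merged
```

Real deployments (e.g. llvm-profdata merge behind a fleet collector) do the same three things at scale: filter by freshness, refuse to proceed on thin coverage, and sum counts.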

For the fleet-scale composition of these stages, see patterns/feedback-directed-optimization-fleet-pipeline.

The pattern of wins

FDO's measured wins across different deployments (rough order of magnitude):

  • Redpanda Streaming 26.1 (C++, PGO, small-batch workload) — 47% p999 latency improvement, 15% reactor CPU utilization improvement, 10-15% overall efficiency gain (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization)
  • Meta fleet (CSSPGO + BOLT, top-200 services) — up to 20% CPU-cycle reduction, 10-20% server reduction (Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology)
  • Generic frontend-bound C++ service — 5-15% typical

Wins concentrate on frontend-bound workloads where the hot path has many functions, deep inlining choices, and complex control flow — where static heuristics are weakest and profile data is most valuable.

Why FDO pays for itself

FDO's engineering investment (build-pipeline changes, profile storage, cadence management) is offset by fleet-scale capacity savings:

  • At Meta scale, 10-20% server reduction on the top-200 services is "the economic datum that pays for Strobelight as a platform" (from systems/strobelight overview).
  • At Redpanda-Cloud scale, a 15% reactor CPU utilization improvement directly reduces the number of vCPU-hours billed per cluster — material to Redpanda's cell-based cost model.
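The back-of-envelope arithmetic is simple. In this sketch only the 15% CPU improvement comes from the Redpanda figures above; the fleet size and hourly rate are made-up illustration values:

```python
# Capacity-savings arithmetic for a CPU-utilization win.
vcpus = 1000                 # ASSUMED fleet-wide vCPUs for one cluster tier
hourly_rate = 0.05           # ASSUMED $ per vCPU-hour
cpu_improvement = 0.15       # measured reactor CPU utilization win (Redpanda)

vcpu_hours_saved_per_year = vcpus * cpu_improvement * 24 * 365
dollars_saved = vcpu_hours_saved_per_year * hourly_rate
print(round(vcpu_hours_saved_per_year), round(dollars_saved))
```

At these assumed rates the annual saving is comfortably larger than the one-time engineering cost of wiring profiles into the build, which is the general shape of the FDO business case.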

FDO fits the offensive performance engineering framing: rather than defending against a specific regression, FDO makes the hot binary systematically faster by extracting information the compiler doesn't have access to by default.

Trade-offs vs traditional optimisation

Axis             Static optimization          FDO
Input            Source + heuristics          Source + heuristics + runtime profile
Build-time cost  Baseline                     ~2× builds (instrumented PGO) or baseline + post-link pass (BOLT)
Infra cost       None                         Profile collection + storage
Stability        Deterministic from source    Profile-dependent
Maintenance      None                         Profile-freshness cadence
Typical win      0 (you already run this)     5-20% on hot paths
Coverage         Every binary                 Only profiled binaries

Getting started

A pragmatic FDO adoption path for a C++ codebase:

  1. Pick a single hot-path binary — the one where capacity savings matter most.
  2. Add TMA measurement — top-down microarchitecture analysis via Linux perf or equivalent. Confirm the workload is frontend-bound enough to reward FDO. See patterns/tma-guided-optimization-target-selection.
  3. Choose PGO or BOLT — PGO for stability; BOLT for build-time economy, provided LLVM tooling expertise is available.
  4. Set up a training workload — a representative production-like benchmark; this is the profile-collection input.
  5. Validate end-to-end — measure the same TMA categories before and after; look for the frontend-bound percentage to drop.
  6. Automate the build — ship the profile-collection → recompile cycle behind a CI flag that can be toggled per-release.
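Step 2's go/no-go decision can be encoded as a simple threshold on the TMA level-1 breakdown. The 30% cutoff below is an illustrative assumption, not a published rule:

```python
def looks_frontend_bound(tma, threshold=0.30):
    """Decide whether a workload looks frontend-bound enough to reward FDO.

    `tma` maps the four TMA level-1 categories to their cycle shares.
    The threshold is an illustrative cutoff, not a published rule.
    """
    total = sum(tma.values())
    if not 0.99 <= total <= 1.01:          # level-1 shares should sum to ~100%
        raise ValueError("TMA level-1 shares must sum to 1.0")
    return tma["frontend_bound"] >= threshold

# A workload spending 35% of its pipeline slots frontend-bound is a
# good FDO candidate under this cutoff:
print(looks_frontend_bound({
    "frontend_bound": 0.35, "backend_bound": 0.40,
    "bad_speculation": 0.05, "retiring": 0.20,
}))  # prints True
```

The same check repeated in step 5 closes the loop: after the FDO build, the frontend-bound share should drop, and the retiring share should rise.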
