CONCEPT Cited by 3 sources
Profile-guided optimization¶
Definition¶
Profile-guided optimization (PGO) is a compiler-level optimization technique in which the compiler consumes profile data collected from a running instrumented binary (or a sampling-profiled one) and uses it to make concrete layout, inlining, branch-prediction-hint, and register-allocation decisions that would otherwise rely on static heuristics. It turns the compiler from a guess-what-is-hot optimizer into a know-what-is-hot optimizer.
PGO belongs to the broader feedback-directed optimization family, which includes post-link binary optimizers like LLVM BOLT.
The two-phase compilation flow¶
Classical PGO (clang -fprofile-generate / -fprofile-use; GCC
-fprofile-arcs / -fprofile-use) is a two-phase build:
| Phase | What runs | What it produces |
|---|---|---|
| 1. Instrumented build | Compiler inserts counters | Instrumented binary |
| → Training | Run representative workload | Profile data file (.profraw / .profdata) |
| 2. Optimised recompile | Compiler consumes profile | Optimised binary |
The training workload between phase 1 and phase 2 is the critical input: PGO's effectiveness is bounded by how well the training profile matches production traffic. Verbatim from Redpanda 2026-04-02: "a representative training workload is run against it to produce profile data. This is then used in a second recompilation to enable better, more targeted optimization" (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization).
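The two-phase flow above can be written down as a build recipe. This is a hedged sketch, not the Redpanda build: app.cpp and the workload input are placeholders, and llvm-profdata merge is Clang's standard step for converting raw profiles into the indexed format the compiler reads.

```shell
# Phase 1: instrumented build (compiler inserts execution counters)
clang++ -O2 -fprofile-generate=./prof app.cpp -o app-instrumented

# Training: run a representative workload; each run emits a .profraw file
./app-instrumented < workload-input   # placeholder workload

# Merge raw profiles into the indexed .profdata format
llvm-profdata merge -output=app.profdata ./prof/*.profraw

# Phase 2: optimised recompile driven by the profile
clang++ -O2 -fprofile-use=app.profdata app.cpp -o app-optimized
```

The GCC flow is analogous with -fprofile-generate / -fprofile-use, minus the explicit merge step (GCC writes .gcda files the second compile reads directly).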
Sampling-mode PGO avoids phase 1's instrumentation overhead by
running an uninstrumented binary under a statistical profiler
(Linux perf + AutoFDO, or LLVM's CSSPGO) and deriving the profile
from the samples. Trade-off: zero baseline overhead but
probabilistic coverage vs. instrumented-mode's deterministic
coverage with runtime cost.
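A sampling-mode equivalent looks like this (hedged sketch: create_llvm_prof comes from the AutoFDO toolchain, and LBR support via perf record -b varies by CPU):

```shell
# Build normally, with debug info so samples map back to source
clang++ -O2 -g app.cpp -o app

# Sample the uninstrumented binary under production-like load;
# -b records last-branch-record (LBR) data for accurate edge counts
perf record -b -- ./app < workload-input

# Convert perf samples into an LLVM-readable sample profile (AutoFDO tool)
create_llvm_prof --binary=./app --out=app.prof

# Recompile using the sample profile
clang++ -O2 -g -fprofile-sample-use=app.prof app.cpp -o app-optimized
```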
What PGO actually changes¶
The compiler uses profile data to make four classes of decision differently:
- Basic-block layout — Frequently-executed basic blocks are packed tightly so the CPU's sequential prefetcher fetches the hot path without branching. Cold paths (error handlers, rarely-taken branches) are moved to separate cache lines.
- Hot-cold function splitting — Rarely-executed parts of a function (e.g. if (err) goto fail) can be extracted into a separate .text.cold section, improving i-cache density of the hot remainder.
- Profile-driven inlining — Inlining decisions are driven by call-site frequency, not just callee-size heuristics: hot callees are inlined aggressively, while cold callees stay out-of-line to preserve i-cache budget.
- Branch prediction hints — Taken / not-taken likelihood is encoded into the generated code (the __builtin_expect equivalent), so the likely path becomes the fall-through and cold branches are laid out off the hot path.
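The branch-hint and hot-cold decisions can also be written by hand, which makes them concrete. A minimal C sketch — checked_div and handle_error are illustrative names, not from the source; with PGO the compiler derives the same likelihood and coldness from profile counts instead of these manual annotations:

```c
#include <stdio.h>

/* Manually marked cold: GCC/Clang move this out of the hot code's
 * cache lines — the hand-written analogue of hot-cold splitting. */
__attribute__((cold, noinline))
static int handle_error(void) {
    fprintf(stderr, "divide by zero\n");
    return -1;
}

/* __builtin_expect(cond, 0) tells the compiler the error branch is
 * unlikely, so the non-error path is laid out as the fall-through —
 * the hand-written analogue of a profile-derived branch weight. */
static int checked_div(int a, int b, int *out) {
    if (__builtin_expect(b == 0, 0))
        return handle_error();
    *out = a / b;
    return 0;
}
```

Under -fprofile-use the compiler makes these same layout choices from measured taken / not-taken counts, with no annotations in the source.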
The Redpanda post names the first three directly (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): "the compiler identifies which functions and branches are hit most often, then reorganizes code accordingly by grouping hot blocks together and splitting functions into hot and cold segments. Inlining decisions are also profile-driven, allowing frequently called functions to be inlined more aggressively."
Why PGO primarily helps frontend-bound workloads¶
A frontend-bound workload is one where the CPU stalls waiting for instruction fetch and decode, not data. This happens when the hot code path is larger than the L1 i-cache (typically 32 KB) or scattered across many pages (triggering iTLB misses). PGO's layout and splitting transformations attack both pathologies directly: a smaller hot footprint via cold-code eviction, and tighter locality via hot-block packing.
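Whether a workload is actually frontend-bound can be checked before committing to PGO. A hedged sketch using Linux perf — --topdown needs a reasonably recent perf and CPU support, and generic event aliases vary by machine:

```shell
# Top-down level-1 breakdown: look for a large "frontend bound" share
perf stat --topdown -a -- sleep 10

# Or inspect i-cache / iTLB pressure directly via generic event aliases
perf stat -e L1-icache-load-misses,iTLB-load-misses -p <pid> -- sleep 10
```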
Redpanda's measured TMA data (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization) is the canonical wiki datum:
| Build | Frontend bound | Bad speculation | Retiring | Backend bound |
|---|---|---|---|---|
| Baseline | 51.0% | 10.3% | 30.9% | 7.8% |
| PGO-optimized | 37.9% | 9.5% | 36.6% | 16.0% |
~13 percentage points shifted out of frontend-bound: ~6 went to retiring (useful work) and ~8 to backend-bound (the next bottleneck exposed).
PGO gives little help to backend-bound / memory-bound workloads — those need data-layout work (SoA vs AoS), vectorisation, or cache-line padding. See concepts/cache-locality for the data-side sibling.
Measured wins¶
- Redpanda Streaming 26.1 (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): ~47% lower p999 latency, ~50% lower p50 latency, 15% better CPU reactor utilization on a CPU-intensive small-batch benchmark.
- Meta's fleet-wide FDO pipeline (BOLT + CSSPGO): up to 20% CPU-cycle reduction on top-200 services, which equates to 10-20% fewer servers needed (Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology).
Costs and trade-offs¶
| Axis | Cost |
|---|---|
| Build time | ~2× (two compilation phases) |
| Build complexity | Profile-collection pipeline + storage |
| Profile staleness | New releases need fresh profiles |
| Binary size | Typically grows (~5-10%) from aggressive hot-path inlining |
| Code-review friction | Profile-regeneration gate on merges |
Mitigated by applying PGO only to binaries where it pays off — hot-path services, large C++ codebases, frontend-bound workloads. Verbatim (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): "While the compile-time overhead of PGO is a disadvantage, it can be mitigated by enabling PGO only where it's really needed. Granted, PGO is a proven and widely deployed technology."
PGO vs BOLT¶
PGO is compile-time; BOLT is post-link (operates on the already-compiled binary). They're not mutually exclusive:
| Property | PGO | BOLT |
|---|---|---|
| When | Compile time | After linking |
| Requires recompilation | Yes | No |
| Build-time cost | 2× | Small (seconds to minutes per binary) |
| Stability | Proven (decades in production) | Brittle (Redpanda hit llvm-project#169899) |
| Profile format | .profdata | Its own format, derived from perf samples or instrumented runs |
| Composable | With BOLT on top | With PGO input |
Redpanda chose PGO over BOLT for 26.1 (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization) citing stability: "PGO is a proven and widely deployed technology, so with this in mind and considering some outstanding BOLT bugs, we decided to stick with PGO." Meta runs both (via CSSPGO + BOLT in its FDO pipeline).
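The composability rows amount to running BOLT on an already PGO-optimized binary. A hedged sketch of that pipeline — flag spellings follow the llvm-bolt tools, and app-pgo / workload-input are placeholders; the input binary must be linked with relocations kept (e.g. -Wl,--emit-relocs) so BOLT can rearrange it:

```shell
# Sample the PGO-optimized binary under load (LBR samples via -b)
perf record -b -- ./app-pgo < workload-input

# Convert perf samples to BOLT's profile format
perf2bolt ./app-pgo -p perf.data -o app.fdata

# Post-link optimization pass on the finished binary
llvm-bolt ./app-pgo -o app-pgo-bolt \
    -data=app.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions
```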
See concepts/llvm-bolt-post-link-optimizer for BOLT-specific properties.
Historical note¶
PGO was formalised in research in the early 1980s and saw early wide deployment in commercial compilers such as Intel's ICC. Clang gained instrumentation-based PGO during the LLVM 3.x series; GCC's -fprofile-use support predates it. Meta's scale of deployment (BOLT paper, CGO 2019) demonstrated fleet-level capacity wins that drove broader industry adoption. Redpanda's 2026-04-02 post represents one of the first canonical Tier-3 vendor disclosures of PGO applied to a streaming-broker binary.
Seen in¶
- sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization — canonical wiki source. Redpanda 26.1 deep-dive with TMA before/after + binary-heatmap visualisation + PGO-vs-BOLT trade-off.
- sources/2026-03-31-redpanda-261-delivers-the-industrys-first-adaptable-streaming-engine — launch-post disclosure of the 10-15% CPU efficiency win.
- sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology — fleet-scale precedent via Strobelight → FDO → BOLT + CSSPGO pipeline.
Related¶
- concepts/llvm-bolt-post-link-optimizer — the post-link variant.
- concepts/feedback-directed-optimization — the umbrella family.
- concepts/hot-cold-code-splitting — a PGO-enabled transformation.
- concepts/instruction-cache-locality — the microarchitectural property PGO optimises.
- concepts/frontend-bound-vs-backend-bound-cpu-stall — the TMA axis PGO targets.
- concepts/instrumented-vs-sampling-profile — the profile-collection shapes.
- concepts/tma-top-down-microarchitecture-analysis — the diagnostic methodology that identifies PGO-addressable workloads.
- patterns/pgo-for-frontend-bound-application — the diagnose-then-apply pattern.
- patterns/feedback-directed-optimization-fleet-pipeline — Meta's fleet-scale composition.
- systems/clang / systems/llvm-bolt / systems/meta-bolt-binary-optimizer — the tooling.
- systems/redpanda — Tier-3 canonical deployment.