CONCEPT

Profile-guided optimization

Definition

Profile-guided optimization (PGO) is a compiler-level optimization technique in which the compiler consumes profile data collected from a running instrumented binary (or a sampling-profiled binary) and uses it to make concrete layout, inlining, branch-prediction, and register-allocation decisions that would otherwise rely on heuristics. It turns the compiler from a guess-what-is-hot optimizer into a know-what-is-hot optimizer.

PGO belongs to the broader feedback-directed optimization (FDO) family, which also includes post-link binary optimizers such as LLVM BOLT.

The two-phase compilation flow

Classical PGO (Clang -fprofile-generate / -fprofile-use; GCC -fprofile-generate / -fprofile-use) is a two-phase build:

| Phase | What runs | What it produces |
| --- | --- | --- |
| 1. Instrumented build | Compiler inserts counters | Instrumented binary |
| → Training | Representative workload | Profile data file (.profraw / .profdata) |
| 2. Optimized recompile | Compiler consumes the profile | Optimized binary |

The training workload between phase 1 and phase 2 is the critical input: PGO's effectiveness is bounded by how well the training profile matches production traffic. Verbatim from Redpanda 2026-04-02: "a representative training workload is run against it to produce profile data. This is then used in a second recompilation to enable better, more targeted optimization" (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization).

Sampling-mode PGO avoids phase 1's instrumentation overhead by running an uninstrumented binary under a statistical profiler (Linux perf + AutoFDO, or LLVM's CSSPGO) and deriving the profile from the samples. Trade-off: zero baseline overhead but probabilistic coverage vs. instrumented-mode's deterministic coverage with runtime cost.

What PGO actually changes

The compiler uses profile data to make four classes of decision differently:

  1. Basic-block layout — Frequently-executed basic blocks are packed tightly so the CPU's sequential prefetcher fetches the hot path without branching. Cold paths (error handlers, rarely-taken branches) are moved to separate cache lines.
  2. Hot-cold function splitting — Rarely-executed parts of a function (if (err) goto fail) can be extracted into a separate .text.cold section, improving i-cache density of the hot remainder.
  3. Profile-driven inlining — A function's inlining decision is made based on call-site frequency, not just callee size heuristics. Hot callees inlined aggressively; cold callees kept out-of-line to preserve i-cache budget.
  4. Branch likelihood — taken / not-taken probabilities (the __builtin_expect equivalent) inform code layout and instruction selection, so the statically likely path becomes the straight-line fall-through.

The Redpanda post names the first three directly (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): "the compiler identifies which functions and branches are hit most often, then reorganizes code accordingly by grouping hot blocks together and splitting functions into hot and cold segments. Inlining decisions are also profile-driven, allowing frequently called functions to be inlined more aggressively."

Why PGO primarily helps frontend-bound workloads

A frontend-bound workload is one where the CPU stalls waiting for instruction fetch and decode, not data. This happens when the hot code path is larger than the L1 instruction cache (typically 32 KB) or scattered across many pages (triggering iTLB misses). PGO's layout and splitting transformations attack both pathologies directly: a smaller hot footprint via cold-code eviction, and tighter locality via hot-block packing.

Redpanda's measured TMA data (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization) is the canonical wiki datum:

| Build | Frontend bound | Bad speculation | Retiring | Backend bound |
| --- | --- | --- | --- | --- |
| Baseline | 51.0% | 10.3% | 30.9% | 7.8% |
| PGO-optimized | 37.9% | 9.5% | 36.6% | 16.0% |

13.1 percentage points shifted out of frontend-bound: 5.7 to retiring (useful work) and 8.2 to backend-bound (the next bottleneck, now exposed), with bad speculation down 0.8.

PGO gives little help to backend-bound / memory-bound workloads; those need data-layout work (structure-of-arrays vs. array-of-structures), vectorization, or cache-line padding. See concepts/cache-locality for the data-side sibling.

Costs and trade-offs

| Axis | Cost |
| --- | --- |
| Build time | ~2× (two compilation phases) |
| Build complexity | Profile-collection pipeline + storage |
| Profile staleness | New releases need fresh profiles |
| Binary size | Typically grows (~5-10%) from aggressive hot-path inlining |
| Code-review friction | Profile-regeneration gate on merges |

Mitigated by applying PGO only to binaries where it pays off — hot-path services, large C++ codebases, frontend-bound workloads. Verbatim (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): "While the compile-time overhead of PGO is a disadvantage, it can be mitigated by enabling PGO only where it's really needed. Granted, PGO is a proven and widely deployed technology."

PGO vs BOLT

PGO is compile-time; BOLT is post-link (operates on the already-compiled binary). They're not mutually exclusive:

| Property | PGO | BOLT |
| --- | --- | --- |
| When | Compile time | After linking |
| Requires recompilation | Yes | No |
| Build-time cost | ~2× (full recompile) | Small (seconds to minutes per binary) |
| Stability | Proven (decades in production) | Brittle (Redpanda hit llvm-project#169899) |
| Profile format | .profdata | Its own format, derived from perf or instrumented binaries |
| Composable | With BOLT on top | With PGO input |

Redpanda chose PGO over BOLT for 26.1 (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization) citing stability: "PGO is a proven and widely deployed technology, so with this in mind and considering some outstanding BOLT bugs, we decided to stick with PGO." Meta runs both (via CSSPGO + BOLT in its FDO pipeline).

See concepts/llvm-bolt-post-link-optimizer for BOLT-specific properties.

Historical note

PGO was formalized in research in the early 1980s and first widely deployed in Intel's compiler (ICC). Clang's PGO support landed in LLVM 3.0 (2011); GCC's -fprofile-use predates it. Meta's scale of deployment (BOLT paper, CGO 2019) demonstrated fleet-level capacity wins that drove broader industry adoption. Redpanda's 2026-04-02 post represents one of the first canonical Tier-3 vendor disclosures of PGO applied to a streaming-broker binary.
