CONCEPT Cited by 3 sources
Profile-guided optimization¶
Definition¶
Profile-guided optimization (PGO) is a compiler-level optimization technique in which the compiler consumes profile data collected from a running instrumented binary (or a sampling-profiled one) and uses it to make concrete layout, inlining, branch-prediction-hint, and register-allocation decisions that would otherwise rely on static heuristics. It turns the compiler from a guess-what-is-hot optimizer into a know-what-is-hot optimizer.
PGO belongs to the broader feedback-directed optimization family, which includes post-link binary optimizers like LLVM BOLT.
The two-phase compilation flow¶
Classical PGO (clang -fprofile-generate / -fprofile-use; GCC
-fprofile-arcs / -fprofile-use) is a two-phase build:
| Phase | What runs | What it produces |
|---|---|---|
| 1. Instrumented build | Compiler inserts counters | Instrumented binary |
| → Training | Run representative workload | Profile data file (.profraw / .profdata) |
| 2. Optimised recompile | Compiler consumes profile | Optimised binary |
The training workload between phase 1 and phase 2 is the critical input: PGO's effectiveness is bounded by how well the training profile matches production traffic. Verbatim from Redpanda 2026-04-02: "a representative training workload is run against it to produce profile data. This is then used in a second recompilation to enable better, more targeted optimization" (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization).
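The two-phase flow above can be written down as a build recipe. This is a hedged sketch, not the Redpanda build: app.cpp and the workload input are placeholders, and llvm-profdata merge is Clang's standard step for converting raw profiles into the indexed format the compiler reads.

```shell
# Phase 1: instrumented build (compiler inserts execution counters)
clang++ -O2 -fprofile-generate=./prof app.cpp -o app-instrumented

# Training: run a representative workload; each run emits a .profraw file
./app-instrumented < workload-input   # placeholder workload

# Merge raw profiles into the indexed .profdata format
llvm-profdata merge -output=app.profdata ./prof/*.profraw

# Phase 2: optimised recompile driven by the profile
clang++ -O2 -fprofile-use=app.profdata app.cpp -o app-optimized
```

The GCC flow is analogous with -fprofile-generate / -fprofile-use, minus the explicit merge step (GCC writes .gcda files the second compile reads directly).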
Sampling-mode PGO avoids phase 1's instrumentation overhead by
running an uninstrumented binary under a statistical profiler
(Linux perf + AutoFDO, or LLVM's CSSPGO) and deriving the profile
from the samples. Trade-off: zero baseline overhead but
probabilistic coverage vs. instrumented-mode's deterministic
coverage with runtime cost.
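A sampling-mode equivalent looks like this (hedged sketch: create_llvm_prof comes from the AutoFDO toolchain, and LBR support via perf record -b varies by CPU):

```shell
# Build normally, with debug info so samples map back to source
clang++ -O2 -g app.cpp -o app

# Sample the uninstrumented binary under production-like load;
# -b records last-branch-record (LBR) data for accurate edge counts
perf record -b -- ./app < workload-input

# Convert perf samples into an LLVM-readable sample profile (AutoFDO tool)
create_llvm_prof --binary=./app --out=app.prof

# Recompile using the sample profile
clang++ -O2 -g -fprofile-sample-use=app.prof app.cpp -o app-optimized
```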
What PGO actually changes¶
The compiler uses profile data to make four classes of decision differently:
- Basic-block layout — Frequently-executed basic blocks are packed tightly so the CPU's sequential prefetcher fetches the hot path without branching. Cold paths (error handlers, rarely-taken branches) are moved to separate cache lines.
- Hot-cold function splitting — Rarely-executed parts of a function (e.g. if (err) goto fail) can be extracted into a separate .text.cold section, improving i-cache density of the hot remainder.
- Profile-driven inlining — Inlining decisions are driven by call-site frequency, not just callee-size heuristics: hot callees are inlined aggressively, while cold callees stay out-of-line to preserve i-cache budget.
- Branch prediction hints — Taken / not-taken likelihood is encoded into the generated code (the __builtin_expect equivalent), so the likely path becomes the fall-through and cold branches are laid out off the hot path.
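The branch-hint and hot-cold decisions can also be written by hand, which makes them concrete. A minimal C sketch — checked_div and handle_error are illustrative names, not from the source; with PGO the compiler derives the same likelihood and coldness from profile counts instead of these manual annotations:

```c
#include <stdio.h>

/* Manually marked cold: GCC/Clang move this out of the hot code's
 * cache lines — the hand-written analogue of hot-cold splitting. */
__attribute__((cold, noinline))
static int handle_error(void) {
    fprintf(stderr, "divide by zero\n");
    return -1;
}

/* __builtin_expect(cond, 0) tells the compiler the error branch is
 * unlikely, so the non-error path is laid out as the fall-through —
 * the hand-written analogue of a profile-derived branch weight. */
static int checked_div(int a, int b, int *out) {
    if (__builtin_expect(b == 0, 0))
        return handle_error();
    *out = a / b;
    return 0;
}
```

Under -fprofile-use the compiler makes these same layout choices from measured taken / not-taken counts, with no annotations in the source.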
The Redpanda post names the first three directly (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): "the compiler identifies which functions and branches are hit most often, then reorganizes code accordingly by grouping hot blocks together and splitting functions into hot and cold segments. Inlining decisions are also profile-driven, allowing frequently called functions to be inlined more aggressively."
Why PGO primarily helps frontend-bound workloads¶
A frontend-bound workload is one where the CPU stalls waiting for instruction fetch and decode, not data. This happens when the hot code path is larger than the L1 i-cache (typically 32 KB) or scattered across many pages (triggering iTLB misses). PGO's layout and splitting transformations attack both pathologies directly: a smaller hot footprint via cold-code eviction, and tighter locality via hot-block packing.
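Whether a workload is actually frontend-bound can be checked before committing to PGO. A hedged sketch using Linux perf — --topdown needs a reasonably recent perf and CPU support, and generic event aliases vary by machine:

```shell
# Top-down level-1 breakdown: look for a large "frontend bound" share
perf stat --topdown -a -- sleep 10

# Or inspect i-cache / iTLB pressure directly via generic event aliases
perf stat -e L1-icache-load-misses,iTLB-load-misses -p <pid> -- sleep 10
```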
Redpanda's measured TMA data (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization) is the canonical wiki datum:
| Build | Frontend bound | Bad speculation | Retiring | Backend bound |
|---|---|---|---|---|
| Baseline | 51.0% | 10.3% | 30.9% | 7.8% |
| PGO-optimized | 37.9% | 9.5% | 36.6% | 16.0% |
~13 percentage points shifted out of frontend-bound: ~6 went to retiring (useful work) and ~8 to backend-bound (the next bottleneck exposed).
PGO gives little help to backend-bound / memory-bound workloads — those need data-layout work (SoA vs AoS), vectorisation, or cache-line padding. See concepts/cache-locality for the data-side sibling.
Measured wins¶
- Redpanda Streaming 26.1 (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): ~47% lower p999 latency, ~50% lower p50 latency, 15% better CPU reactor utilization on a CPU-intensive small-batch benchmark.
- Meta's fleet-wide FDO pipeline (BOLT + CSSPGO): up to 20% CPU-cycle reduction on top-200 services, which equates to 10-20% fewer servers needed (Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology).
Costs and trade-offs¶
| Axis | Cost |
|---|---|
| Build time | ~2× (two compilation phases) |
| Build complexity | Profile-collection pipeline + storage |
| Profile staleness | New releases need fresh profiles |
| Binary size | Typically grows (~5-10%) from aggressive hot-path inlining |
| Code-review friction | Profile-regeneration gate on merges |
Mitigated by applying PGO only to binaries where it pays off — hot-path services, large C++ codebases, frontend-bound workloads. Verbatim (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization): "While the compile-time overhead of PGO is a disadvantage, it can be mitigated by enabling PGO only where it's really needed. Granted, PGO is a proven and widely deployed technology."
PGO vs BOLT¶
PGO is compile-time; BOLT is post-link (operates on the already-compiled binary). They're not mutually exclusive:
| Property | PGO | BOLT |
|---|---|---|
| When | Compile time | After linking |
| Requires recompilation | Yes | No |
| Build-time cost | 2× | Small (seconds to minutes per binary) |
| Stability | Proven (decades in production) | Brittle (Redpanda hit llvm-project#169899) |
| Profile format | .profdata | Its own format, derived from perf samples or instrumented runs |
| Composable | With BOLT on top | With PGO input |
Redpanda chose PGO over BOLT for 26.1 (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization) citing stability: "PGO is a proven and widely deployed technology, so with this in mind and considering some outstanding BOLT bugs, we decided to stick with PGO." Meta runs both (via CSSPGO + BOLT in its FDO pipeline).
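The composability rows amount to running BOLT on an already PGO-optimized binary. A hedged sketch of that pipeline — flag spellings follow the llvm-bolt tools, and app-pgo / workload-input are placeholders; the input binary must be linked with relocations kept (e.g. -Wl,--emit-relocs) so BOLT can rearrange it:

```shell
# Sample the PGO-optimized binary under load (LBR samples via -b)
perf record -b -- ./app-pgo < workload-input

# Convert perf samples to BOLT's profile format
perf2bolt ./app-pgo -p perf.data -o app.fdata

# Post-link optimization pass on the finished binary
llvm-bolt ./app-pgo -o app-pgo-bolt \
    -data=app.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions
```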
See concepts/llvm-bolt-post-link-optimizer for BOLT-specific properties.
Historical note¶
PGO was formalised in research in the early 1980s and saw early wide deployment in commercial compilers such as Intel's ICC. Clang gained instrumentation-based PGO during the LLVM 3.x series; GCC's -fprofile-use support predates it. Meta's scale of deployment (BOLT paper, CGO 2019) demonstrated fleet-level capacity wins that drove broader industry adoption. Redpanda's 2026-04-02 post represents one of the first canonical Tier-3 vendor disclosures of PGO applied to a streaming-broker binary.
Seen in¶
- sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization — canonical wiki source. Redpanda 26.1 deep-dive with TMA before/after + binary-heatmap visualisation + PGO-vs-BOLT trade-off.
- sources/2026-03-31-redpanda-261-delivers-the-industrys-first-adaptable-streaming-engine — launch-post disclosure of the 10-15% CPU efficiency win.
- sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology — fleet-scale precedent via Strobelight → FDO → BOLT + CSSPGO pipeline.
Related¶
- concepts/llvm-bolt-post-link-optimizer — the post-link variant.
- concepts/feedback-directed-optimization — the umbrella family.
- concepts/hot-cold-code-splitting — a PGO-enabled transformation.
- concepts/instruction-cache-locality — the microarchitectural property PGO optimises.
- concepts/frontend-bound-vs-backend-bound-cpu-stall — the TMA axis PGO targets.
- concepts/instrumented-vs-sampling-profile — the profile-collection shapes.
- concepts/tma-top-down-microarchitecture-analysis — the diagnostic methodology that identifies PGO-addressable workloads.
- patterns/pgo-for-frontend-bound-application — the diagnose-then-apply pattern.
- patterns/feedback-directed-optimization-fleet-pipeline — Meta's fleet-scale composition.
- systems/clang / systems/llvm-bolt / systems/meta-bolt-binary-optimizer — the tooling.
- systems/redpanda — Tier-3 canonical deployment.