PGO for frontend-bound applications
Context
A large C++ (or other compiled-language) application with many hot code paths spread across a large binary, exhibiting:
- High proportion of CPU cycles spent in instruction fetch / decode stalls rather than useful work.
- TMA diagnosis: high frontend-bound percentage (>25% is notable; >40% is a flashing red light; 51% is "your hot path is catastrophically scattered").
- Typical triggers: streaming brokers, databases, application servers, interpreters, polymorphic / virtual-call-heavy code, microservice stacks with many small RPC handlers.
The pattern applies whenever instruction-cache locality is the binding constraint — when the compiler's static heuristics for inlining, basic-block layout, and hot-cold partitioning are measurably wrong.
Problem
Compiler heuristics assume uniform execution frequency across control-flow paths. Real workloads are heavily skewed — a handful of paths dominate, and the compiler's default layout and inlining choices optimise the wrong ones. Symptoms:
- Hot path sprayed across many functions → i-cache thrashing.
- Cold error-handling blocks inline in hot functions → i-cache capacity wasted.
- Rare functions inlined aggressively → hot path's code footprint bloated.
- Indirect calls defeat sequential prefetching.
Manual hand-tuning doesn't scale — the hot path is too big to reason about function-by-function.
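These symptoms are directly observable with perf counters before running a full TMA breakdown. A minimal sketch, assuming `perf` is installed; `./app` and its flag are placeholder names, not from the sources:

```shell
# Count instruction-fetch misses alongside total instructions.
# A high L1-icache / iTLB miss rate per kilo-instruction (MPKI) is the
# classic signature of a hot path sprayed across a large binary.
perf stat -e instructions,L1-icache-load-misses,iTLB-load-misses \
    -- ./app --benchmark

# Rule of thumb: divide misses by (instructions / 1000). Double-digit
# i-cache MPKI corroborates a frontend-bound TMA diagnosis.
```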
Solution
Collect execution profile data; feed it to the compiler (PGO) or post-link optimiser (BOLT); rebuild. Concrete steps:
- Measure baseline with TMA. Run `perf stat --topdown --td-level 1` on the production workload (or a representative benchmark). Confirm frontend-bound is the dominant stall class (see patterns/tma-guided-optimization-target-selection).
- Choose PGO or BOLT based on the team's constraints:
| Property | Choose PGO | Choose BOLT |
|---|---|---|
| Stability-sensitive | ✅ | ❌ (brittle per Redpanda) |
| Large-codebase build-time-sensitive | ❌ (2× compile) | ✅ |
| Fleet-wide continuous profiling available | Either | ✅ |
| Two-phase build pipeline tolerable | ✅ | ❌ |
| LLVM expert on the team | Either | ✅ |
Redpanda chose PGO for 26.1, citing stability (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization). Meta runs both — CSSPGO at compile time, BOLT post-link (Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology).
- Set up profile collection. Pick instrumented or sampling. Sampling is fleet-friendly; instrumented is simpler to bootstrap in a staging environment.
- Run a representative workload against the instrumented (or sampled) binary. Coverage of the production distribution matters — if the training workload doesn't hit a hot path, the compiler won't know to optimise it.
- Rebuild with the profile. Clang: `-fprofile-use=<path>`. BOLT: `llvm-bolt` post-link invocation.
- Re-measure TMA. Confirm the frontend-bound percentage dropped. The recovered cycles split between retiring (good work) and the next bottleneck class (expected).
- Iterate. Close the loop — newer releases need fresh profiles; the pipeline should be continuously fed.
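The steps above can be sketched end-to-end for Clang's instrumented PGO. This is a minimal sketch, not Redpanda's actual pipeline; `app.cc` and `training-input` are placeholder names, and it assumes `clang++`, `llvm-profdata`, and a recent `perf` are on PATH:

```shell
# 1. Baseline TMA: confirm frontend-bound dominates.
perf stat --topdown --td-level 1 -- ./app

# 2-3. Build an instrumented binary; raw profiles will land in prof/.
clang++ -O2 -fprofile-generate=prof app.cc -o app-instrumented

# 4. Train on a representative workload (%p = PID, one file per process).
LLVM_PROFILE_FILE=prof/%p.profraw ./app-instrumented < training-input

# Merge raw profiles into the indexed format -fprofile-use expects.
llvm-profdata merge -output=prof/app.profdata prof/*.profraw

# 5. Rebuild with the profile driving inlining and basic-block layout.
clang++ -O2 -fprofile-use=prof/app.profdata app.cc -o app-pgo

# 6. Re-measure: frontend-bound share should drop, retiring should rise.
perf stat --topdown --td-level 1 -- ./app-pgo

# 7. Iterate: regenerate the profile for every release.
```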
Expected results
| Workload class | Typical PGO win |
|---|---|
| C++ streaming broker, small-batch | 10-15% CPU, 47% p999 latency reduction (Redpanda 26.1) |
| C++ fleet service, broad workload | 5-15% CPU; 10-20% fewer servers at Meta top-200 scale |
| Interpreter / VM | 10-30% (very hot-cold-skewed) |
| Microservice stack | 5-10% |
Canonical exemplar: Redpanda 26.1
- Baseline TMA: 51% frontend-bound, 30.9% retiring.
- PGO-optimized TMA: 37.9% frontend-bound, 36.6% retiring.
- Wall-clock wins: 47% p999 latency reduction, ~50% p50 latency reduction, 15% CPU reactor utilization reduction.
- Mechanism: hot-block grouping + hot-cold function splitting + profile-driven inlining, confirmed via BOLT-generated binary heatmap visualisation (hot code packed tightly at the start of the binary, cold code in a separate region).
- Amplification: 15% CPU reduction → ~47% p999 latency reduction via the batching-under-saturation dynamic — shorter broker queue dominates end-to-end latency.
(Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization)
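The BOLT path (and the heatmap visualisation mentioned above) looks roughly like the following. A sketch assuming LBR-capable hardware and `perf2bolt`, `llvm-bolt`, and `llvm-bolt-heatmap` from llvm-project on PATH; `./app` is a placeholder binary name:

```shell
# Sample with last-branch-record data, which BOLT needs to
# reconstruct basic-block execution counts.
perf record -e cycles:u -j any,u -o perf.data -- ./app

# Convert the perf profile into BOLT's .fdata format.
perf2bolt -p perf.data -o app.fdata ./app

# Post-link rewrite: reorder basic blocks, split hot/cold function
# halves, and pack hot code together in the text section.
llvm-bolt ./app -o app.bolt -data=app.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold

# Visual confirmation: render the address-space heatmap of the
# profiled binary; re-record on app.bolt to see the packed layout.
llvm-bolt-heatmap -p perf.data -o heatmap ./app
```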
Canonical exemplar: Meta fleet
- Binaries: top-200 services (C++ across Meta's monorepo).
- Profile source: fleet-wide continuous sampling via Strobelight LBR data.
- Consumers: CSSPGO at compile time + BOLT post-link.
- Wins: up to 20% of CPU cycles recovered, translating to 10-20% fewer servers. At hyperscale, this is the substrate for "profiling pays for itself."
(Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology)
Anti-patterns to avoid
- Applying PGO without measuring first. Without TMA confirmation that the workload is frontend-bound, PGO's win may be small or negative (build-time cost without runtime payoff).
- Using a non-representative training workload. PGO's win is bounded by profile-coverage overlap with production.
- Forgetting profile regeneration. Stale profiles from an older release mis-optimise new hot paths.
- Applying BOLT without the LLVM expertise to debug binary-modification bugs. Redpanda hit `llvm-project#169899` and chose PGO instead.
- Treating PGO as a one-time gain. FDO works when the loop is continuous — collect, recompile, ship, measure, repeat.
Trade-offs
- Build time: PGO ~2× compile; BOLT much cheaper but dependent on a working profile.
- Build complexity: profile-collection pipeline must be maintained.
- Binary size: typically grows 5-10% from aggressive hot-path inlining.
- Debug symbol complexity: hot-cold splitting can confuse debuggers unless tooled for it.
- Stability: BOLT is known-brittle; PGO is stable.
Seen in
- sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization — canonical wiki pattern instance. Redpanda Streaming 26.1 PGO rollout with full TMA methodology disclosure + measured wins + PGO-vs-BOLT trade-off.
- sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology — Meta fleet-scale instance via the Strobelight → CSSPGO + BOLT pipeline.
Related
- patterns/measurement-driven-micro-optimization — the JVM / JDK-Vector-API altitude sibling.
- patterns/tma-guided-optimization-target-selection — the TMA-first methodology this pattern composes with.
- patterns/feedback-directed-optimization-fleet-pipeline — the fleet-scale composition.
- concepts/profile-guided-optimization / concepts/llvm-bolt-post-link-optimizer — the mechanisms.
- concepts/frontend-bound-vs-backend-bound-cpu-stall / concepts/tma-top-down-microarchitecture-analysis — the diagnostic axis.
- concepts/hot-cold-code-splitting / concepts/instruction-cache-locality — the transformations.
- concepts/feedback-directed-optimization — the umbrella.
- concepts/instrumented-vs-sampling-profile — profile-collection shapes.
- concepts/batching-latency-tradeoff — the amplifier that turns CPU wins into tail-latency wins.
- systems/redpanda / systems/clang / systems/llvm-bolt — the tooling.