
PATTERN

PGO for frontend-bound application

Context

A large C++ (or other compiled-language) application with many hot code paths spread across a large binary, exhibiting:

  • High proportion of CPU cycles spent in instruction fetch / decode stalls rather than useful work.
  • TMA diagnosis: high frontend-bound percentage (>25% is notable; >40% is a flashing red light; 51% is "your hot path is catastrophically scattered").
  • Typical triggers: streaming brokers, databases, application servers, interpreters, polymorphic / virtual-call-heavy code, microservice stacks with many small RPC handlers.

The pattern applies whenever instruction-cache locality is the binding constraint — when the compiler's static heuristics for inlining, basic-block layout, and hot-cold partitioning are measurably wrong.

Problem

Compiler heuristics assume uniform execution frequency across control-flow paths. Real workloads are heavily skewed — a handful of paths dominate, and the compiler's default layout and inlining choices optimise the wrong ones. Symptoms:

  • Hot path sprayed across many functions → i-cache thrashing.
  • Cold error-handling blocks inline in hot functions → i-cache capacity wasted.
  • Rare functions inlined aggressively → hot path's code footprint bloated.
  • Indirect calls defeat sequential prefetching.

Manual hand-tuning doesn't scale — the hot path is too big to reason about function-by-function.
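One quick way to see the thrash directly is to count instruction-cache misses with perf (a sketch; the binary and workload names are hypothetical, and the generic cache event aliases vary by CPU and kernel — check `perf list` on the target machine):

```shell
# A high L1-icache / iTLB miss rate per instruction on the hot path
# corroborates a frontend-bound TMA reading.
perf stat -e instructions,L1-icache-load-misses,iTLB-load-misses \
  -- ./app --workload=representative
```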

Solution

Collect execution profile data; feed it to the compiler (PGO) or post-link optimiser (BOLT); rebuild. Concrete steps:

  1. Measure baseline with TMA. Run perf stat --topdown --td-level 1 on the production workload (or a representative benchmark). Confirm frontend-bound is the dominant stall class (see patterns/tma-guided-optimization-target-selection).

  2. Choose PGO or BOLT based on the team's constraints:

| Property | Choose PGO | Choose BOLT |
|---|---|---|
| Stability-sensitive | ✅ | ❌ (brittle per Redpanda) |
| Large-codebase build-time-sensitive | ❌ (2× compile) | ✅ |
| Fleet-wide continuous profiling available | Either | Either |
| Two-phase build pipeline tolerable | — | — |
| LLVM expert on the team | Either | Either |

Redpanda chose PGO for 26.1 citing stability (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization). Meta runs both — CSSPGO at compile-time, BOLT post-link (Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology).
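The two pipelines differ mainly in where the profile is consumed. A hedged sketch of each invocation (binary and file names hypothetical; the BOLT flags reflect current llvm-bolt documentation — verify against your toolchain version):

```shell
# PGO: the *compiler* consumes the profile at rebuild time.
clang++ -O2 -fprofile-use=app.profdata -o app src/*.cc

# BOLT: the *post-link optimizer* consumes a perf LBR profile and rewrites
# the already-linked binary: record, convert, then optimize.
perf record -e cycles:u -j any,u -o perf.data -- ./app --workload=representative
perf2bolt -p perf.data -o perf.fdata ./app
llvm-bolt ./app -o app.bolt -data=perf.fdata \
  -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions
```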

  3. Set up profile collection. Pick instrumented or sampling. Sampling is fleet-friendly; instrumented is simpler to bootstrap in a staging environment.

  4. Run a representative workload against the instrumented (or sampled) binary. Coverage of the production distribution matters — if the training workload doesn't hit a hot path, the compiler won't know to optimise it.

  5. Rebuild with the profile. Clang: -fprofile-use=<path>. BOLT: llvm-bolt post-link invocation.

  6. Re-measure TMA. Confirm the frontend-bound percentage dropped. The recovered cycles split between retiring (good work) and the next bottleneck class (expected).

  7. Iterate. Close the loop — newer releases need fresh profiles; the pipeline should be continuously fed.
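The measurement gate in the first and second-to-last steps can be automated with a small check like this (a sketch: the embedded perf output is a hypothetical sample of `perf stat --topdown --td-level 1`, using the baseline numbers from the Redpanda exemplar — real field layout varies by kernel version and CPU):

```shell
# Emit a hypothetical level-1 TMA breakdown as perf might print it.
sample_topdown() {
cat <<'EOF'
retiring  bad-speculation  frontend-bound  backend-bound
30.9%     6.0%             51.0%           12.1%
EOF
}

# Pull the frontend-bound column (3rd field of the data row), strip the '%'.
fe_bound=$(sample_topdown | awk 'NR==2 {gsub(/%/,"",$3); print $3}')

# Thresholds from the pattern text: >25% notable, >40% a flashing red light.
if awk "BEGIN {exit !($fe_bound > 40)}"; then
  echo "frontend-bound ${fe_bound}%: strong PGO/BOLT candidate"
elif awk "BEGIN {exit !($fe_bound > 25)}"; then
  echo "frontend-bound ${fe_bound}%: worth investigating"
else
  echo "frontend-bound ${fe_bound}%: look elsewhere first"
fi
```

Running the same check against the post-PGO binary closes the loop: the script should report a lower percentage and, ideally, drop below the 40% threshold.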

Expected results

| Workload class | Typical PGO win |
|---|---|
| C++ streaming broker, small-batch | 10-15% CPU, 47% p999 latency (Redpanda 26.1) |
| C++ fleet service, broad workload | 5-15% CPU; 10-20% fewer servers at Meta top-200 scale |
| Interpreter / VM | 10-30% (very hot-cold-skewed) |
| Microservice stack | 5-10% |

Canonical exemplar: Redpanda 26.1

  • Baseline TMA: 51% frontend-bound, 30.9% retiring.
  • PGO-optimized TMA: 37.9% frontend-bound, 36.6% retiring.
  • Wall-clock wins: 47% p999 latency reduction, ~50% p50 latency reduction, 15% CPU reactor utilization reduction.
  • Mechanism: hot-block grouping + hot-cold function splitting + profile-driven inlining, confirmed via BOLT-generated binary heatmap visualisation (hot code packed tightly at the start of the binary, cold code in a separate region).
  • Amplification: 15% CPU reduction → ~47% p999 latency reduction via the batching-under-saturation dynamic — shorter broker queue dominates end-to-end latency.

(Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization)
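The amplification bullet is consistent with elementary queueing behaviour. A hedged M/M/1 illustration (the utilization figure ρ = 0.9 is an assumption for the sketch, not from the source):

```latex
% M/M/1 mean sojourn time, service rate \mu, arrival rate \lambda, \rho = \lambda/\mu.
% Assume \rho = 0.9. A 15% cut in per-request CPU raises the service rate:
%   \mu' = \mu / 0.85 \approx 1.176\,\mu
% Before: W  = 1/(\mu - 0.9\mu)            = 10.0/\mu
% After:  W' = 1/(1.176\mu - 0.9\mu) \approx 3.62/\mu   (~64% reduction)
W = \frac{1}{\mu - \lambda}, \qquad
\frac{W'}{W} = \frac{\mu - \lambda}{\mu' - \lambda}
             = \frac{0.1\,\mu}{0.276\,\mu} \approx 0.36
```

Near saturation, the denominator μ − λ is small, so a modest capacity gain shrinks queueing delay disproportionately — qualitatively matching a 15% CPU win turning into a ~47% p999 reduction.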

Canonical exemplar: Meta fleet

  • Binaries: top-200 services (C++ across Meta's monorepo).
  • Profile source: fleet-wide continuous sampling via Strobelight LBR data.
  • Consumers: CSSPGO at compile time + BOLT post-link.
  • Wins: up to 20% CPU cycles = 10-20% fewer servers. At hyperscale, this is the substrate for "profiling pays for itself."

(Source: sources/2025-03-07-meta-strobelight-a-profiling-service-built-on-open-source-technology)

Anti-patterns to avoid

  • Applying PGO without measuring first. Without TMA confirmation that the workload is frontend-bound, PGO's win may be small or negative (build-time cost without runtime payoff).
  • Using a non-representative training workload. PGO's win is bounded by profile-coverage overlap with production.
  • Forgetting profile regeneration. Stale profiles from an older release mis-optimise new hot paths.
  • Applying BOLT without the LLVM expertise to debug binary-modification bugs. Redpanda hit llvm-project#169899 and chose PGO instead.
  • Treating PGO as a one-time gain. FDO works when the loop is continuous — collect, recompile, ship, measure, repeat.

Trade-offs

  • Build time: PGO ~2× compile; BOLT much cheaper but dependent on a working profile.
  • Build complexity: profile-collection pipeline must be maintained.
  • Binary size: typically grows 5-10% from aggressive hot-path inlining.
  • Debug symbol complexity: hot-cold splitting can confuse debuggers unless tooled for it.
  • Stability: BOLT is known-brittle; PGO is stable.

Seen in
