PATTERN
TMA-guided optimization target selection¶
Context¶
A team has a CPU-bound service or binary that needs performance work. The engineering cost of each optimisation pass is high (developer time, stability risk, build-pipeline disruption), so picking the wrong pass — vectorising a frontend-bound workload, reorganising memory layout for a core-bound kernel — wastes effort for zero runtime return.
Problem¶
Traditional profiling (call-graph flame graphs, per-function time) tells you what functions are hot. It does not tell you why those functions are slow at the CPU level. Without that second piece of information:
- Optimisation becomes guesswork.
- The team may apply passes that don't move the needle.
- Worse — the team may apply passes that regress other properties (binary size, stability, maintainability) for no throughput win.
Solution¶
Use top-down microarchitecture analysis (TMA) as the diagnostic step before picking an optimisation pass. TMA partitions CPU cycles into four mutually exclusive categories (retiring / bad-speculation / frontend-bound / backend-bound) via hardware performance counters. The dominant stall category determines the optimisation family.
Steps¶
1. Collect TMA data. Run `perf stat --topdown --td-level 1` against a representative production workload (or the pre-existing regression benchmark).
2. Read the distribution. The four categories sum to ~100%. If one category dominates (>25-30%), that's the target.
3. Map category → optimisation family:

| Dominant TMA category | Optimisation family | Canonical tooling |
|---|---|---|
| Frontend bound | Code layout | PGO, BOLT, hot-cold splitting |
| Bad speculation | Branch-prediction improvement | Profile-driven branch hints, restructured predicates |
| Backend / memory-bound | Data layout / prefetch | SoA / cache-line packing, SW prefetch, loop interchange |
| Backend / core-bound | ILP / vectorisation | SIMD, FMA, loop unrolling |
| Retiring | Already high → diminishing returns | Look to other metrics (scalability, I/O) |

4. Drill down if needed. At `--td-level 2`, the dominant category splits into sub-causes (e.g. frontend-bound → fetch-latency vs fetch-bandwidth; memory-bound → DRAM vs L3 vs L2). The sub-cause narrows the pass further.
5. Apply the matching pass. Ship the change.
6. Re-measure TMA. Two outcomes:
   - Dominant category decreased and retiring increased → success. Some cycles also move to the next bottleneck category (expected; "resolving one bottleneck often reveals the next" per Redpanda 2026-04-02).
   - No change in the dominant category → the pass didn't work; re-examine.
7. Iterate. The next TMA run tells you the next pass.
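Steps 2-3 can be sketched as a small decision helper. A minimal sketch in Python: the category names, the >25-30% dominance threshold, and the mapping table come from the steps above; the function and variable names, and the example split of the non-dominant categories, are illustrative:

```python
# Sketch of TMA steps 2-3: find the dominant level-1 category and map it
# to an optimisation family. Thresholds and category names follow the
# pattern text; everything else is illustrative.

# Category -> optimisation family, from the mapping table above.
PASS_FAMILY = {
    "frontend_bound": "code layout (PGO, BOLT, hot-cold splitting)",
    "bad_speculation": "branch-prediction improvement",
    "backend_bound": "data layout / prefetch or ILP / vectorisation (drill to td-level 2)",
    "retiring": "diminishing returns: look at scalability / I/O instead",
}

def pick_pass(tma: dict, dominance_threshold: float = 25.0) -> str:
    """Given level-1 TMA percentages (summing to ~100), return the
    optimisation family for the dominant stall category, or a prompt
    to drill down when nothing clearly dominates."""
    category, share = max(tma.items(), key=lambda kv: kv[1])
    if category != "retiring" and share < dominance_threshold:
        return "no dominant category: drill down with --td-level 2"
    return PASS_FAMILY[category]

# Redpanda's baseline: 51% frontend-bound, 30.9% retiring (from the source);
# the remainder is split illustratively between the other two categories.
baseline = {
    "frontend_bound": 51.0,
    "retiring": 30.9,
    "backend_bound": 13.0,
    "bad_speculation": 5.1,
}
print(pick_pass(baseline))  # frontend bound dominates -> code-layout family
```

Feeding the helper a distribution with no clear winner returns the drill-down prompt instead, which mirrors step 4.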
The Redpanda canonical exemplar¶
Redpanda Streaming 2026-04-02 (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):
| Step | Result |
|---|---|
| 1. Collect baseline TMA | 51% frontend-bound, 30.9% retiring |
| 2. Read | Frontend bound dominates — 50% is "definitely on the higher end" |
| 3. Map | Frontend bound → code-layout optimisation → PGO |
| 4. Drill (skipped — chose pass based on top-level) | — |
| 5. Apply | Clang PGO two-phase compilation |
| 6. Re-measure | 37.9% frontend-bound, 36.6% retiring (6 pts to retiring, 7 pts to backend-bound) |
| 7. Iterate | Next pass: address backend-bound stalls |
The structural payoff: at step 3, a different diagnosis would have led to a different (wasted) pass. Vectorising a frontend-bound workload doesn't help — the ALUs sit idle waiting for instructions, not data. TMA made the right pass obvious.
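The cycle accounting in the re-measure row can be checked directly from the two published distributions. A quick arithmetic sketch (only the frontend-bound and retiring percentages are from the source): the ~13-point frontend drop splits into roughly 6 points of extra useful work and 7 points that moved to the next bottleneck:

```python
# Frontend-bound and retiring percentages before and after PGO,
# as published in the Redpanda table above.
before = {"frontend_bound": 51.0, "retiring": 30.9}
after = {"frontend_bound": 37.9, "retiring": 36.6}

freed = before["frontend_bound"] - after["frontend_bound"]  # pts freed from frontend stalls
to_retiring = after["retiring"] - before["retiring"]        # pts converted to useful work
to_next_bottleneck = freed - to_retiring                    # pts shifted to backend-bound et al.

print(f"freed {freed:.1f} pts: {to_retiring:.1f} to retiring, "
      f"{to_next_bottleneck:.1f} to the next bottleneck")
# freed 13.1 pts: 5.7 to retiring, 7.4 to the next bottleneck
```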
Compared to USE method¶
The USE method (Brendan Gregg) is a parallel discipline at the system-resource altitude — utilisation / saturation / errors per resource (CPU, memory, disk, network). TMA-guided optimisation is the drill-in after USE identifies CPU as the saturated resource. The two compose:
- USE → CPU is at 95% utilisation → CPU is the target.
- TMA → CPU is 51% frontend-bound → code layout is the target.
- Apply PGO → measure.
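The two-stage triage composes into one small function. A hedged sketch: the 95% utilisation and 51% frontend-bound figures come from the bullets above; the function name and the 80% utilisation cutoff are illustrative assumptions, not part of either methodology:

```python
# Sketch of the USE -> TMA composition: USE decides whether the CPU is
# the saturated resource at all; TMA then picks the optimisation family
# within the CPU. The 80% cutoff is an illustrative assumption.
def triage(cpu_utilisation: float, tma: dict) -> str:
    if cpu_utilisation < 80.0:
        return "CPU not saturated: apply USE to the other resources first"
    dominant = max(tma, key=tma.get)
    return f"CPU is the target; {dominant} dominates"

print(triage(95.0, {"frontend_bound": 51.0, "retiring": 30.9,
                    "backend_bound": 13.0, "bad_speculation": 5.1}))
# CPU is the target; frontend_bound dominates
```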
See patterns/utilization-saturation-errors-triage and patterns/sixty-second-performance-checklist for Gregg's complementary patterns.
Anti-patterns¶
- "Flamegraph says `foo()` is hot, so let's vectorise `foo()`." Without TMA, you don't know *why* `foo()` is hot. It might be frontend-bound (vectorisation won't help).
- "Our rivals vectorised; we should too." Cargo-culting an optimisation pass is expensive when the pass doesn't match your workload's TMA profile.
- "PGO gave Meta a 20% win; let's turn it on." PGO's win is proportional to the frontend-bound percentage of the target workload. A backend-bound workload gets minor or zero PGO benefit.
- Measuring once, then optimising forever. Profiles go stale across releases; TMA should be re-run with each major code change.
Trade-offs¶
- TMA requires PMU support on the target CPU. Most x86 CPUs since ~2008, ARM Neoverse, and AMD Zen support it; embedded / older chips may not.
- TMA event names differ across microarchitectures; scripts need porting.
- Sampling TMA under load can perturb the system (small overhead); validate before drawing conclusions on extreme-tail-sensitive workloads.
Seen in¶
- sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization — canonical wiki source. `perf stat --topdown --td-level 1` output used as the diagnostic that picks PGO as the right pass for Redpanda Streaming's small-batch workload.
Related¶
- patterns/pgo-for-frontend-bound-application — the downstream PGO application when TMA points at frontend-bound.
- patterns/measurement-driven-micro-optimization — the JVM / JDK-Vector-API altitude sibling (JMH + flamegraph).
- patterns/utilization-saturation-errors-triage — the USE method's pattern form.
- patterns/sixty-second-performance-checklist — Gregg's first-60-seconds Linux performance triage.
- concepts/tma-top-down-microarchitecture-analysis — the methodology.
- concepts/frontend-bound-vs-backend-bound-cpu-stall — the TMA categorisation axis.
- concepts/use-method — the system-resource companion.
- concepts/profile-guided-optimization — the canonical pass for frontend-bound workloads.
- concepts/simd-vectorization — the canonical pass for core-bound workloads.
- concepts/cache-locality / concepts/instruction-cache-locality — the data-side and instruction-side locality targets.
- systems/linux-perf — the data collector.
- systems/intel-tma — the methodology reference.
- systems/redpanda — Tier-3 canonical example.