
PATTERN

TMA-guided optimization target selection

Context

A team has a CPU-bound service or binary that needs performance work. The engineering cost of each optimisation pass is high (developer time, stability risk, build-pipeline disruption), so picking the wrong pass — vectorising a frontend-bound workload, reorganising memory layout for a core-bound kernel — wastes effort for zero runtime return.

Problem

Traditional profiling (call-graph flame graphs, per-function time) tells you what functions are hot. It does not tell you why those functions are slow at the CPU level. Without that second piece of information:

  • Optimisation becomes guesswork.
  • The team may apply passes that don't move the needle.
  • Worse — the team may apply passes that regress other properties (binary size, stability, maintainability) for no throughput win.

Solution

Use top-down microarchitecture analysis (TMA) as the diagnostic step before picking an optimisation pass. TMA partitions CPU cycles into four mutually exclusive categories (retiring / bad-speculation / frontend-bound / backend-bound) via hardware performance counters. The dominant stall category determines the optimisation family.
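As a minimal sketch (illustrative percentages, not real `perf` output), the level-1 decision reduces to: check that the four fractions account for roughly all pipeline slots, then take the largest stall category:

```python
# TMA level-1 distribution for a hypothetical run (illustrative numbers).
# The four categories are mutually exclusive and together cover ~100% of
# pipeline slots.
tma_level1 = {
    "retiring": 30.9,        # useful work
    "bad_speculation": 5.0,  # wasted on mispredicted paths
    "frontend_bound": 51.0,  # starved of decoded instructions
    "backend_bound": 13.1,   # stalled on data / execution resources
}

# Sanity check: the partition should account for roughly all slots.
assert abs(sum(tma_level1.values()) - 100.0) < 2.0

# The dominant stall category (retiring is the "good" bucket) is the target.
stalls = {k: v for k, v in tma_level1.items() if k != "retiring"}
dominant = max(stalls, key=stalls.get)
print(dominant)  # frontend_bound
```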

Steps

  1. Collect TMA data. Run:

    $ sudo perf stat --topdown --td-level 1 -p $(pidof -s <process>)
    
    against a representative production workload (or the pre-existing regression benchmark).

  2. Read the distribution. The four categories sum to ~100%. If one category dominates (>25-30%), that's the target.

  3. Map category → optimisation family:

| Dominant TMA category | Optimisation family | Canonical tooling |
| --- | --- | --- |
| Frontend-bound | Code layout | PGO, BOLT, hot-cold splitting |
| Bad speculation | Branch-prediction improvement | Profile-driven branch hints, restructured predicates |
| Backend / memory-bound | Data layout / prefetch | SoA / cache-line packing, SW prefetch, loop interchange |
| Backend / core-bound | ILP / vectorisation | SIMD, FMA, loop unrolling |
| Retiring (high) | Diminishing returns | Look at other metrics (scalability, I/O) |
  4. Drill down if needed. At --td-level 2, the dominant category splits into sub-causes (e.g. frontend-bound → fetch-latency vs fetch-bandwidth; memory-bound → DRAM vs L3 vs L2). The sub-cause narrows the pass further.

  5. Apply the matching pass. Ship the change.

  6. Re-measure TMA. Two outcomes:

     • Dominant category decreased and retiring increased → success. Some cycles also move to the next bottleneck category (expected; "resolving one bottleneck often reveals the next" per Redpanda 2026-04-02).
     • No change in the dominant category → the pass didn't work; re-examine.

  7. Iterate. The next TMA run tells you the next pass.
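The steps above can be condensed into a small decision helper. This is a sketch under this pattern's own conventions: the family strings and the ~25% cutoff come from steps 2–3, the success criterion from step 6; function names and the example numbers are hypothetical.

```python
# Category → optimisation family, per the mapping table in step 3.
FAMILY = {
    "frontend_bound": "code layout (PGO, BOLT, hot-cold splitting)",
    "bad_speculation": "branch-prediction improvement",
    "backend_memory_bound": "data layout / prefetch",
    "backend_core_bound": "ILP / vectorisation",
    "retiring": "diminishing returns; look at other metrics",
}

def pick_pass(tma):
    """Steps 2-3: return the family for the dominant category (>~25%)."""
    cat, share = max(tma.items(), key=lambda kv: kv[1])
    if cat != "retiring" and share < 25.0:
        return "no clear dominant category; drill down with --td-level 2"
    return FAMILY[cat]

def pass_worked(before, after, target):
    """Step 6: success = target category shrank and retiring grew."""
    return after[target] < before[target] and after["retiring"] > before["retiring"]

baseline = {"retiring": 30.9, "bad_speculation": 5.0,
            "frontend_bound": 51.0, "backend_memory_bound": 10.0,
            "backend_core_bound": 3.1}
print(pick_pass(baseline))  # code layout (PGO, BOLT, hot-cold splitting)
```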

The Redpanda canonical exemplar

Redpanda Streaming 2026-04-02 (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):

| Step | Result |
| --- | --- |
| 1. Collect baseline TMA | 51% frontend-bound, 30.9% retiring |
| 2. Read | Frontend-bound dominates — 50% is "definitely on the higher end" |
| 3. Map | Frontend-bound → code-layout optimisation → PGO |
| 4. Drill | Skipped — pass chosen from the top-level category |
| 5. Apply | Clang PGO two-phase compilation |
| 6. Re-measure | 37.9% frontend-bound, 36.6% retiring (6 pts to retiring, 7 pts to backend-bound) |
| 7. Iterate | Next pass: address backend-bound stalls |

The structural payoff: at step 3, a different diagnosis would have led to a different (wasted) pass. Vectorising a frontend-bound workload doesn't help — the ALUs sit idle waiting for instructions, not data. TMA made the right pass obvious.
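The re-measurement arithmetic from the exemplar can be checked directly: the points that leave frontend-bound split between retiring (the win) and the next bottleneck. This assumes bad-speculation stayed roughly flat, consistent with the table's rounded "6 pts / 7 pts" figures:

```python
# Redpanda exemplar deltas (percentages from the exemplar above).
frontend_before, frontend_after = 51.0, 37.9
retiring_before, retiring_after = 30.9, 36.6

frontend_drop = frontend_before - frontend_after    # 13.1 points freed
retiring_gain = retiring_after - retiring_before    # ~5.7 points of real work ("6 pts")
to_next_bottleneck = frontend_drop - retiring_gain  # ~7.4 points shift to backend ("7 pts")

print(round(frontend_drop, 1), round(retiring_gain, 1), round(to_next_bottleneck, 1))
```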

Compared to USE method

The USE method (Brendan Gregg) is a parallel discipline at the system-resource altitude — utilisation / saturation / errors per resource (CPU, memory, disk, network). TMA-guided optimisation is the drill-in after USE identifies CPU as the saturated resource. The two compose:

  1. USE → CPU is at 95% utilisation → CPU is the target.
  2. TMA → CPU is 51% frontend-bound → code layout is the target.
  3. Apply PGO → measure.
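The composition can be sketched as two nested triage functions. Helper names and the saturation threshold are hypothetical; only the 95% utilisation and 51% frontend-bound figures come from the list above:

```python
def use_triage(cpu_util, cpu_saturated):
    """USE altitude: flag the CPU as the target resource."""
    return "cpu" if cpu_util >= 0.95 or cpu_saturated else None

def tma_triage(tma):
    """TMA altitude: drill into the flagged CPU for the dominant category."""
    return max(tma, key=tma.get)

# 1. USE: CPU at 95% utilisation → CPU is the target.
if use_triage(cpu_util=0.95, cpu_saturated=False) == "cpu":
    # 2. TMA: 51% frontend-bound → code layout (e.g. PGO) is the target.
    category = tma_triage({"frontend_bound": 51.0, "retiring": 30.9,
                           "bad_speculation": 5.0, "backend_bound": 13.1})
    print(category)  # frontend_bound
```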

See patterns/utilization-saturation-errors-triage and patterns/sixty-second-performance-checklist for Gregg's complementary patterns.

Anti-patterns

  • "Flamegraph says foo() is hot, so let's vectorise foo()." Without TMA, you don't know why foo() is hot. It might be frontend-bound (vectorisation won't help).
  • "Our rivals vectorised; we should too." Cargo-culting an optimisation pass is expensive when the pass doesn't match your workload's TMA profile.
  • "PGO gave Meta a 20% win; let's turn it on." PGO's win is proportional to the frontend-bound percentage of the target workload. A backend-bound workload gets minor or zero PGO benefit.
  • Measuring once, then optimising forever. Profiles go stale across releases; TMA should be re-run with each major code change.

Trade-offs

  • TMA requires PMU support on the target CPU. Most x86 since 2008, ARM Neoverse, and AMD Zen support it; embedded / older chips may not.
  • TMA event names differ across microarchitectures; scripts need porting.
  • Sampling TMA under load can perturb the system (small overhead); validate before drawing conclusions on extreme-tail-sensitive workloads.

Seen in
