
PATTERN

TMA-guided optimization target selection

Context

A team has a CPU-bound service or binary that needs performance work. The engineering cost of each optimisation pass is high (developer time, stability risk, build-pipeline disruption), so picking the wrong pass — vectorising a frontend-bound workload, reorganising memory layout for a core-bound kernel — wastes effort for zero runtime return.

Problem

Traditional profiling (call-graph flame graphs, per-function time) tells you what functions are hot. It does not tell you why those functions are slow at the CPU level. Without that second piece of information:

  • Optimisation becomes guesswork.
  • The team may apply passes that don't move the needle.
  • Worse — the team may apply passes that regress other properties (binary size, stability, maintainability) for no throughput win.

Solution

Use top-down microarchitecture analysis (TMA) as the diagnostic step before picking an optimisation pass. TMA partitions CPU cycles into four mutually exclusive categories (retiring / bad-speculation / frontend-bound / backend-bound) via hardware performance counters. The dominant stall category determines the optimisation family.
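As a minimal sketch (illustrative percentages, not real `perf` output), the level-1 decision reduces to: check that the four fractions account for roughly all pipeline slots, then take the largest stall category:

```python
# TMA level-1 distribution for a hypothetical run (illustrative numbers).
# The four categories are mutually exclusive and together cover ~100% of
# pipeline slots.
tma_level1 = {
    "retiring": 30.9,        # useful work
    "bad_speculation": 5.0,  # wasted on mispredicted paths
    "frontend_bound": 51.0,  # starved of decoded instructions
    "backend_bound": 13.1,   # stalled on data / execution resources
}

# Sanity check: the partition should account for roughly all slots.
assert abs(sum(tma_level1.values()) - 100.0) < 2.0

# The dominant stall category (retiring is the "good" bucket) is the target.
stalls = {k: v for k, v in tma_level1.items() if k != "retiring"}
dominant = max(stalls, key=stalls.get)
print(dominant)  # frontend_bound
```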

Steps

  1. Collect TMA data. Run:

    $ sudo perf stat --topdown --td-level 1 -p $(pidof -s <process>)
    
    against a representative production workload (or the pre-existing regression benchmark).

  2. Read the distribution. The four categories sum to ~100%. If one category dominates (>25-30%), that's the target.

  3. Map category → optimisation family:

| Dominant TMA category | Optimisation family | Canonical tooling |
| --- | --- | --- |
| Frontend-bound | Code layout | PGO, BOLT, hot-cold splitting |
| Bad speculation | Branch-prediction improvement | Profile-driven branch hints, restructured predicates |
| Backend / memory-bound | Data layout / prefetch | SoA / cache-line packing, SW prefetch, loop interchange |
| Backend / core-bound | ILP / vectorisation | SIMD, FMA, loop unrolling |
| Retiring (high) | Diminishing returns | Look at other metrics (scalability, I/O) |
  4. Drill down if needed. At --td-level 2, the dominant category splits into sub-causes (e.g. frontend-bound → fetch-latency vs fetch-bandwidth; memory-bound → DRAM vs L3 vs L2). The sub-cause narrows the pass further.

  5. Apply the matching pass. Ship the change.

  6. Re-measure TMA. Two outcomes:

     • Dominant category decreased and retiring increased → success. Some cycles also move to the next bottleneck category (expected; "resolving one bottleneck often reveals the next" per Redpanda 2026-04-02).
     • No change in the dominant category → the pass didn't work; re-examine.

  7. Iterate. The next TMA run tells you the next pass.
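The steps above can be condensed into a small decision helper. This is a sketch under this pattern's own conventions: the family strings and the ~25% cutoff come from steps 2–3, the success criterion from step 6; function names and the example numbers are hypothetical.

```python
# Category → optimisation family, per the mapping table in step 3.
FAMILY = {
    "frontend_bound": "code layout (PGO, BOLT, hot-cold splitting)",
    "bad_speculation": "branch-prediction improvement",
    "backend_memory_bound": "data layout / prefetch",
    "backend_core_bound": "ILP / vectorisation",
    "retiring": "diminishing returns; look at other metrics",
}

def pick_pass(tma):
    """Steps 2-3: return the family for the dominant category (>~25%)."""
    cat, share = max(tma.items(), key=lambda kv: kv[1])
    if cat != "retiring" and share < 25.0:
        return "no clear dominant category; drill down with --td-level 2"
    return FAMILY[cat]

def pass_worked(before, after, target):
    """Step 6: success = target category shrank and retiring grew."""
    return after[target] < before[target] and after["retiring"] > before["retiring"]

baseline = {"retiring": 30.9, "bad_speculation": 5.0,
            "frontend_bound": 51.0, "backend_memory_bound": 10.0,
            "backend_core_bound": 3.1}
print(pick_pass(baseline))  # code layout (PGO, BOLT, hot-cold splitting)
```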

The Redpanda canonical exemplar

Redpanda Streaming 2026-04-02 (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):

| Step | Result |
| --- | --- |
| 1. Collect baseline TMA | 51% frontend-bound, 30.9% retiring |
| 2. Read | Frontend-bound dominates — 50% is "definitely on the higher end" |
| 3. Map | Frontend-bound → code-layout optimisation → PGO |
| 4. Drill | Skipped — pass chosen from the top-level category |
| 5. Apply | Clang PGO two-phase compilation |
| 6. Re-measure | 37.9% frontend-bound, 36.6% retiring (6 pts to retiring, 7 pts to backend-bound) |
| 7. Iterate | Next pass: address backend-bound stalls |

The structural payoff: at step 3, a different diagnosis would have led to a different (wasted) pass. Vectorising a frontend-bound workload doesn't help — the ALUs sit idle waiting for instructions, not data. TMA made the right pass obvious.
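The re-measurement arithmetic from the exemplar can be checked directly: the points that leave frontend-bound split between retiring (the win) and the next bottleneck. This assumes bad-speculation stayed roughly flat, consistent with the table's rounded "6 pts / 7 pts" figures:

```python
# Redpanda exemplar deltas (percentages from the exemplar above).
frontend_before, frontend_after = 51.0, 37.9
retiring_before, retiring_after = 30.9, 36.6

frontend_drop = frontend_before - frontend_after    # 13.1 points freed
retiring_gain = retiring_after - retiring_before    # ~5.7 points of real work ("6 pts")
to_next_bottleneck = frontend_drop - retiring_gain  # ~7.4 points shift to backend ("7 pts")

print(round(frontend_drop, 1), round(retiring_gain, 1), round(to_next_bottleneck, 1))
```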

Compared to USE method

The USE method (Brendan Gregg) is a parallel discipline at the system-resource altitude — utilisation / saturation / errors per resource (CPU, memory, disk, network). TMA-guided optimisation is the drill-in after USE identifies CPU as the saturated resource. The two compose:

  1. USE → CPU is at 95% utilisation → CPU is the target.
  2. TMA → CPU is 51% frontend-bound → code layout is the target.
  3. Apply PGO → measure.
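The composition can be sketched as two nested triage functions. Helper names and the saturation threshold are hypothetical; only the 95% utilisation and 51% frontend-bound figures come from the list above:

```python
def use_triage(cpu_util, cpu_saturated):
    """USE altitude: flag the CPU as the target resource."""
    return "cpu" if cpu_util >= 0.95 or cpu_saturated else None

def tma_triage(tma):
    """TMA altitude: drill into the flagged CPU for the dominant category."""
    return max(tma, key=tma.get)

# 1. USE: CPU at 95% utilisation → CPU is the target.
if use_triage(cpu_util=0.95, cpu_saturated=False) == "cpu":
    # 2. TMA: 51% frontend-bound → code layout (e.g. PGO) is the target.
    category = tma_triage({"frontend_bound": 51.0, "retiring": 30.9,
                           "bad_speculation": 5.0, "backend_bound": 13.1})
    print(category)  # frontend_bound
```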

See patterns/utilization-saturation-errors-triage and patterns/sixty-second-performance-checklist for Gregg's complementary patterns.

Anti-patterns

  • "Flamegraph says foo() is hot, so let's vectorise foo()." Without TMA, you don't know why foo() is hot. It might be frontend-bound (vectorisation won't help).
  • "Our rivals vectorised; we should too." Cargo-culting an optimisation pass is expensive when the pass doesn't match your workload's TMA profile.
  • "PGO gave Meta a 20% win; let's turn it on." PGO's win is proportional to the frontend-bound percentage of the target workload. A backend-bound workload gets minor or zero PGO benefit.
  • Measuring once, then optimising forever. Profiles go stale across releases; TMA should be re-run with each major code change.

Trade-offs

  • TMA requires PMU support on the target CPU. Most x86 since 2008, ARM Neoverse, and AMD Zen support it; embedded / older chips may not.
  • TMA event names differ across microarchitectures; scripts need porting.
  • Sampling TMA under load can perturb the system (small overhead); validate before drawing conclusions on extreme-tail-sensitive workloads.

Seen in
