
CONCEPT

TMA (top-down microarchitecture analysis)

Definition

Top-down microarchitecture analysis (TMA) is a CPU-level performance-diagnostic methodology that uses hardware performance counters to partition every cycle of CPU execution into one of four mutually exclusive top-level categories, then drills down hierarchically into each category to pinpoint the specific stall cause. Developed at Intel (Ahmad Yasin, A Top-Down Method for Performance Analysis and Counters Architecture, ISPASS 2014); now widely supported via Linux perf.

The key value: TMA tells you why code is slow at the CPU level, whereas a traditional call-graph profiler only tells you what code is slow.

The four top-level categories

Every pipeline issue slot is classified into one of the following (verbatim from Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):

  • Retiring: "The ideal state where the CPU is actively executing and 'retiring' instructions. A high number here is good."
  • Bad speculation: "The CPU is executing instructions, but they are ultimately discarded because the CPU incorrectly predicted a branch outcome."
  • Frontend bound: "The CPU is stalled waiting for the instruction stream to get decoded, which happens in the CPU frontend. This often occurs in applications that execute a large amount of code but process little data."
  • Backend bound: "The CPU is stalled waiting for the backend to execute the decoded instructions. This category has two major subcategories. The first is core-bound, in which it is stalling due to a lack of available execution resources, such as arithmetic logic units. The second is memory-bound. The CPU is waiting for data to be retrieved from memory or the various cache layers."

The four categories sum to 100% of issue slots, so TMA output is a distribution over stall causes rather than an opaque number.
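The level-1 arithmetic can be sketched directly from the formulas in Yasin's ISPASS 2014 paper. This is a minimal illustration for a 4-wide machine; the underlying event names (IDQ_UOPS_NOT_DELIVERED, UOPS_RETIRED.RETIRE_SLOTS, etc.) vary by microarchitecture, so treat the counter inputs as assumptions, not a portable recipe.

```python
# Level-1 TMA formulas (after Yasin, ISPASS 2014) for a 4-wide core.
# Inputs are raw counter values; which events feed them is
# microarchitecture-specific and assumed here for illustration.

def tma_level1(clk_unhalted, uops_issued, uops_retired_slots,
               idq_uops_not_delivered, recovery_cycles, width=4):
    slots = width * clk_unhalted                     # total issue slots
    frontend_bound = idq_uops_not_delivered / slots  # slots the frontend failed to fill
    retiring = uops_retired_slots / slots            # slots doing useful work
    # issued-but-not-retired uops plus recovery bubbles = wasted speculation
    bad_speculation = (uops_issued - uops_retired_slots
                       + width * recovery_cycles) / slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {"frontend_bound": frontend_bound,
            "bad_speculation": bad_speculation,
            "retiring": retiring,
            "backend_bound": backend_bound}
```

By construction the four fractions sum to 1.0, which is exactly why TMA reads as a distribution: backend-bound is defined as the remainder after the other three are accounted for.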

Why top-down, not bottom-up

Traditional performance tooling exposes hundreds of individual hardware counters (cache misses, branch mispredictions, TLB misses, port-util, …). The bottom-up approach — measure counters, guess at causes — is error-prone: many individual counter elevations don't actually cause stalls (they overlap with other work). Redpanda's 2026-04-02 framing is the canonical wiki rationale (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):

"TMA uses hardware performance counters exposed by the CPU to measure exactly where a CPU stalls while executing the measured part of the code. It operates top-down, starting at a very high level and only then drilling down into affected areas and CPU components. This avoids getting lost in individual performance counters."

The hierarchy goes: top-level (4 categories) → sub-level (e.g. tma_frontend_bound splits into fetch_latency vs fetch_bandwidth) → leaf (e.g. fetch_latency splits into icache_misses, itlb_misses, branch_resteers, etc.).
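The tree shape can be sketched as a nested structure with a drill-down helper. The node names follow the tma_* metric naming used by Linux perf, but this subset is illustrative, not the full vendor hierarchy.

```python
# Illustrative subset of the TMA tree; real trees are vendor-published
# and considerably larger.
TMA_TREE = {
    "tma_frontend_bound": {
        "tma_fetch_latency": ["tma_icache_misses", "tma_itlb_misses",
                              "tma_branch_resteers"],
        "tma_fetch_bandwidth": [],
    },
    "tma_backend_bound": {
        "tma_core_bound": [],
        "tma_memory_bound": [],
    },
}

def children(node, tree=TMA_TREE):
    """Return the next drill-down level under a node, or [] at a leaf."""
    for name, subtree in tree.items():
        if name == node:
            return list(subtree)
        if isinstance(subtree, dict):
            found = children(node, subtree)
            if found:
                return found
    return []
```

The point of the structure is the workflow it enables: you only ever inspect the children of the one node that dominated the level above, never the whole counter space at once.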

The workflow

  1. Sample the top level. Run under perf stat --topdown --td-level 1 (Linux) against the target process:
    $ sudo perf stat --topdown --td-level 1 -t $(pidof -s redpanda)
    
  2. Identify the dominant stall class. If any of bad-speculation / frontend / backend dominates (>25-30%), that's the target.
  3. Drill into that category. Run --td-level 2 to split the dominant category into its subcategories.
  4. Pick the matching optimisation pass.
  Dominant category   Subcategory            Optimisation
  Frontend bound      i-cache miss / iTLB    Code layout, PGO, BOLT, hot-cold splitting
  Frontend bound      Branch resteer         Branch prediction hints, layout
  Bad speculation     Branch misprediction   Profile-driven branch hints, restructure
  Backend bound       Memory-bound           Data layout, prefetch, cache-line packing
  Backend bound       Core-bound             Vectorisation, ILP, FMA
  Retiring            (high is good)         Diminishing returns; target other counters
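Steps 2-4 can be sketched as a small decision function. The 30% threshold and the optimisation names mirror the rule of thumb and the table above; both are heuristics from this page, not perf output or a standard.

```python
# Sketch of the "identify dominant stall class, pick the pass" step.
# The threshold and pass names are the rules of thumb from the
# workflow above, not anything perf itself emits.
PASSES = {
    "frontend_bound": "code layout: PGO, BOLT, hot-cold splitting",
    "bad_speculation": "profile-driven branch hints, restructure branches",
    "backend_bound": "data layout / prefetch (memory) or vectorisation (core)",
}

def next_step(breakdown, threshold=0.30):
    """breakdown: dict mapping category name -> fraction, summing to ~1.0."""
    stalls = {k: v for k, v in breakdown.items() if k != "retiring"}
    dominant = max(stalls, key=stalls.get)
    if stalls[dominant] < threshold:
        return "retiring dominates: diminishing returns, profile elsewhere"
    return f"drill into {dominant}; consider {PASSES[dominant]}"
```

Applied to the Redpanda baseline below (51% frontend-bound), this would point straight at the code-layout family of optimisations, which is what the post did with PGO.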

The Redpanda canonical datum

Redpanda's baseline and PGO-optimized TMA output is the canonical wiki instance (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):

  Build           Frontend bound   Bad speculation   Retiring   Backend bound
  Baseline        51.0%            10.3%             30.9%      7.8%
  PGO-optimized   37.9%            9.5%              36.6%      16.0%

Reading: the 51% frontend-bound baseline is "definitely on the higher end, even for database or distributed applications" (verbatim commentary). PGO targets frontend-bound directly via instruction-cache locality transformations; the post-PGO TMA data shows a 13-point shift out of frontend-bound, with roughly 6 points moving into retiring (useful work) and 8 points exposed as backend-bound (the next bottleneck).

Load-bearing observation verbatim: "Some frontend stalls have shifted to backend stalls, which is expected: resolving one bottleneck often reveals the next."
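The shift arithmetic follows directly from the two rows of the table (percentage points, PGO-optimized minus baseline):

```python
# Deltas between the two TMA rows quoted above, in percentage points.
baseline = {"frontend": 51.0, "bad_spec": 10.3, "retiring": 30.9, "backend": 7.8}
pgo      = {"frontend": 37.9, "bad_spec": 9.5,  "retiring": 36.6, "backend": 16.0}
delta = {k: round(pgo[k] - baseline[k], 1) for k in baseline}
# delta -> frontend -13.1, bad_spec -0.8, retiring +5.7, backend +8.2
```

Note that since the four categories sum to 100%, the -13.1 points leaving frontend-bound must reappear elsewhere, which is the mechanism behind "resolving one bottleneck often reveals the next."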

Relationship to other methodologies

  • USE Method (Brendan Gregg) — a system-resource-level checklist (utilisation / saturation / errors per resource). TMA is the drill-in when USE identifies the CPU as the saturated resource.
  • CPU time breakdown (user / kernel / iowait) — coarser OS-level breakdown. TMA is orthogonal: its four categories decompose the cycles the CPU spends executing, regardless of privilege mode.
  • Flamegraph profiling — tells you what function is hot. TMA tells you why the hot function is stalling.

Availability

TMA requires CPU support for the PMU (Performance Monitoring Unit) and appropriate event definitions. Intel publishes the TMA event tree; AMD Zen supports an equivalent; ARM (Neoverse) publishes a compatible hierarchy. Linux perf's --topdown flag works across vendors; specific event names differ.

The Redpanda 2026-04-02 post uses Intel-style event names (tma_frontend_bound etc.), indicating an Intel-microarchitecture test host; the methodology and the PGO conclusions transfer to AMD / ARM in the same way (same i-cache mechanics), but the specific counter names differ.
