CONCEPT
TMA (top-down microarchitecture analysis)¶
Definition¶
Top-down microarchitecture analysis (TMA) is a CPU-level
performance-diagnostic methodology that uses hardware
performance counters to partition every cycle of CPU execution
into one of four mutually-exclusive top-level categories, then
drills down hierarchically into each category to pinpoint the
specific stall cause. Developed by Intel (Ahmad Yasin, A Top-Down
Method for Performance Analysis and Counters Architecture, ISPASS
2014); now widely supported via Linux perf.
The key value: TMA tells you why code is slow at the CPU level, whereas a traditional call-graph profiler only tells you what code is slow.
The four top-level categories¶
Every issue slot, retired or stalled, is classified into exactly one of the following (verbatim from Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):
- Retiring — "The ideal state where the CPU is actively executing and 'retiring' instructions. A high number here is good."
- Bad speculation — "The CPU is executing instructions, but they are ultimately discarded because the CPU incorrectly predicted a branch outcome."
- Frontend bound — "The CPU is stalled waiting for the instruction stream to get decoded, which happens in the CPU frontend. This often occurs in applications that execute a large amount of code but process little data."
- Backend bound — "The CPU is stalled waiting for the backend to execute the decoded instructions. This category has two major subcategories. The first is core-bound, in which it is stalling due to a lack of available execution resources, such as arithmetic logic units. The second is memory-bound. The CPU is waiting for data to be retrieved from memory or the various cache layers."
The four categories sum to 100% of issue slots, so TMA output is a distribution over stall causes, not an opaque number.
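That invariant makes a level-1 TMA result easy to sanity-check mechanically. A minimal sketch (the percentages below are invented for illustration, not measured data):

```python
# Toy level-1 TMA distribution (illustrative numbers, not measured data).
# The four top-level categories partition all issue slots, so the
# percentages must sum to 100.
level1 = {
    "retiring": 35.0,
    "bad_speculation": 8.0,
    "frontend_bound": 42.0,
    "backend_bound": 15.0,
}

assert abs(sum(level1.values()) - 100.0) < 1e-9

# The dominant stall class is the largest non-retiring bucket;
# that is the category to drill into next.
stalls = {k: v for k, v in level1.items() if k != "retiring"}
dominant = max(stalls, key=stalls.get)
print(dominant)  # frontend_bound in this toy distribution
```

In a real run these numbers come from `perf stat`; the point is only that the output is a distribution, so "dominant" is well-defined.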
Why top-down, not bottom-up¶
Traditional performance tooling exposes hundreds of individual hardware counters (cache misses, branch mispredictions, TLB misses, port-util, …). The bottom-up approach — measure counters, guess at causes — is error-prone: many individual counter elevations don't actually cause stalls (they overlap with other work). Redpanda's 2026-04-02 framing is the canonical wiki rationale (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):
"TMA uses hardware performance counters exposed by the CPU to measure exactly where a CPU stalls while executing the measured part of the code. It operates top-down, starting at a very high level and only then drilling down into affected areas and CPU components. This avoids getting lost in individual performance counters."
The hierarchy goes: top-level (4 categories) → sub-level (e.g.
tma_frontend_bound splits into fetch_latency vs
fetch_bandwidth) → leaf (e.g. fetch_latency splits into
icache_misses, itlb_misses, branch_resteers, etc.).
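The drill-down structure can be sketched as a tree. The node names below follow the `tma_*` naming used by Linux perf's metrics, but the tree is truncated to just the branches named in this note:

```python
# Truncated sketch of the TMA metric tree (tma_* names as exposed by
# Linux perf metrics; only the frontend-bound branch named above is shown).
TMA_TREE = {
    "tma_frontend_bound": {
        "tma_fetch_latency": [
            "tma_icache_misses",
            "tma_itlb_misses",
            "tma_branch_resteers",
        ],
        "tma_fetch_bandwidth": [],
    },
}

def leaves(node):
    """Collect leaf metric names under a subtree."""
    if isinstance(node, list):
        return list(node)
    out = []
    for child in node.values():
        out.extend(leaves(child))
    return out

print(leaves(TMA_TREE["tma_frontend_bound"]))
```

Each `--td-level` step in the workflow below corresponds to descending one level of this tree.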
The workflow¶
- Sample the top level. Run `perf stat --topdown --td-level 1` (Linux) against the target process.
- Identify the dominant stall class. If any of bad-speculation / frontend / backend dominates (>25-30%), that's the target.
- Drill into that category. Run again with `--td-level 2` to split the dominant category into its subcategories.
- Pick the matching optimisation pass:
| Dominant category | Subcategory | Optimisation |
|---|---|---|
| Frontend bound | i-cache miss / iTLB | Code layout, PGO, BOLT, hot-cold splitting |
| Frontend bound | Branch resteer | Branch prediction hints, layout |
| Bad speculation | Branch misprediction | Profile-driven branch hints, restructure |
| Backend bound | Memory-bound | Data layout, prefetch, cache-line packing |
| Backend bound | Core-bound | Vectorisation, ILP, FMA |
| Retiring | (high is good) | Diminishing returns — target other counters |
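The table is effectively a lookup from a drill-down result to an optimisation pass. A minimal sketch, using this note's labels as keys (they are descriptive, not perf metric names):

```python
# The optimisation-pass table above as a (category, subcategory) lookup.
# Keys are this note's descriptive labels, not perf's tma_* metric names.
PASS_FOR = {
    ("frontend_bound", "icache_itlb"): "code layout, PGO, BOLT, hot-cold splitting",
    ("frontend_bound", "branch_resteer"): "branch prediction hints, layout",
    ("bad_speculation", "branch_misprediction"): "profile-driven branch hints, restructure",
    ("backend_bound", "memory_bound"): "data layout, prefetch, cache-line packing",
    ("backend_bound", "core_bound"): "vectorisation, ILP, FMA",
}

def pick_pass(category, subcategory=None):
    """Map a TMA drill-down result to the matching optimisation pass."""
    if category == "retiring":
        # High retiring is the goal state, not a stall to fix.
        return "diminishing returns: target other counters"
    return PASS_FOR[(category, subcategory)]

print(pick_pass("frontend_bound", "icache_itlb"))
```

The Redpanda case below follows exactly the first row: frontend-bound dominated by i-cache pressure, answered with PGO.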
The Redpanda canonical datum¶
Redpanda's baseline and PGO-optimized TMA output is the canonical wiki instance (Source: sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization):
| Build | Frontend bound | Bad speculation | Retiring | Backend bound |
|---|---|---|---|---|
| Baseline | 51.0% | 10.3% | 30.9% | 7.8% |
| PGO-optimized | 37.9% | 9.5% | 36.6% | 16.0% |
Reading: the baseline's 51% frontend-bound is "definitely on the higher end, even for database or distributed applications" (verbatim commentary from the source). PGO targets frontend-bound directly via instruction-cache locality transformations; the post-PGO TMA data shows a 13-point shift out of frontend-bound, about 6 points into retiring (useful work), and about 8 points newly exposed as backend-bound (the next bottleneck).
Load-bearing observation verbatim: "Some frontend stalls have shifted to backend stalls, which is expected: resolving one bottleneck often reveals the next."
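The shift described above can be checked directly against the table's numbers:

```python
# Per-category deltas between the two TMA runs from the Redpanda table.
baseline = {"frontend_bound": 51.0, "bad_speculation": 10.3,
            "retiring": 30.9, "backend_bound": 7.8}
pgo = {"frontend_bound": 37.9, "bad_speculation": 9.5,
       "retiring": 36.6, "backend_bound": 16.0}

delta = {k: round(pgo[k] - baseline[k], 1) for k in baseline}
print(delta)
# frontend_bound drops 13.1 points; retiring gains 5.7; backend_bound
# gains 8.2: the frontend bottleneck shrinks and the next one surfaces.

# Both runs remain full distributions over issue slots.
assert abs(sum(baseline.values()) - 100.0) < 1e-9
assert abs(sum(pgo.values()) - 100.0) < 1e-9
```

Because each run sums to 100%, every point that leaves one category must reappear in another, which is exactly the "resolving one bottleneck reveals the next" observation.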
Relationship to other methodologies¶
- USE Method (Brendan Gregg) — a system-resource-level checklist (utilisation / saturation / errors per resource). TMA is the drill-in when USE identifies the CPU as the saturated resource.
- CPU time breakdown (user / kernel / iowait) — a coarser OS-level breakdown. TMA is orthogonal: it classifies how the CPU spends its busy cycles, independent of the user/kernel split.
- Flamegraph profiling — tells you what function is hot. TMA tells you why the hot function is stalling.
Availability¶
TMA requires CPU support for the PMU (Performance Monitoring
Unit) and appropriate event definitions. Intel publishes the
TMA event tree; AMD Zen supports an equivalent; ARM (Neoverse)
publishes a compatible hierarchy. Linux perf's --topdown flag
works across vendors; specific event names differ.
The Redpanda 2026-04-02 post uses Intel-style event names
(tma_frontend_bound etc.), confirming an Intel-microarchitecture
test host. The TMA-guided wins (clang PGO improving i-cache
locality) transfer to AMD / ARM in the same way, since the
i-cache mechanics are the same, but the specific counter names
differ.
Seen in¶
- sources/2026-04-02-redpanda-supercharging-streaming-with-profile-guided-optimization — canonical wiki source. Linux `perf --topdown --td-level 1` used as the diagnostic that identifies Redpanda Streaming as 51% frontend-bound, justifying PGO as the targeted optimisation.
Related¶
- concepts/frontend-bound-vs-backend-bound-cpu-stall — the TMA classification axis.
- concepts/profile-guided-optimization — the optimisation PGO targets when TMA identifies frontend-bound.
- concepts/instruction-cache-locality — the mechanism PGO improves.
- concepts/cache-locality — the data-cache sibling at the backend-bound altitude.
- concepts/use-method — the complementary resource-level methodology.
- concepts/cpu-utilization-vs-saturation — coarse indicator that TMA is worth running.
- systems/linux-perf — the TMA data collector.
- systems/intel-tma — the canonical TMA reference.
- systems/redpanda — Tier-3 canonical example.
- patterns/tma-guided-optimization-target-selection — the TMA-first-then-targeted-pass methodology.
- patterns/pgo-for-frontend-bound-application — the downstream PGO application when TMA points at frontend-bound.
- patterns/utilization-saturation-errors-triage — the USE method's pattern form.