CONCEPT Cited by 1 source
Per-core CPU visibility
Per-core CPU visibility is the discipline of watching CPU utilisation core by core rather than as a whole-machine aggregate. On hosts with many vCPUs, a single saturated core can cause production-impacting CPU starvation, especially of latency-sensitive kernel work such as the network driver's NAPI processing, while the whole-machine figure still shows the host as comfortably under-utilised.
The aggregation-bias trap
A 96-vCPU GPU host with one core pinned at 100% shows only ~1% whole-machine CPU (100% / 96 ≈ 1.04%). Any dashboard that averages across cores will report the machine as idle even while a critical kernel thread is being starved. (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
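The arithmetic behind the trap is easy to reproduce. The sketch below is a hypothetical shell helper (summarise_cores is an illustrative name, not an existing tool) that reads one per-core utilisation percentage per line and prints the mean, which is roughly what an averaging dashboard shows, next to the busiest core.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads one per-core utilisation value (percent) per
# line on stdin and prints the whole-machine mean next to the busiest core.
summarise_cores() {
  awk '
    { sum += $1; if (NR == 1 || $1 > max) max = $1 }
    END { if (NR > 0) printf "mean=%.2f max=%.2f\n", sum / NR, max }
  '
}

# Illustrative input: core 0 saturated, the other 95 cores idle.
{
  echo 100
  for i in $(seq 1 95); do echo 0; done
} | summarise_cores
# prints: mean=1.04 max=100.00
```

The mean is the number a whole-machine dashboard plots; the max is the number that matters for a starved kernel thread.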
Pinterest's 2025 ENA-reset incident investigation stalled for weeks at the aggregate-perf stage because "an overall perf view told us very little about what was happening in each individual core." Breaking the data out per core with mpstat -P ALL 1, which gives a per-second, per-core %usr / %sys / %iowait breakdown, immediately surfaced core 39 at 100% %sys for multiple seconds, correlated with the ENA resets, while the rest of the machine stayed quiet.
Canonical triage command
```bash
# Per-core utilisation for all cores, one report per second until interrupted
mpstat -P ALL 1

# Tabular history for offline analysis: 3600 one-second reports (1 hour), as Pinterest collected
mpstat -P ALL 1 3600 > mpstat.log
```
Columns to scan: %usr, %sys, %soft, %iowait. A single core at 100% %sys points at kernel-side consumption such as zombie memcg iteration or lock contention; softirq floods are accounted under %soft rather than %sys. A single core at 100% %usr points at a userspace workload.
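To scan a saved mpstat.log for single-core spikes instead of eyeballing an hour of output, an awk filter along these lines can help. It is a sketch, not part of sysstat: hot_cores is a hypothetical name, the default 90% threshold is an assumption, and it assumes mpstat's plain-text layout (a header row naming the columns before each report, a dot as the decimal separator).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every per-core row of `mpstat -P ALL` text
# (read on stdin) whose <column> is at or above <threshold> percent.
# Usage: hot_cores [column] [threshold] < mpstat.log
hot_cores() {
  local column="${1:-%sys}" threshold="${2:-90}"
  awk -v col="$column" -v thr="$threshold" '
    # Skip the banner, blank lines and the trailing "Average:" summary.
    NF == 0 || /^Linux/ || /^Average/ { next }
    # Header rows (the only rows containing "%"): locate the CPU and target
    # columns, counted from the end of the line so that both "12:00:02"
    # and "12:00:02 PM" style timestamps work.
    /%/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "CPU") cpu_off = NF - i
        if ($i == col)   col_off = NF - i
      }
      next
    }
    # Data rows: ignore the whole-machine "all" row, report hot cores.
    cpu_off && col_off != "" {
      cpu = $(NF - cpu_off)
      if (cpu == "all") next
      val = $(NF - col_off)
      if (val + 0 >= thr + 0) {
        ts = $1
        for (i = 2; i < NF - cpu_off; i++) ts = ts " " $i
        printf "%s cpu=%s %s=%s\n", ts, cpu, col, val
      }
    }
  '
}
```

For example, hot_cores %sys 90 < mpstat.log lists each second in which any single core spent at least 90% of its time in %sys; any other column name from the header, such as %usr, can be passed instead.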
When to reach for it
- Latency-sensitive kernel thread starvation symptoms — network driver resets (concepts/network-driver-reset), packet drops, missed timer callbacks.
- Noisy-neighbor hypotheses on shared multi-tenant hosts where one workload is degrading another.
- Profile triangulation before committing to an expensive temporal-profiling run — per-core visibility tells you which core to profile.
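Once per-core data has named a core, the profiling run can be narrowed to it. A minimal sketch, assuming Linux perf is installed and the caller is allowed to profile system-wide; profile_core and its DRY_RUN switch are hypothetical, and the 99 Hz rate and 30-second default are arbitrary choices rather than anything from the source.

```bash
#!/usr/bin/env bash
# Hypothetical helper: profile only the CPU that per-core triage flagged.
# Usage: profile_core <cpu> [seconds]; set DRY_RUN=1 to print the command.
profile_core() {
  local core="$1" seconds="${2:-30}"
  local out="perf.cpu${core}.data"
  # -C: sample only on that CPU; -g: record call graphs;
  # -F 99: 99 Hz sampling to keep overhead modest on a production host.
  local cmd=(perf record -C "$core" -g -F 99 -o "$out" -- sleep "$seconds")
  if [[ "${DRY_RUN:-0}" == 1 ]]; then
    printf '%s\n' "${cmd[*]}"   # show what would run without running it
    return 0
  fi
  "${cmd[@]}" || return 1
  # Top of the per-CPU profile, grouped by command and symbol.
  perf report -i "$out" --stdio --sort comm,sym | head -n 40
}

# Example (the Pinterest case): profile_core 39 60
```

Restricting sampling to one CPU keeps the profile focused on the saturated core instead of diluting it across the other 95, which is the same aggregation problem in profiler form.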
Complement to temporal profiling
Per-core visibility and temporal profiling form a two-step investigation pattern:
- Per-core visibility tells you which core has the spike and approximately when.
- Temporal profiling (continuous perf record plus FlameScope) tells you which stack is running on that core at that moment.
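A rough sketch of the continuous-capture half, assuming perf is available and FlameScope is later pointed at the output directory (FlameScope loads perf script text output). The capture_windows helper, window length and file naming are hypothetical, not Pinterest's setup; naming each file by its UTC start time is what lets a spike seen in mpstat be matched to the right capture.

```bash
#!/usr/bin/env bash
# Hypothetical helper: consecutive perf captures, each converted to the
# `perf script` text format that FlameScope reads. Files are named by
# their UTC start time so an mpstat spike can be matched to a window.
# There is a small gap between windows while each one is converted.
# Usage: capture_windows <outdir> [window_seconds] [count]
capture_windows() {
  local outdir="$1" window="${2:-60}" count="${3:-60}"
  local n start data
  mkdir -p "$outdir" || return 1
  for ((n = 0; n < count; n++)); do
    start=$(date -u +%Y%m%dT%H%M%SZ)
    data="$outdir/perf.$start.data"
    # -a: all CPUs; -g: call graphs; -F 49: low sampling rate.
    perf record -a -g -F 49 -o "$data" -- sleep "$window" || return 1
    # --header prepends the perf.data header information to the stacks.
    perf script --header -i "$data" > "$outdir/stacks.$start" || return 1
    rm -f "$data"    # keep only the text stacks to bound disk usage
  done
}

# Example: one hour of 60-second windows
# capture_windows /var/tmp/flamescope 60 60
```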
Pinterest applied them in exactly that order: mpstat revealed core 39 saturated at ENA-reset time, then the continuous perf record setup caught the kubelet / mem_cgroup_nr_lru_pages stack at the same timestamp.
Seen in
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks: canonical case study. On 96-vCPU GPU hosts, whole-machine perf hid the saturation; mpstat -P ALL 1 over an hour revealed a single core at 100% %sys for multiple seconds at ENA-reset times. This was the step that turned a months-long investigation into a tractable one.
Related
- concepts/cpu-starvation-network-driver — the motivating class of incident
- concepts/temporal-profiling — natural next step once a core is identified
- concepts/noisy-neighbor — adjacent class of starvation
- systems/mpstat — canonical tool
- systems/linux-perf — for drill-down into the saturated core