CONCEPT Cited by 1 source
Temporal profiling¶
Temporal profiling is continuous, wall-clock-timestamped CPU
sampling retained long enough that, after a rare event occurs,
you can "time-travel" back to the exact profile window and see what
the machine was doing. It's the instrument class for sporadic
performance pathologies that a random one-shot perf run has
near-zero probability of catching.
Why ordinary profiling fails for sporadic events¶
Standard profiling practice — run perf record on demand for 30-60
seconds when you notice a slowdown — works for steady-state
contention. It fails when:
- The performance event fires once every several hours.
- The event is non-deterministic across hosts and workload shapes (Pinterest's 3-month investigation saw resets early in some training runs, hours in on others, not at all on many).
- The event self-terminates quickly (ENA reset: <1 ms) and you have no chance to start a profiler before the window closes.
- Symptom metrics don't surface the event in real time; you only learn it happened when someone reviews log lines or kernel messages the next day.
(Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)
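To put a rough number on "near-zero", here is a back-of-the-envelope sketch. It assumes the event starts uniformly at random within its recurrence interval and that the profiler run is started at an unrelated moment. The six-hour interval and the function name are illustrative assumptions, not figures or code from the source.

```python
def capture_probability(profile_s: float, event_s: float, interval_s: float) -> float:
    """Chance that one profiling run overlaps one occurrence of the event.

    Assumes the event starts uniformly at random within `interval_s`
    and that the run is started independently of it.
    """
    # The two intervals overlap when their start offsets differ by less
    # than the sum of their lengths.
    overlap_window = profile_s + event_s
    return min(1.0, overlap_window / interval_s)


# Hypothetical numbers: a 60 s perf run, a 1 ms event, one event per 6 hours.
p = capture_probability(profile_s=60.0, event_s=0.001, interval_s=6 * 3600)
print(f"{p:.4%}")
```

Under these assumptions a single 60-second run catches the event roughly 0.28% of the time, and almost all of that comes from the length of the profiling run, not the length of the event.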
The time-travel mechanic¶
The pattern is:
- Sample continuously at modest frequency (Pinterest used -F 97, ~97 Hz) with per-CPU + per-stack granularity.
- Chunk into fixed-size windows with timestamped filenames (Pinterest: 2-minute perf record windows, hostname + timestamp in the path; 360 windows ≈ 12 h of coverage).
- Let it run during normal load until the event fires.
- Align the event timestamp (from kernel log, app log, alarm) with the nearest profile window and load only that window into the analysis tool.
- Zoom within the window to the second-level sub-region around the event.
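The windowing and alignment steps above can be sketched as follows. This is a minimal illustration, not Pinterest's tooling: the file-naming scheme, the helper names, and the exact perf flags beyond -F 97 are assumptions. The code only builds the command line and does the timestamp arithmetic; it does not invoke perf.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone

WINDOW_S = 120      # 2-minute windows, as in the Pinterest setup
MAX_WINDOWS = 360   # 360 x 2 min = 12 h of retained coverage
SAMPLE_HZ = 97      # value passed to perf's -F


def window_path(host: str, start: datetime) -> str:
    """Hypothetical naming scheme: hostname plus UTC start time."""
    return f"{host}/perf-{start.strftime('%Y%m%dT%H%M%SZ')}.data"


def perf_argv(path: str) -> list[str]:
    """Command line for one window (assumed flags, not Pinterest's exact ones)."""
    # -F: sample frequency, -a: all CPUs, -g: record call stacks, -o: output file
    return ["perf", "record", "-F", str(SAMPLE_HZ), "-a", "-g",
            "-o", path, "--", "sleep", str(WINDOW_S)]


def locate(event: datetime, starts: list[datetime]):
    """Return (window_start, seconds_into_window) for an event, or None.

    `starts` must be sorted ascending.
    """
    i = bisect_right(starts, event) - 1
    if i < 0:
        return None  # event predates the oldest retained window
    offset = (event - starts[i]).total_seconds()
    if offset >= WINDOW_S:
        return None  # event fell in a gap or after the last window
    return starts[i], offset


# Synthetic example: 360 back-to-back windows and an event 3 h 70 s after the first.
t0 = datetime(2026, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
starts = [t0 + timedelta(seconds=WINDOW_S * k) for k in range(MAX_WINDOWS)]
event = t0 + timedelta(hours=3, seconds=70)
start, offset = locate(event, starts)
print(window_path("host-a", start), offset)
```

The offset returned by locate is what you would zoom to inside the loaded window. It is only as accurate as the wall-clock start time stamped into the filename, so the recording loop and the event source should share a clock.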
Flamescope as the canonical visualiser¶
Netflix's Flamescope is the tool of choice
because it renders each 120-second window as a subsecond-offset heatmap
(X axis = elapsed seconds, Y axis = offset within each second, colour =
sample count) and lets you drag-select a sub-window that is then drawn
as a regular flamegraph. That's the time-travel primitive in
concrete UI form. In Pinterest's investigation, an ENA reset landed about
70 s into a 120 s window; zooming into a 5 s sub-window showed kubelet
spiking to 6.5% of total CPU on mem_cgroup_nr_lru_pages, evidence
invisible to any aggregate view.
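Why zooming matters can be illustrated with a small sketch that computes one symbol's share of samples over a whole window and over a narrow sub-window. The Sample type, the share helper, and all the data are synthetic assumptions made for illustration; they do not reproduce Pinterest's measurements or Flamescope's internals.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    t: float        # seconds since the start of the window
    comm: str       # process name
    stack: tuple    # frame names, leaf last


def share(samples, t_from, t_to, comm, symbol):
    """Fraction of samples in [t_from, t_to) from `comm` with `symbol` on the stack."""
    in_range = [s for s in samples if t_from <= s.t < t_to]
    if not in_range:
        return 0.0
    hits = sum(1 for s in in_range if s.comm == comm and symbol in s.stack)
    return hits / len(in_range)


# Synthetic data: 120 s of steady background work at 10 samples/s ...
samples = [Sample(t=i / 10, comm="trainer", stack=("main", "step"))
           for i in range(1200)]
# ... plus a 2 s kubelet burst starting at 69 s.
samples += [Sample(t=69.0 + i / 20, comm="kubelet",
                   stack=("syscall", "mem_cgroup_nr_lru_pages"))
            for i in range(40)]

whole = share(samples, 0, 120, "kubelet", "mem_cgroup_nr_lru_pages")
zoomed = share(samples, 68, 73, "kubelet", "mem_cgroup_nr_lru_pages")
print(f"whole window: {whole:.1%}, 5 s sub-window: {zoomed:.1%}")
```

In this made-up data the burst is a few percent of the full window but a large fraction of the 5-second slice, which is the dilution effect that drag-selecting a sub-window removes.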
Trade-offs vs continuous-profiling platforms¶
Temporal profiling with ad-hoc bash loops around perf record is a
cheap, portable fallback when you don't have a continuous-
profiling platform (gProfiler, Parca, Pyroscope, Strobelight) rolled
out fleet-wide. It lacks cross-host aggregation and symbolisation
niceties but works on any Linux host with perf. Pinterest ran
this setup on a handful of reserved hosts (patterns/reserved-host-repro-env);
they were concurrently rolling out gProfiler with Intel as the
long-term replacement.
Continuous-profiling platforms give you temporal profiling by default:
the storage, indexing, and UI are already built to slice by time.
Ad-hoc perf is the stopgap for teams that are not there yet.
Seen in¶
- sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks
— canonical time-travel use case. 12-hour continuous
perf record on reserved K8s-tainted hosts + post-hoc timestamp alignment with ENA reset events + Flamescope zoom was the instrument chain that surfaced kubelet's mem_cgroup_nr_lru_pages CPU spike after three months of symptom-level debugging.
Related¶
- concepts/flamegraph-profiling — base visualisation
- concepts/stack-trace-sampling-profiling — the sampling primitive
- concepts/per-core-cpu-visibility — the complementary triage axis
- patterns/continuous-perf-record-for-time-travel — the bash-loop implementation recipe
- systems/flamescope — Netflix tool
- systems/linux-perf — Linux sampler