Temporal profiling

Temporal profiling is continuous, wall-clock-timestamped CPU sampling retained long enough that, after a rare event occurs, you can "time-travel" back to the exact profile window and see what the machine was doing. It's the instrument class for sporadic performance pathologies that a random one-shot perf run has near-zero probability of catching.

Why ordinary profiling fails for sporadic events

Standard profiling practice — run perf record on demand for 30-60 seconds when you notice a slowdown — works for steady-state contention. It fails when:

  • The performance event fires once every several hours.
  • The event is non-deterministic across hosts and workload shapes (Pinterest's 3-month investigation saw resets early in some training runs, hours in on others, not at all on many).
  • The event self-terminates quickly (ENA reset: <1 ms) and you have no chance to start a profiler before the window closes.
  • Symptom metrics don't surface the event in real time; you only learn it happened from a next-day review of log lines or kernel messages.

(Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks)

The time-travel mechanic

The pattern is:

  1. Sample continuously at modest frequency (Pinterest used perf -F 97 — a prime frequency, which avoids sampling in lockstep with periodic activity) with per-CPU + per-stack granularity.
  2. Chunk into fixed-size windows with timestamped filenames (Pinterest: 2-minute perf record windows, hostname + timestamp in the path; 360 windows ≈ 12 h of coverage).
  3. Let it run during normal load until the event fires.
  4. Align the event timestamp (from kernel log, app log, alarm) with the nearest profile window and load only that window into the analysis tool.
  5. Zoom within the window to the second-level sub-region around the event.
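Steps 1–3 can be sketched as a small recording loop. This is a hedged sketch, not Pinterest's script: the profile directory and the `<host>__<timestamp>.perf` filename convention are assumptions; only the window length, window count, and sampling frequency come from the source.

```python
import pathlib
import socket
import subprocess
import time

WINDOW_S = 120      # one perf record chunk (step 2)
MAX_WINDOWS = 360   # 360 x 2 min ~= 12 h of retained coverage
FREQ_HZ = 97        # modest sampling frequency (step 1)


def prune(profile_dir, keep=MAX_WINDOWS):
    """Unlink the oldest windows so retention stays bounded (keep > 0)."""
    files = sorted(pathlib.Path(profile_dir).glob("*.perf"),
                   key=lambda p: p.stat().st_mtime)
    for old in files[:-keep]:
        old.unlink()


def record_forever(profile_dir="/var/profiles"):  # hypothetical path
    """Step 3: keep recording under normal load until the event fires."""
    pathlib.Path(profile_dir).mkdir(parents=True, exist_ok=True)
    while True:
        ts = time.strftime("%Y%m%d-%H%M%S")
        out = f"{profile_dir}/{socket.gethostname()}__{ts}.perf"
        # System-wide (-a), call-graph (-g) sampling for one window.
        subprocess.run(
            ["perf", "record", "-F", str(FREQ_HZ), "-a", "-g",
             "-o", out, "--", "sleep", str(WINDOW_S)],
            check=True)
        prune(profile_dir)
```

The pruning keeps a rolling ~12 h window of history, which bounds disk usage while still covering events that fire "once every several hours".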

Flamescope as the canonical visualiser

Netflix's Flamescope is the tool of choice because it renders a 120-second subsecond-offset heatmap of CPU activity (x-axis = seconds, y-axis = offset within each second) and lets you drag-select a sub-window that is then rendered as an ordinary flame graph. That's the time-travel primitive in concrete UI form. In Pinterest's investigation, an ENA reset landed ~70 s into a 120 s window; zooming to a 5 s sub-window around it showed kubelet spiking to 6.5% of total CPU in mem_cgroup_nr_lru_pages — evidence invisible in any aggregate view.
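Post-hoc alignment (step 4) reduces to a timestamp lookup over the window filenames. A minimal sketch, assuming a hypothetical `<host>__<YYYYmmdd-HHMMSS>.perf` naming convention for the recorded windows; the returned offset is where to zoom within the window:

```python
import pathlib
import time

WINDOW_S = 120  # seconds per recorded window


def parse_start(path):
    """Recover a window's wall-clock start time (epoch seconds)
    from its filename: <host>__<YYYYmmdd-HHMMSS>.perf (assumed)."""
    stamp = pathlib.Path(path).stem.rsplit("__", 1)[1]
    return time.mktime(time.strptime(stamp, "%Y%m%d-%H%M%S"))


def window_for(event_ts, profile_dir):
    """Map an event timestamp (epoch seconds, e.g. from a kernel
    log line) to the window containing it, plus the offset into
    that window to zoom to in Flamescope. None if no window covers it."""
    for p in sorted(pathlib.Path(profile_dir).glob("*.perf")):
        start = parse_start(p)
        if start <= event_ts < start + WINDOW_S:
            return p, event_ts - start
    return None
```

For the 70-s-into-a-120-s-window case above, this would return the matching window file and an offset of 70 s — the point around which to drag-select the 5 s sub-window.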

Trade-offs vs continuous-profiling platforms

Temporal profiling with ad-hoc bash loops around perf record is a cheap, portable fallback when you don't have a continuous-profiling platform (gProfiler, Parca, Pyroscope, Strobelight) rolled out fleet-wide. It lacks cross-host aggregation and symbolisation niceties, but it works on any Linux host with perf. Pinterest ran this setup on a handful of reserved hosts (patterns/reserved-host-repro-env) while concurrently rolling out gProfiler with Intel as the long-term replacement.

Continuous-profiling platforms give you temporal profiling by default — the storage, indexing, and UI are already built to slice by time. Ad-hoc perf is the stopgap for teams that aren't there yet.

Seen in

  • sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks — canonical time-travel use case. 12-hour continuous perf record on reserved K8s-tainted hosts + post-hoc timestamp alignment with ENA reset events + Flamescope zoom was the instrument chain that surfaced kubelet's mem_cgroup_nr_lru_pages CPU spike after three months of symptom-level debugging.