
PATTERN

Continuous perf-record for time-travel

Loop perf record in fixed-size timestamped windows for hours or days, so that when a rare event fires you can load just the window that captured it. This is the ad-hoc bash implementation of temporal profiling, for when you don't have a continuous-profiling platform rolled out.

Canonical recipe

The pattern that broke Pinterest's 3-month ENA-reset incident (Source: sources/2026-04-15-pinterest-finding-zombies-in-our-systems-cpu-bottlenecks):

# Loop: 360 iterations * 2 minutes ≈ 12 hours coverage
# Timestamped filenames enable post-hoc event alignment
for i in {1..360}; do
  sudo perf record \
    -F 97 \
    -g \
    -a \
    -o perf-$(hostname)-$(date +"%Y%m%d-%H-%M-%S")-120s.data \
    -- sleep 120
done

# Post-hoc: generate stack text for each window
for datafile in perf-*.data; do
  perf script --header -i "$datafile" > "$datafile.stacks"
done

Then load the specific window's .stacks file into Flamescope (or pipe it through stackcollapse-perf.pl | flamegraph.pl for a plain flamegraph) and zoom to the seconds around the event.

Knob-by-knob rationale

  • -F 97 (97 Hz sampling). Prime number near 100 Hz to avoid aliasing with periodic workloads. Low enough to keep overhead tolerable on multi-vCPU hosts; high enough to catch sub-second spikes.
  • -g (call-graph). Without stack traces, flamegraphs are useless. Required.
  • -a (all CPUs). You don't know in advance which core will have the spike; profile the whole machine.
  • 2-minute windows. Chosen to cap individual perf.data file sizes — a 12-hour single file would be unwieldy and would also force you to reprocess all of it for each replay. 2 minutes is small enough to replay quickly, large enough that a single event lands comfortably inside one window.
  • Hostname + timestamp in filename. Required for fleet-wide deployment and for correlation with kernel-log timestamps.
  • 12-hour run. Tuned to Pinterest's typical 8-12 h training job length — you want the incubation window for the rare event to fit inside the profile horizon.
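As a back-of-envelope check on these knobs, the per-window sample budget is rate × CPUs × duration. A minimal sketch (the 96-vCPU host size is an assumed example, not from the source):

```shell
# Rough sample budget per window: sampling rate * vCPUs * window length.
# 96 vCPUs is an assumed host size for illustration.
hz=97; vcpus=96; window_s=120
samples=$(( hz * vcpus * window_s ))
echo "$samples samples per 2-minute window"   # → 1117440
```

Roughly a million stacks per window is what makes the per-file size land in the hundreds-of-MB range discussed below.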

Matching events to windows

The time-travel step:

  1. Find the event timestamp. Pinterest got it from dmesg ENA reset lines; other examples would be alert firing time, OOM kill time, TCP-reset log time.
  2. Pick the window. Filename timestamp gives you the start; the event time tells you how many seconds in. Pinterest saw the reset at "about 70 seconds into this profile" and zoomed Flamescope to a 5 s sub-window around it.
  3. Load and zoom. Flamescope's heatmap UI + drag-to-select is purpose-built for this.
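Step 2 can be sketched in shell, assuming the filename layout from the recipe above (hostname, timestamps, and event time here are all made up; GNU date is assumed for -d parsing):

```shell
# Given an event time, find the 120 s window that contains it and the
# offset (seconds into that window) to zoom to. Filenames are hypothetical.
event_epoch=$(date -d "2026-04-15 09:13:10" +%s)

match=""; match_offset=0
for f in perf-host1-20260415-09-12-00-120s.data \
         perf-host1-20260415-09-14-00-120s.data; do
  ts=${f#perf-host1-}; ts=${ts%-120s.data}        # YYYYMMDD-HH-MM-SS
  start_epoch=$(date -d "${ts:0:8} ${ts:9:2}:${ts:12:2}:${ts:15:2}" +%s)
  offset=$(( event_epoch - start_epoch ))
  if (( offset >= 0 && offset < 120 )); then
    match=$f; match_offset=$offset
  fi
done
echo "$match @ ${match_offset}s"   # → perf-host1-20260415-09-12-00-120s.data @ 70s
```

In practice you'd glob perf-$(hostname)-*.data instead of hard-coding filenames; the offset printed is the sub-window to drag-select in Flamescope.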

Storage footprint warning

Each 2-minute perf record -F 97 -g -a data file on a 96-vCPU host is typically a few hundred MB. At 360 files per 12-hour run, multiplied across hosts, that fills a disk fast. This pattern is viable on a reserved debug fleet (patterns/reserved-host-repro-env) with a dedicated volume for profile data, not fleet-wide.

If you want fleet-wide continuous profiling, the production-grade replacements are Parca / Pyroscope / gProfiler / Strobelight — they symbolise and deduplicate data server-side so the per-host footprint is manageable. Pinterest was rolling out gProfiler with Intel concurrently for exactly this reason; the bash loop was the ad-hoc bridge.

Failure modes

  • Overhead matters on hot hosts. -F 997 or -F 999 would give finer resolution but also substantially higher CPU overhead, which could itself mask the starvation signal you're hunting.
  • Kernel symbols must be available. Missing /proc/kallsyms or stripped kernel builds give you hex addresses instead of function names in the flamegraph. Validate perf script output early.
  • Disk fills. Watch free space; add find . -name 'perf-*.data' -mmin +720 -delete in parallel if the investigation runs longer than expected.
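The cleanup from the last bullet can be wrapped in a small function run alongside the recorder (the function name, directory, and retention default are assumptions):

```shell
# Delete profile windows (and their .stacks siblings) older than
# retention_min minutes; the 720 min default matches the 12 h run above.
prune_old_windows() {
  local dir=$1 retention_min=${2:-720}
  find "$dir" -name 'perf-*.data*' -mmin +"$retention_min" -delete
}

# Usage (hypothetical profile directory):
# prune_old_windows /var/perf-windows 720
```

Run it from cron or a watch loop so retention keeps pace with capture.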
