

Sixty-second performance checklist

Pattern

When responding to a performance issue on a Linux host, run a fixed, known-order, known-cost sequence of 10 stock shell commands in the first 60 seconds, before reaching for deeper tools. The checklist covers the four major resource classes (CPU / memory / disk / network) across utilisation, saturation, and errors, using only /proc-backed commands that ship with every Linux distribution plus the sysstat package.

Netflix's canonical sequence (Brendan Gregg, Performance Engineering):

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
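
The sequence can be wrapped as a single non-interactive script for a runbook. This is a sketch, not Netflix's tooling: the sample counts (e.g. `vmstat 1 3`) and `top -b -n 1` are substitutions for the rolling/interactive originals so the whole run terminates on its own, and commands missing on a host (e.g. sysstat not installed) are skipped rather than aborting the run.

```shell
#!/bin/sh
# Sketch of a non-interactive wrapper for the 60-second checklist.
# Assumptions: sample counts of 3 and `top -b -n 1` replace the
# rolling/interactive originals; sysstat provides mpstat, pidstat,
# iostat, and sar.
run() {
  printf '\n=== %s ===\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "(skipped: $1 not installed)"   # e.g. sysstat missing
  fi
}
run uptime
printf '\n=== dmesg | tail ===\n'
dmesg 2>/dev/null | tail                 # may need root; empty if denied
run vmstat 1 3
run mpstat -P ALL 1 3
run pidstat 1 3
run iostat -xz 1 3
run free -m
run sar -n DEV 1 3
run sar -n TCP,ETCP 1 3
run top -b -n 1
```

In interactive triage the original rolling forms are better (you watch for variance over time); the counted forms above are for capturing one bounded snapshot into an incident log.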

Why a fixed list

Ten commands in a defined order, memorised, reduce every operator's first-response latency to "start typing immediately." You're not deciding which tool to reach for; you're running the sequence and reading the outputs. The discipline is the same as a pre-flight checklist: repeatable, complete, fast.

Each command answers specific USE-Method cells:

| Command | Covers |
| --- | --- |
| uptime | load average (demand) |
| dmesg \| tail | kernel errors (USE errors) |
| vmstat 1 | CPU saturation (r), mem/swap, CPU util |
| mpstat -P ALL 1 | per-CPU breakdown (single-hot-CPU patterns) |
| pidstat 1 | per-process CPU/mem (rolling, copy-pasteable) |
| iostat -xz 1 | disk util/saturation/errors |
| free -m | memory (page cache + reclaimable) |
| sar -n DEV 1 | NIC util (rxkB/s, txkB/s, ifutil) |
| sar -n TCP,ETCP 1 | TCP connections, retransmits |
| top | catch-all sanity check, variability detector |

Order matters

Errors and saturation first, utilisation last. Errors have sharp thresholds (is there a retransmit? an OOM kill?); saturation has directional thresholds (r > CPU count, avgqu-sz > 1); utilisation is a gradient that depends on context. Netflix's ordering puts dmesg second for exactly this reason — "Don't miss this step! dmesg is always worth checking."
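
The directional "r > CPU count" test can be sketched with stock /proc data. This is an approximation, an assumption of this sketch rather than the checklist's method: it uses the 1-minute load average as a stand-in for vmstat's run-queue column, and on Linux the load average also counts uninterruptible I/O waiters, so it over-reports pure CPU demand.

```shell
#!/bin/sh
# Sketch: approximate the "r > CPU count" CPU-saturation rule using
# the 1-minute load average from /proc/loadavg. Assumption: load
# average stands in for vmstat's r column (on Linux it also includes
# uninterruptible tasks, so treat a "yes" as a prompt to run vmstat).
cpus=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
saturated=$(awk -v l="$load1" -v c="$cpus" 'BEGIN { print (l > c) ? "yes" : "no" }')
echo "cpus=$cpus load1=$load1 cpu-saturated=$saturated"
```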

Why these tools specifically

  • All read /proc counters — effectively zero-cost to run even on a saturated host. No syscall tracing, no ptrace, no kernel patches.
  • Universally available — every Linux distribution ships top, vmstat, uptime, dmesg, free. The sysstat package (one apt install / yum install) adds sar / iostat / mpstat / pidstat.
  • Each has well-known interpretation rules: r > CPU count, %util > 60%, %sys > 20%, non-zero %steal, near-zero cache. These rules are the minimum vocabulary every operator needs.
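
As an illustration of the "near-zero cache" rule, the same counters free -m reads can be pulled from /proc/meminfo directly. The alarming threshold is contextual, so this sketch only reports the page-cache share of RAM rather than passing judgment:

```shell
#!/bin/sh
# Sketch: the "near-zero cache" check behind free -m, read from the
# /proc/meminfo counters free itself uses. Reporting only; whether a
# given percentage is "near zero" depends on the workload.
cached_kb=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
pct=$(awk -v c="$cached_kb" -v t="$total_kb" 'BEGIN {printf "%.1f", 100 * c / t}')
echo "page cache: ${pct}% of RAM (${cached_kb} kB of ${total_kb} kB)"
```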

Handoff to deeper tools

The checklist is step 0, not root cause. Its output narrows the search space to one or two resources; deeper analysis continues with:

  • eBPF / bpftrace for kernel-layer tracing.
  • Strobelight / perf / flame graphs for CPU-hot-path analysis.
  • Atlas (or Prometheus, Datadog, etc.) for fleet-wide / historical view.
  • Application-specific tooling (JVM flight recorder, Go pprof, Python py-spy) for the language-runtime layer.
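
For a runbook, the handoff itself can be made an explicit step. A minimal sketch mapping the implicated resource to a next tool from the list above; the resource labels and the mapping are illustrative, not a standard taxonomy:

```shell
#!/bin/sh
# Sketch: encode the checklist-to-deeper-tool handoff as a lookup.
# Labels (kernel/cpu/fleet/runtime) are this sketch's invention; the
# tool names follow the handoff list in the text.
next_tool() {
  case "$1" in
    kernel)  echo "eBPF / bpftrace" ;;
    cpu)     echo "perf / flame graphs" ;;
    fleet)   echo "Atlas / Prometheus / Datadog" ;;
    runtime) echo "JVM flight recorder / pprof / py-spy" ;;
    *)       echo "re-run checklist; implicate a resource first" ;;
  esac
}
next_tool cpu
```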

The original Netflix post explicitly frames this: "See Brendan's Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands, covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing."

When to use it

  • Before paging an SRE escalation. It takes 60 seconds; do it.
  • When SSH-ing onto a host for the first time in an incident. This is the "shape of the host" triage.
  • In runbooks — encode it as a named runbook step, not tribal knowledge.
  • In interview training / SRE onboarding — the discipline is teachable and transfers across environments.

When it's insufficient

  • Fleet-wide issues — a single host's checklist tells you nothing about a cross-host correlation; pivot to Atlas / Prometheus.
  • Kernel-internal issues — lock contention, NUMA effects, scheduler pathologies may not surface in the 10 commands.
  • Container-internal issues — %iowait / %steal / cgroup throttling need augmented observation (see systems/netflix-runq-monitor).
  • Networking at the fabric layer — sar -n DEV 1 sees only this host's NIC; fabric-level issues (RoCE, ECMP, micro-bursts) need network-side tooling.

Seen in

  • sources/2025-07-29-netflix-linux-performance-analysis-in-60-seconds — canonical instance. Brendan Gregg's 10-command sequence codifies the pattern. Worked examples from Netflix-era Titus production hosts show each command's interpretation in context: a load average of 30 on a 32-CPU box resolved to user-CPU-bound via vmstat; dmesg catching an oom-killer and a TCP SYN flood; two Java processes at 1591% CPU in pidstat.