Sixty-second performance checklist¶
Pattern¶
When responding to a performance issue on a Linux host, run a fixed
sequence of 10 stock shell commands, in a known order and at a known
cost, in the first 60 seconds before reaching for deeper tools. The
checklist covers all four major resource classes (CPU / memory / disk /
network) across utilisation, saturation, and errors, using only
/proc-backed commands that ship with every Linux distribution plus
the sysstat package.
Netflix's encoding (Brendan Gregg, Performance Engineering):
uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
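As listed, the interval commands (vmstat 1, mpstat -P ALL 1, and so on) keep sampling until interrupted. A minimal sketch of a non-interactive wrapper follows; it assumes a fixed sample count per command is acceptable. The names run_step, checklist, and SAMPLES are illustrative and do not come from the Netflix post.

```shell
#!/bin/sh
# Sketch only: run the ten-command checklist with bounded sample counts.
# SAMPLES, run_step and checklist are hypothetical names chosen for this example.
SAMPLES=${SAMPLES:-5}

# run_step CMD [ARGS...]: print a banner, then run CMD, or note that it is missing.
run_step() {
  printf '=== %s ===\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1
  else
    printf 'skipped: %s not found\n' "$1"
  fi
}

# checklist: the ten commands in their original order.
# Interval commands get an explicit count so they terminate on their own.
checklist() {
  run_step uptime
  run_step sh -c 'dmesg | tail'
  run_step vmstat 1 "$SAMPLES"
  run_step mpstat -P ALL 1 "$SAMPLES"
  run_step pidstat 1 "$SAMPLES"
  run_step iostat -xz 1 "$SAMPLES"
  run_step free -m
  run_step sar -n DEV 1 "$SAMPLES"
  run_step sar -n TCP,ETCP 1 "$SAMPLES"
  run_step top -b -n 1          # batch mode, single iteration
}
```

Calling checklist (for example `checklist | tee /tmp/checklist.txt`) with the default of five samples takes roughly half a minute across the six interval commands, which keeps the whole pass inside the 60-second budget.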
Why a fixed list¶
A memorised list of ten commands in a defined order reduces every operator's first-response latency to "start typing immediately." You're not deciding which tool to reach for; you're running the sequence and reading the outputs. The discipline is the same as a pre-flight checklist: repeatable, complete, fast.
Each command answers specific USE-Method cells:
| Command | Covers |
|---|---|
| uptime | load average (demand) |
| dmesg \| tail | kernel errors (USE errors) |
| vmstat 1 | CPU saturation (r), mem/swap, CPU util |
| mpstat -P ALL 1 | per-CPU breakdown (single-hot-CPU patterns) |
| pidstat 1 | per-process CPU/mem (rolling, copy-pasteable) |
| iostat -xz 1 | disk util/saturation/errors |
| free -m | memory (page cache + reclaimable) |
| sar -n DEV 1 | NIC util (rxkB/s, txkB/s, ifutil) |
| sar -n TCP,ETCP 1 | TCP connections, retransmits |
| top | catch-all sanity check, variability detector |
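To make one table row concrete, the sketch below reads iostat -xz output and reports the disk saturation and utilisation cells separately. It assumes the queue-length column is named either avgqu-sz (older sysstat) or aqu-sz (newer sysstat) and locates columns by header name rather than position. The function name disk_flags and the thresholds of 1 and 60% are illustrative choices for this example.

```shell
#!/bin/sh
# Sketch only: flag disk saturation and high utilisation in `iostat -xz` output.
# disk_flags is a hypothetical helper; thresholds are assumptions for illustration.
disk_flags() {
  awk '
    NF == 0 { in_dev = 0; next }                  # a blank line ends a device table
    $1 ~ /^Device:?$/ {                           # header row: map column names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      q = ("aqu-sz" in col) ? col["aqu-sz"] : col["avgqu-sz"]
      in_dev = 1
      next
    }
    in_dev {
      if (q && $q + 0 > 1)
        printf "%s: queue %.2f > 1 (saturation)\n", $1, $q
      if ($(col["%util"]) + 0 > 60)
        printf "%s: util %.2f%% > 60%%\n", $1, $(col["%util"])
    }
  '
}
```

A typical invocation would be `iostat -xz 1 5 | disk_flags`; no output means neither threshold was crossed in any sample.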
Order matters¶
Errors and saturation first, utilisation last. Errors have
sharp thresholds (is there a retransmit? an OOM kill?); saturation
has directional thresholds (r > CPU count, avgqu-sz > 1);
utilisation is a gradient that depends on context. Netflix's
ordering puts dmesg second for exactly this reason —
"Don't miss this step! dmesg is always worth checking."
Why these tools specifically¶
- All read /proc counters: effectively zero-cost to run even on a saturated host. No syscall tracing, no ptrace, no kernel patches.
- Universally available: every Linux distribution ships top, vmstat, uptime, dmesg, free. The sysstat package (one apt install / yum install) adds sar / iostat / mpstat / pidstat.
- Each has well-known interpretation rules: r > CPU count, %util > 60%, %sys > 20%, non-zero %steal, near-zero cache. These rules are the minimum vocabulary every operator needs.
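The CPU-related rules in the last bullet can be sketched as a filter over vmstat output. This is an illustration under stated assumptions, not part of the original checklist: the function name vmstat_flags is hypothetical, the CPU count is passed in by the caller, and columns are found by header name (r, sy, st) so that extra columns in newer vmstat versions do not shift the checks.

```shell
#!/bin/sh
# Sketch only: apply three interpretation rules to `vmstat 1 N` output on stdin.
# vmstat_flags NCPU is a hypothetical helper; thresholds mirror the rules quoted above.
vmstat_flags() {
  awk -v ncpu="$1" '
    $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }   # header row: name -> position
    !("r" in col) || $1 !~ /^[0-9]+$/ { next }                   # skip banners and non-data rows
    {
      n++
      if ($(col["r"]) + 0 > ncpu + 0)
        printf "sample %d: r=%d exceeds %d CPUs\n", n, $(col["r"]), ncpu
      if ($(col["sy"]) + 0 > 20)
        printf "sample %d: sy=%d%% above 20%%\n", n, $(col["sy"])
      if ($(col["st"]) + 0 > 0)
        printf "sample %d: st=%d%% is non-zero\n", n, $(col["st"])
    }
  '
}
```

One way to use it is `vmstat 1 5 | vmstat_flags "$(nproc)"`. The flags are prompts for a closer look, not verdicts; a single sample over a threshold may be noise.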
Handoff to deeper tools¶
The checklist is step 0, not root cause. Its output narrows the search space to one or two resources; deeper analysis continues with:
- eBPF / bpftrace for kernel-layer tracing.
- Strobelight / perf / flame graphs for CPU-hot-path analysis.
- Atlas (or Prometheus, Datadog, etc.) for fleet-wide / historical view.
- Application-specific tooling (JVM flight recorder, Go pprof, Python py-spy) for the language-runtime layer.
The original Netflix post explicitly frames this: "See Brendan's Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands, covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing."
When to use it¶
- Before paging an SRE. It takes 60 seconds; do it.
- When SSH-ing onto a host for the first time in an incident. This is the "shape of the host" triage.
- In runbooks — encode it as a named runbook step, not tribal knowledge.
- In interview training / SRE onboarding — the discipline is teachable and transfers across environments.
When it's insufficient¶
- Fleet-wide issues: a single host's checklist tells you nothing about a cross-host correlation; pivot to Atlas / Prometheus.
- Kernel-internal issues: lock contention, NUMA effects, scheduler pathologies may not surface in the 10 commands.
- Container-internal issues: %iowait / %steal / cgroup throttling need augmenting observations (see systems/netflix-runq-monitor).
- Networking at the fabric layer: sar -n DEV 1 sees this host's NIC; fabric-level issues (RoCE, ECMP, micro-bursts) need network-side tooling.
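As one example of an augmenting observation for containers, CPU throttling can be read from the cgroup's cpu.stat file. The sketch below assumes the file exposes nr_periods and nr_throttled counters, as the cgroup CPU controller does when a bandwidth limit is in effect; the path varies by host and cgroup version, so it is passed in as an argument. The function name throttle_summary is hypothetical.

```shell
#!/bin/sh
# Sketch only: summarise CPU throttling from a cgroup cpu.stat file.
# throttle_summary FILE is a hypothetical helper; FILE is supplied by the caller
# (for example /sys/fs/cgroup/cpu.stat on a cgroup v2 host, an assumed location).
throttle_summary() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0)
        printf "throttled in %d of %d periods (%.1f%%)\n", throttled, periods, 100 * throttled / periods
      else
        print "no CPU bandwidth periods recorded"
    }
  ' "$1"
}
```

The counters are cumulative, so comparing two readings taken a few seconds apart shows whether throttling is happening now rather than at some point in the past.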
Seen in¶
- sources/2025-07-29-netflix-linux-performance-analysis-in-60-seconds:
  canonical instance. Brendan Gregg's 10-command sequence
  codifies the pattern. Worked examples from Netflix-era Titus
  production hosts show each command's interpretation in
  context: load average 30 on a 32-CPU box resolved to
  user-CPU-bound via vmstat; dmesg catching an oom-killer and a
  TCP SYN flood; two Java processes at 1591% CPU in pidstat.
Related¶
- patterns/utilization-saturation-errors-triage — the reusable enumeration discipline underneath this pattern.
- concepts/use-method — the framework it encodes.
- concepts/load-average · concepts/cpu-utilization-vs-saturation · concepts/cpu-time-breakdown · concepts/io-wait · concepts/linux-page-cache
- systems/vmstat · systems/iostat · systems/mpstat · systems/pidstat · systems/sar-sysstat · systems/linux-top · systems/sysstat-package