
NETFLIX 2025-07-29


Netflix — Linux Performance Analysis in 60,000 Milliseconds

Summary

Netflix Performance Engineering (Brendan Gregg + team) publishes a 10-command, 60-second triage checklist for the first minute of any Linux production performance investigation. The intent is explicitly first-response, not root-cause: given an SSH shell on a misbehaving EC2 Linux host, what do you run before reaching for Atlas or on-host tools like eBPF tracing? The ten commands — uptime, dmesg | tail, vmstat 1, mpstat -P ALL 1, pidstat 1, iostat -xz 1, free -m, sar -n DEV 1, sar -n TCP,ETCP 1, top — cover CPU / memory / disk / network utilisation + saturation + errors across the USE Method dimensions with tools that ship on every stock Linux image (the sysstat package supplies sar, iostat, mpstat, pidstat). Per-column interpretation guidance walks through the non-obvious signals — load average including uninterruptible I/O-blocked tasks, user vs system vs iowait vs steal, vmstat's r column as a cleaner CPU-saturation signal than load average, %iowait as a form of CPU idle that points at disk bottlenecks, free -m's -/+ buffers/cache row as the true used/free accounting (the "linuxatemyram" confusion), ZFS-on-Linux's separate cache that free doesn't reflect, and sar -n TCP,ETCP 1 retransmits as a shared signal for network failure + overload. Production examples (from real Titus-era Netflix hosts) show load average 30 with us + sy ≈ 99 confirming CPU saturation, a perl oom-killer + TCP SYN-flood hint in dmesg, two Java processes at 1591% CPU (≈16 cores each) in pidstat, 22 MB/s eth0 receive well below a 1 Gbit/s cap in sar -n DEV 1, and a 34-process run queue depth on a 32-CPU box. The post is a canonical wiki reference on the operating-system-observability toolbox; a small USE-Method checklist encoded as 10 shell commands.
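The ten commands compress into a single script; a minimal sketch, with finite sample counts added so it terminates (the post runs open-ended `vmstat 1` etc. and interrupts by hand), `top` run in batch mode, and the sysstat tools skipped when absent:

```shell
#!/bin/sh
# The post's ten commands, in order. The "1 2" sample counts (1 s interval,
# 2 samples) and the batch-mode top invocation are our additions; tools
# missing from minimal images (the sysstat four) fall through to "(unavailable)".
for cmd in 'uptime' 'dmesg | tail' 'vmstat 1 2' 'mpstat -P ALL 1 2' \
           'pidstat 1 2' 'iostat -xz 1 2' 'free -m' 'sar -n DEV 1 2' \
           'sar -n TCP,ETCP 1 2' 'top -b -n 1'; do
  echo "== $cmd"
  sh -c "$cmd" 2>/dev/null || echo "(unavailable)"
done
```

Copy-pasting the loop's output into an incident record preserves the whole first minute, in the spirit of pidstat's rolling output.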

Key takeaways

  1. The 60-second checklist is a USE-Method triage encoded in 10 commands. Utilisation, saturation, errors across CPU / memory / disk / network — run them in the order given, look for errors and saturation first ("they are both easy to interpret, and then resource utilization"). The commands share a substrate: vmstat and the sysstat tools (sar, iostat, mpstat, pidstat) all read /proc counters, making them safe + cheap on a contended host.
  2. Load average is demand, not CPU usage. Three numbers (1 / 5 / 15-minute exponentially damped moving averages) count both runnable-on-CPU tasks and uninterruptible-I/O-blocked tasks — so high load average on Linux can mean disk saturation, not CPU saturation. Three-number trend matters: 1-min ≫ 15-min = rising load; 1-min ≪ 15-min = you may have logged in after the event. "Worth a quick look only" — it's a demand signal, not a utilisation signal.
  3. vmstat 1's r column is the cleaner CPU-saturation primitive. Count of tasks currently running on CPU and waiting for a CPU slot; unlike load average, excludes I/O-blocked tasks. Interpretation rule: "an 'r' value greater than the CPU count is saturation." Canonical instance of concepts/cpu-utilization-vs-saturation — utilisation ("how much of the CPU is busy") and saturation ("how deep is the queue behind the CPU") are two separate measurements.
  4. CPU time breakdown (us / sy / id / wa / st) disambiguates CPU issues. User time points at application logic; system time > 20% is a kernel / I/O-inefficiency hint; wait-I/O is a form of idle that flags disk-bound work; steal time flags hypervisor co-tenancy (Xen dom0 / another guest consuming your scheduled cycles) — the EC2-era concepts/noisy-neighbor signal visible from inside the guest.
  5. %iowait is a disk-bottleneck signal, not a CPU-busy signal. "You can treat wait I/O as another form of CPU idle, one that gives a clue as to why they are idle." Paired with iostat -xz 1's await (service+queue time in ms), avgqu-sz (queued request count — > 1 often means saturation), and %util (busy percent — > 60% hurts performance, ~100% usually means saturation), you get the per-device picture. Caveat: "if the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from saturated" — a crucial caveat for LVM / RAID / cloud block storage.
  6. free -m's -/+ buffers/cache row is the load-bearing output. Linux uses free memory for the page cache + buffer cache but reclaims it on demand; the top free column undercounts available memory. The -/+ buffers/cache row gives the "less confusing" used / free pair. Near-zero buffers or cached → disk I/O will climb → check with iostat. ZFS-on-Linux (which Netflix uses for some services) has its own cache that free doesn't reflect — the system can look low on memory when memory is actually available via the ARC.
  7. pidstat 1 is top's rolling-log cousin. Same per-process %CPU / %MEM breakdown as top, but prints one line per interval instead of clearing the screen — you can copy-paste it into an incident record. Real example: two Java processes at 1591% + 1583% CPU means each is consuming ~16 of 32 cores — usually one hot flow on a many-core box.
  8. sar -n TCP,ETCP 1 names the three core TCP signals. active/s (locally-initiated connections) as a downstream-call rate, passive/s (remotely-accepted connections) as an inbound-load rate, retrans/s as a joint network-or-server-overload signal — "it may be an unreliable network (e.g., the public Internet), or it may be due to a server being overloaded and dropping packets." Shared-cause ambiguity is inherent to the metric.
  9. dmesg | tail before anything else. Kernel messages capture OOM-kills, TCP SYN-flood drops, hardware errors, driver complaints. "Don't miss this step! dmesg is always worth checking." Example from the post: perl invoked oom-killer killing a 1.9 GB perl process + TCP: Possible SYN flooding on port 7001. Dropping request. — both of which immediately reframe the investigation.
  10. The 60-second checklist is a precondition for deep analysis. Gregg's Linux Performance Tools at Velocity 2015 covers 40+ tools across observability, benchmarking, tuning, profiling, tracing; this 10-command checklist is step 0 — it narrows the search space to one resource class (CPU / mem / disk / network) before you reach for perf / ftrace / eBPF / flamegraphs.
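Takeaway 3's rule ("an 'r' value greater than the CPU count is saturation") can be spot-checked without vmstat at all, assuming (as for the procps vmstat) that the `r` column is sampled from /proc/stat's procs_running counter; a minimal sketch:

```shell
# Compare the kernel's current runnable-task count (the counter behind
# vmstat's `r` column, from /proc/stat's procs_running line) to the CPU count.
# Note this is one instantaneous sample, not a 1 s series like `vmstat 1`.
running=$(awk '/^procs_running/ {print $2}' /proc/stat)
cpus=$(nproc)
if [ "$running" -gt "$cpus" ]; then
  echo "CPU saturated: r=$running > $cpus CPUs"
else
  echo "CPU ok: r=$running <= $cpus CPUs"
fi
```

On the post's example host this check would read r≈34 against 32 CPUs — saturated, with I/O-blocked tasks excluded by construction.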

Systems / tools extracted

  • systems/vmstat — BSD-vintage (1980s) virtual-memory statistics tool; prints per-interval summary of run queue / swap / I/O / system / CPU columns from /proc/stat + /proc/meminfo. r column is the CPU-saturation primitive. First line of output is since-boot average — skip it.
  • systems/iostat — Part of the sysstat package. iostat -xz 1 = extended per-device stats + hide zero-activity devices + 1 s intervals. Columns: r/s / w/s / rkB/s / wkB/s / await / avgqu-sz / %util.
  • systems/mpstat — Per-CPU %usr / %sys / %iowait / %irq / %soft / %steal / %guest / %idle breakdown; exposes single-hot-CPU patterns that whole-CPU averages hide.
  • systems/pidstat — Per-process CPU / memory / I/O / context-switch sampling; rolling output instead of clear-screen like top.
  • systems/sar-sysstat — System Activity Reporter; -n DEV for network interface throughput + packet rate; -n TCP,ETCP for TCP counters (active/passive connections, retransmits, segment errors).
  • systems/linux-top — The canonical interactive process snapshot; aggregates many signals into one screen but hides temporal patterns (Ctrl-S / Ctrl-Q to pause / resume).
  • systems/sysstat-package — Umbrella package providing sar / sadc / iostat / mpstat / pidstat — "some of these commands require the sysstat package installed." Ships by default on most distro base images; explicit install needed on minimal / container images.
  • systems/netflix-atlas — Netflix's cloud-wide telemetry platform; the post explicitly names Atlas as the fleet-scale observability companion to on-host command-line triage.
  • systems/netflix-titus — Netflix's container platform; implicitly the substrate these commands run on inside a Titus container's cgroup.
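Since the sysstat four are the only checklist tools not guaranteed on a minimal image, a pre-incident availability check (tool names from the post; nothing else assumed) is cheap:

```shell
# Verify the checklist's tools exist before you need them under pressure;
# sar, iostat, mpstat and pidstat arrive via the sysstat package.
for t in uptime dmesg vmstat mpstat pidstat iostat free sar top; do
  command -v "$t" >/dev/null 2>&1 || echo "missing: $t (install sysstat?)"
done
```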

Concepts / patterns canonicalised

  • concepts/use-method — Brendan Gregg's Utilisation / Saturation / Errors framework for resource-bottleneck triage; walk every resource (CPU, memory, disk, network) and check all three dimensions. The 60-second checklist is a USE-Method instantiation in 10 shell commands.
  • concepts/load-average — Exponentially damped 1 / 5 / 15-minute moving-average count of runnable + uninterruptible-I/O-blocked tasks. Disk saturation raises Linux load average even with idle CPUs — this is a common surprise. Use the trend of the three numbers to identify rising vs receding load.
  • concepts/cpu-utilization-vs-saturation — Canonical USE-Method distinction: utilisation (fraction of CPU busy) ≠ saturation (queue depth behind the CPU). vmstat's `us + sy + wa` breakdown gives the former; its `r` column gives the latter.
  • concepts/io-wait — CPU time reported as "idle because waiting on disk". Not itself a problem — a form of idle — but a pointer to disk-bottleneck investigation via iostat.
  • concepts/linux-page-cache — Linux uses otherwise-free memory for the page cache + buffer cache; claim is released on demand. free -m's -/+ buffers/cache row is the Linux-native way to see actual free memory; ZFS-on-Linux maintains a separate ARC cache free doesn't reflect.
  • concepts/cpu-time-breakdown — The us / sy / id / wa / st / hi / si / ni / guest / gnice columns across top / vmstat / mpstat; disambiguates application-CPU vs kernel-CPU vs disk-wait-idle vs hypervisor-stolen vs interrupt work.
  • patterns/sixty-second-performance-checklist — Netflix's canonical pattern: 10 stock Linux commands run in a defined order as the first minute of any performance investigation. Establishes utilisation / saturation / errors baseline before reaching for heavier tools. The pattern generalises: any multi-tenant compute substrate benefits from a cheap, repeatable, known-good-cost triage script.
  • patterns/utilization-saturation-errors-triage — The Utilisation / Saturation / Errors enumeration pattern — walk every resource, check every dimension, and do not advance to deep analysis until the checklist is cleared.
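The concepts/linux-page-cache accounting can be reproduced by hand: the old `-/+ buffers/cache` row was approximately MemFree + Buffers + Cached from /proc/meminfo (newer `free` reports the kernel's more precise MemAvailable estimate instead); a sketch:

```shell
# Approximate "really free" memory the way free -m's -/+ buffers/cache row did:
# count reclaimable page/buffer cache back into free. /proc/meminfo values are
# in kB. This still misses a ZFS ARC, as the post warns.
awk '/^MemFree:|^Buffers:|^Cached:/ {sum += $2}
     END {printf "free incl. reclaimable cache: %d MB\n", sum / 1024}' /proc/meminfo
```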

Architectural numbers

  • 10 commands / 60 seconds — the checklist's budget.
  • r > CPU count = saturation — vmstat interpretation rule.
  • %util > 60% usually hurts / ~100% usually saturates — iostat interpretation rule (with the LVM / RAID caveat).
  • %sys > 20% worth investigating — kernel-CPU share threshold.
  • avgqu-sz > 1 often = saturation — iostat queue-depth threshold (with the multi-back-end-device caveat).
  • %steal > 0 = hypervisor-stolen cycles — the in-guest signature of concepts/noisy-neighbor co-tenancy on EC2 / Xen / other hypervisors.
  • Example Titus-era prod host: load average 30.02 / 26.43 / 19.02 on a 32-CPU box; us ≈ 98 / sy ≈ 1 / id ≈ 1; r ≈ 32-34; two Java processes at 1591% + 1583% CPU — a canonical CPU-saturated, user-space-bound, one-to-two-hot-Java-process shape.
  • Example sar -n DEV 1: eth0 receive at 21999 kB/s ≈ 22 MB/s ≈ 176 Mbit/s — well under a 1 Gbit/s limit.
  • Example dmesg: perl invoked oom-killer → 1.9 GB perl killed; TCP: Possible SYN flooding on port 7001. Dropping request. — both reframe investigation instantly.
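The sar -n DEV example's unit arithmetic (using the post's 21999 rxkB/s figure and the doc's 1000-based conversion) is just kB/s × 8 ÷ 1000:

```shell
# 21999 rxkB/s on eth0 -> Mbit/s, against a 1 Gbit/s (1000 Mbit/s) interface
# cap: ~176 Mbit/s, i.e. well under the limit.
awk 'BEGIN { rx = 21999            # rxkB/s from sar -n DEV 1
             mbit = rx * 8 / 1000  # kB/s -> Mbit/s
             printf "%.0f Mbit/s (%.0f%% of 1 Gbit/s)\n", mbit, mbit / 10 }'
```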

Caveats

  • Published by Netflix in 2015 as a snapshot of Brendan Gregg's Velocity 2015 Linux Performance Tools tutorial; the Medium republication (fetched 2025-07-29) is the same content. The Linux kernel version referenced is 3.13.0-49-generic; column semantics have been stable for newer kernels but a handful of columns have been added (e.g. sar -n DEV 1 %ifutil). Where the original post flags "this version also has X", the caveat is version-specific.
  • %util has a well-known interpretation problem on modern NVMe / virtualised block devices that can service multiple I/Os concurrently — a 100% busy device is not necessarily saturated, as the post itself notes.
  • sar -n TCP,ETCP 1 retransmits are a joint network-unreliability + server-overload signal; disambiguating requires additional tooling.
  • The 60-second checklist is a triage artefact, not a root-cause tool; follow-on deep analysis (perf / ftrace / eBPF / flamegraphs / bcc / bpftrace) is an explicit next step in the post.
  • The commands all read /proc counters and are very cheap, but on a fully-saturated host even running them can be painful — Netflix's implicit answer is that you reach for this checklist early before getting deeply saturated, and for Atlas fleet-view when the host is beyond interactive triage.
  • The post frames the checklist as a precursor to the 40+ tool Linux Performance Tools tutorial; eBPF-era tracing (systems/ebpf, Netflix's run-queue-latency monitor, systems/bpftrace, Strobelight) is the deeper layer below this checklist.
  • This is a foundational / pedagogy post rather than a production retrospective — no specific Netflix incident dollar-figure or outage post-mortem, though the examples are from real production hosts.
