PATTERN Cited by 1 source

Utilization / Saturation / Errors triage

Pattern

For every resource in a system, measure three orthogonal dimensions — utilisation (busy fraction), saturation (queue depth / wait time), errors (counted events) — and only advance to root-cause analysis once every resource has been checked on every dimension. This is the enumeration discipline underneath the USE Method.

When to apply

Any time the question is "where is the bottleneck?", whether the resource is an OS-layer CPU / disk / NIC, a distributed-system resource like a Kafka broker or a Redis shard, or an application-layer resource like a thread pool or connection pool.

The discipline

  1. Enumerate every resource class in the system. OS-layer: CPUs, memory buses, network interfaces, storage devices, I/O buses. Application-layer: thread pools, connection pools, work queues, cache slots, tenant slots. Distributed-system-layer: broker partitions, database shards, cache nodes, gateway slots.
  2. For each resource, answer three questions. What is its utilisation? What is its saturation (queue depth or wait time)? What errors has it produced?
  3. Look for errors and saturation first. Both have sharp thresholds (did an error happen? is work arriving faster than it can be served, so the queue keeps growing?). Utilisation is a gradient; interpreting it requires context.
  4. Exonerate resources as you go. "Also pay attention to when you have checked and exonerated a resource, as by process of elimination this narrows the targets to study."
  5. Do not advance to root cause until the sweep is complete. It's tempting to dive into the first suspicious signal, but doing so often means investigating the wrong resource.
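
The five steps above can be sketched as a small bookkeeping structure. This is an illustrative sketch only: the `Sweep` class, its method names, and the boolean "suspicious" flag are assumptions made for the example, not part of the USE Method or of any tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Dimension(Enum):
    UTILISATION = "utilisation"
    SATURATION = "saturation"
    ERRORS = "errors"


@dataclass
class Sweep:
    """Tracks which (resource, dimension) pairs have been checked (hypothetical helper)."""
    resources: list[str]
    # Maps (resource, dimension) -> True if suspicious, False if clean.
    findings: dict[tuple[str, Dimension], bool] = field(default_factory=dict)

    def record(self, resource: str, dimension: Dimension, suspicious: bool) -> None:
        if resource not in self.resources:
            raise ValueError(f"unknown resource: {resource}")
        self.findings[(resource, dimension)] = suspicious

    def unchecked(self) -> list[tuple[str, Dimension]]:
        """Pairs that still need a look before the sweep is complete."""
        return [(r, d) for r in self.resources for d in Dimension
                if (r, d) not in self.findings]

    def exonerated(self) -> list[str]:
        """Resources checked on all three dimensions with nothing suspicious."""
        return [r for r in self.resources
                if all(self.findings.get((r, d)) is False for d in Dimension)]

    def suspects(self) -> list[str]:
        """Resources with at least one suspicious dimension."""
        return [r for r in self.resources
                if any(self.findings.get((r, d)) for d in Dimension)]

    def ready_for_root_cause(self) -> bool:
        """True only once every resource has been checked on every dimension."""
        return not self.unchecked()
```

A sweep over `["cpu", "disk"]` would report `ready_for_root_cause()` as false until all six pairs are recorded, which is the point of step 5.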

The utilization-vs-saturation trap

Most immature triage processes measure only utilisation and treat "X is 99% busy" as the answer. But 99% utilisation with an empty queue is usually fine (throughput is high, latency is stable); 99% utilisation with a growing queue is the classic saturation bottleneck. See concepts/cpu-utilization-vs-saturation for the CPU instance.
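
To make the distinction concrete, here is a minimal sketch that classifies a resource from its busy fraction plus a few recent queue-depth samples. The function name, the 0.9 threshold, the labels, and the crude "last sample larger than first" growth test are all assumptions chosen for illustration; a real check would use a proper trend over a longer window.

```python
def classify(utilisation: float, queue_depths: list[int],
             busy_threshold: float = 0.9) -> str:
    """Classify a resource from its busy fraction and recent queue-depth samples."""
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be a fraction between 0 and 1")

    # Crude growth test (an assumption): the newest sample exceeds the oldest.
    growing = len(queue_depths) >= 2 and queue_depths[-1] > queue_depths[0]
    queued = any(depth > 0 for depth in queue_depths)

    if growing:
        return "saturated"              # work arrives faster than it is served
    if utilisation >= busy_threshold and not queued:
        return "busy but healthy"       # high throughput, nothing waiting
    if queued:
        return "queueing, not growing"  # worth watching, not yet a bottleneck
    return "idle capacity"
```

With these assumptions, a resource at 99% utilisation and an empty queue lands in "busy but healthy", while the same utilisation with queue samples `[2, 5, 11]` lands in "saturated".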

Specific examples from Linux tooling that embody the distinction:

  • CPU: utilisation = us + sy in vmstat; saturation = r column in vmstat (runnable tasks; a value above the CPU count indicates saturation) or run queue latency via eBPF.
  • Disk: utilisation = %util in iostat; saturation = avgqu-sz (named aqu-sz in newer sysstat releases) or await latency.
  • NIC: utilisation = rxkB/s and txkB/s, each compared with link capacity, in sar -n DEV; saturation = packet drops in ifconfig / ip -s link.
  • Memory: utilisation = used vs total; saturation = si / so (swap-in/out) in vmstat; errors = OOM kills in dmesg.
  • TCP: utilisation = connection count; saturation = retransmits in sar -n TCP,ETCP; errors = isegerr/s / orsts/s.
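
As a sketch of the CPU row above, the following parses captured `vmstat` text and derives utilisation and saturation from it. The `SAMPLE` text is made-up illustrative output in the usual Linux `vmstat` layout, not a real measurement, and `parse_vmstat` / `cpu_use` are hypothetical helper names. Comparing `r` against the CPU count is the rule of thumb assumed here.

```python
# Illustrative vmstat output (invented values, standard column layout).
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0      0 812340  10240 403212    0    0     1     4  520  980 71 24  5  0  0
"""


def parse_vmstat(text: str) -> dict[str, int]:
    """Return the last sample of vmstat output as a column-name -> value map."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    # The column header is the line whose first field is "r"; the line above
    # it is only a grouping header (procs, memory, ...).
    header = next(ln for ln in lines if ln and ln[0] == "r")
    sample = lines[-1]
    return dict(zip(header, map(int, sample)))


def cpu_use(sample: dict[str, int], ncpus: int) -> dict[str, object]:
    """Derive CPU utilisation and saturation from one vmstat sample."""
    return {
        "utilisation_pct": sample["us"] + sample["sy"],  # user + system time
        "run_queue": sample["r"],                        # runnable tasks
        "saturated": sample["r"] > ncpus,                # more runnable than CPUs
    }
```

On a live system the text could come from `subprocess.run(["vmstat", "1", "2"], capture_output=True, text=True).stdout`, with `os.cpu_count()` supplying `ncpus`.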

Extension beyond OS resources

The pattern generalises cleanly:

  • Thread pool: utilisation = active threads / max; saturation = queued tasks; errors = task rejections.
  • Connection pool: utilisation = in-use / max; saturation = waiters; errors = timeouts.
  • Kafka partition: utilisation = produce rate relative to the partition's throughput capacity; saturation = consumer lag; errors = broker-side rejection / retry counters.
  • Cache: utilisation = memory used / capacity; saturation = evictions per second; errors = cache errors / corruption counters.
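
For an application-layer resource, the three dimensions usually have to be instrumented by hand. Below is a minimal sketch of a connection-pool-style limiter that exposes in-use / max, waiters, and timeouts as in the connection pool bullet above. `InstrumentedPool` is a hypothetical class written for this example; it hands out permits rather than real connections.

```python
import threading


class InstrumentedPool:
    """A fixed-size permit pool that records USE counters (illustrative only)."""

    def __init__(self, size: int) -> None:
        self._size = size
        self._sem = threading.Semaphore(size)
        self._lock = threading.Lock()
        self.in_use = 0     # numerator for utilisation
        self.waiters = 0    # saturation: callers currently inside acquire()
        self.timeouts = 0   # errors: acquires that gave up

    def acquire(self, timeout: float) -> bool:
        with self._lock:
            self.waiters += 1   # counted even if the permit is granted at once
        ok = self._sem.acquire(timeout=timeout)
        with self._lock:
            self.waiters -= 1
            if ok:
                self.in_use += 1
            else:
                self.timeouts += 1
        return ok

    def release(self) -> None:
        with self._lock:
            self.in_use -= 1
        self._sem.release()

    def use(self) -> dict[str, float]:
        """Snapshot of utilisation, saturation, and errors."""
        with self._lock:
            return {
                "utilisation": self.in_use / self._size,
                "saturation": self.waiters,
                "errors": self.timeouts,
            }
```

Exporting the `use()` snapshot on a timer would give a metrics system all three dimensions for this pool.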

If your observability stack is missing any of the three dimensions for a resource, you can't triage it with USE — and you should fix the observability gap before the next incident.
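
One way to find such gaps ahead of time is to keep a catalogue of which metric backs each dimension for each resource and check it for holes. The sketch below assumes a plain dictionary catalogue; the resource and metric names in the usage note are invented placeholders, not names from any real monitoring system.

```python
REQUIRED = ("utilisation", "saturation", "errors")


def coverage_gaps(catalog: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Return, per resource, the USE dimensions that have no metric assigned."""
    gaps: dict[str, list[str]] = {}
    for resource, metrics in catalog.items():
        # A dimension counts as missing if it is absent or mapped to "".
        missing = [dim for dim in REQUIRED if not metrics.get(dim)]
        if missing:
            gaps[resource] = missing
    return gaps
```

For example, a catalogue entry `{"kafka_partition": {"utilisation": "produce_rate", "errors": "retry_count"}}` (placeholder names) would be reported as missing `saturation`.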

Benefit

  • Complete: by construction, you can't miss a dimension.
  • Fast: each question has a known tool / metric.
  • Exonerate-as-you-go: reduces the search space step by step.
  • Transferable: the same discipline works on OS, app, and distributed-system layers.

Seen in
