
CPU utilisation ceiling on databases

Definition

Database workloads should not run hot on CPU. Zalando's empirical observation from running an RDS Postgres fleet: CPU utilisation above 40-60% on a database instance is an incident precursor, not a healthy saturation target.

"Typical database workloads are bound to memory or storage, high CPU is an anomaly that requires further investigation. Our past experience advises us that CPU utilisation over 40% - 60% on database instances eventually leads to incidents."

sources/2024-02-19-zalando-twelve-golden-signals

Why this is counter-intuitive

For general-purpose stateless services, engineers target 70-80% CPU utilisation as the capacity-headroom point — below that there's wasted hardware, above it the p99 starts degrading from queue buildup. Applying that heuristic to databases is wrong: on a healthy OLTP database, the CPU should be bored, waiting on memory and storage.

Why databases are memory/IO-bound by design

A well-provisioned OLTP database has:

  • A buffer cache sized to fit the working set — Postgres shared_buffers, MySQL InnoDB buffer pool. Queries hit RAM, not disk, most of the time.
  • Storage IO that is the expensive operation — even with high cache hit ratios, the cache misses that do reach disk dominate end-to-end query latency. The CPU spends its time waiting on block reads.
  • Simple per-row work — the CPU work per row (parse, plan, execute a row op, evaluate predicates) is modest; the dominant cost is moving data between memory tiers, not transforming it.
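A back-of-the-envelope sketch shows why the misses that do reach disk dominate even at high hit ratios. The latency figures here are illustrative order-of-magnitude assumptions, not measurements:

```python
# Illustrative latencies (assumptions, order-of-magnitude only):
RAM_READ_US = 0.1    # buffer-cache hit, ~100 ns
SSD_READ_US = 100.0  # cache miss served from SSD, ~100 µs

def avg_page_latency_us(hit_ratio: float) -> float:
    """Expected per-page read latency for a given buffer-cache hit ratio."""
    return hit_ratio * RAM_READ_US + (1 - hit_ratio) * SSD_READ_US

avg = avg_page_latency_us(0.99)
disk_share = ((1 - 0.99) * SSD_READ_US) / avg
print(f"avg latency: {avg:.2f} us, disk share of total: {disk_share:.0%}")
# Even at a 99% hit ratio, the 1% of misses account for ~91% of read time,
# which is why the CPU spends its life waiting rather than computing.
```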

If CPU is hot, one of these assumptions has broken: the working set has outgrown RAM (queries fall back to CPU-heavy decompression or sort paths), a bad query plan is burning CPU on full scans, a connection spike is multiplying per-query overhead, or the application is hammering the database with queries the planner handles poorly. High CPU is a diagnostic signal that something is wrong, not a capacity-utilised signal.

The companion signal: CPU await

Zalando pairs C1 (os.cpuUtilization.total) with C2 (os.cpuUtilization.await). CPU await — the fraction of time CPUs spend waiting for IO requests to complete — locates where the CPU bottleneck is: if C1 is hot but C2 is low, CPU is doing work; if C1 is hot and C2 is also high (>5-10%), the CPU is waiting on storage bandwidth.
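The C1/C2 pairing can be sketched as a simple triage rule. The function name and exact thresholds are illustrative assumptions, not Zalando's implementation; they echo the low end of the 40-60% band and the >5-10% await heuristic:

```python
def classify_cpu(c1_util: float, c2_await: float,
                 util_hot: float = 0.40, await_high: float = 0.05) -> str:
    """Rough triage of the C1 (os.cpuUtilization.total) / C2
    (os.cpuUtilization.await) pair. Thresholds are illustrative.
    """
    if c1_util < util_hot:
        return "healthy: CPU is bored, as expected for OLTP"
    if c2_await >= await_high:
        return "hot and waiting: CPU stalled on storage; check IO bandwidth"
    return "hot and busy: CPU doing real work; check plans, connections, sorts"
```

In words: C2 splits "the database is CPU-saturated" into two different incidents with two different fixes, which is why the signals ship as a pair.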

Why the 40-60% band is empirical

Zalando is explicit that the number comes from production incidents, not first principles: "Our past experience advises us that CPU utilisation over 40% - 60% on database instances eventually leads to incidents." Other fleets will have different breakpoints. The direction is universal (CPU hot = database sick); the number needs calibration per-fleet.

Operational consequence

Capacity-planning dashboards that treat database CPU as "headroom" and scale to 70-80% like stateless services will run the database into incident territory. Teams using golden signals for RDS as their fleet methodology should alert at 30-40% as a warning and 60% as a hard alert — well before a stateless service would pay attention to the same number.
