
Storage IO latency SLI thresholds

Definition

A three-tier ladder of block-device latency thresholds for databases, derived from Zalando's empirical RDS Postgres fleet observations:

  • < 5 ms — healthy. "A typical storage latency for database systems should be less than 4 - 5 ms."
  • 5-10 ms — application SLO impact. "The latency above 5 ms impacts on applications SLOs."
  • > 10 ms — incident precursor. "Storage latency above 10 ms eventually leads to incident."

(Source: sources/2024-02-19-zalando-twelve-golden-signals.)
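The ladder can be sketched as a small classifier; a minimal sketch with hypothetical names, using the thresholds quoted above:

```python
def storage_latency_tier(await_ms: float) -> str:
    """Map an observed block-device latency (ms) to the three-tier ladder."""
    if await_ms < 5.0:
        return "healthy"            # typical storage latency for databases
    if await_ms <= 10.0:
        return "slo-impact"         # SLO budget burning: investigate
    return "incident-precursor"     # eventually leads to incident: page
```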

Why a ladder, not a boolean

Traditional alerting sets a single disk-latency threshold and pages once it is crossed. Zalando's three-tier formulation instead gives operators a calibrated response ladder: healthy → watchlist → page. Below 5 ms there is nothing to do. Between 5 and 10 ms, SLO budget is being consumed: investigate, but don't page. Above 10 ms, incident probability is high enough to page proactively.

This mirrors concepts/multi-window-multi-burn-rate logic applied to a cause-metric: faster burn rate = more urgent response.

The two signals

The same phenomenon shows up twice in the 12 golden signals at different altitudes:

  • D3: os.diskIO.rdsdev.await — block-device-level latency, measured by the Linux kernel on the RDS data volume. This is the physical IO latency, including EBS network round trip, device queueing, and actual storage service time.
  • P2: db.IO.blk_read_time — Postgres-level latency, measured by Postgres when reading blocks from storage into the buffer cache. Includes the OS-level read cost plus any Postgres-side overhead.

The two should track each other closely. Divergence indicates that a layer between Postgres and the block device is interfering (OS page cache eviction churn, etc.).
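A simple divergence check can be built from these two signals. Note that Postgres reports `blk_read_time` as a cumulative total (and only when `track_io_timing` is on), so it must first be normalised per block; the helper names below are illustrative, and the 2 ms tolerance is an assumed value, not from the source:

```python
def per_block_read_ms(blk_read_time_ms: float, blks_read: int) -> float:
    """Mean Postgres-level latency per block read over a sampling interval.

    Inputs are deltas of the cumulative counters from pg_stat_database
    (blk_read_time in ms, blks_read in blocks).
    """
    return blk_read_time_ms / blks_read if blks_read else 0.0

def signals_diverge(os_await_ms: float, pg_per_block_ms: float,
                    tolerance_ms: float = 2.0) -> bool:
    """Flag when D3 (OS await) and P2 (Postgres per-block read) stop tracking."""
    return abs(pg_per_block_ms - os_await_ms) > tolerance_ms
```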

Why 10ms is the incident line

At 10ms per block read, a query that touches 1000 blocks spends 10 seconds in storage IO. Queries that should finish in tens of milliseconds stretch into seconds; tail latencies inflate; downstream timeouts fire; connection pools back up; the application degrades.
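The arithmetic behind that claim is just blocks touched times per-block latency:

```python
def query_io_seconds(blocks_touched: int, per_block_ms: float) -> float:
    """Time a query spends in storage IO if every block read misses cache."""
    return blocks_touched * per_block_ms / 1000.0

# 1000 cold blocks at 10 ms each -> 10.0 seconds in storage IO alone.
```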

The 10ms number reflects a combination of:

  • EBS-over-network physics. GP2/GP3 EBS volumes have a baseline latency profile that is fast when healthy but degrades under throttling, neighbour contention, or IOPS-burst exhaustion.
  • Postgres buffer-cache miss cost. A cache miss at 10ms means every top query hitting cold data is slow.
  • Application-level timeout budgets. Most app code assumes sub-10ms DB reads in its latency budget.
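The timeout-budget point can be made concrete: at a fixed application latency budget, doubling per-block latency halves the number of cold reads a request can absorb. A rough sketch (the 100 ms budget is an assumed figure, not from the source):

```python
def cold_reads_in_budget(budget_ms: float, per_block_ms: float) -> int:
    """How many sequential cold block reads fit in an app latency budget."""
    return int(budget_ms // per_block_ms)

# A 100 ms budget absorbs 20 cold reads at 5 ms each, but only 10 at 10 ms.
```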

See concepts/ebs-iops-burst-bucket and concepts/storage-latency-hierarchy for the EBS-specific mechanisms that cause latency to climb.
