
CONCEPT

Slow is failure

Definition

Slow is failure is the operational premise that for customer-facing real-time workloads — OLTP databases, API backends, interactive request/response — a component that responds too slowly is indistinguishable from a component that crashed. A 10-second response on a 100 ms-budgeted request path is a 500 to the user and an incident to the operator, regardless of the component's own reported health.
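The premise can be made concrete as a health probe. The sketch below is illustrative (the 100 ms budget and the `probe` helper are assumptions, not from the source): a slow success and a raised exception are scored identically, because the caller cannot tell them apart.

```python
import time

LATENCY_BUDGET_S = 0.100  # 100 ms request budget (illustrative assumption)

def probe(call, budget_s: float = LATENCY_BUDGET_S) -> bool:
    """Run `call` and return True only if it succeeds *within budget*.

    A response slower than the budget and a crashed dependency produce
    the same result: failure. This is "slow is failure" as a predicate.
    """
    start = time.monotonic()
    try:
        call()
    except Exception:
        return False
    return time.monotonic() - start <= budget_s
```

A component that answers in 10 seconds on this probe is exactly as failed as one that refuses the connection.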

"'slow' is often as bad as 'failed', and that happens much much more often." (Source: sources/2025-03-18-planetscale-the-real-failure-rate-of-ebs.)

Canonical EBS worked example

From the PlanetScale post:

This volume has been operating steadily for at least 10 hours. AWS has reported it at 67% idle, with write latency measuring at single-digit ms/operation. Well within expectations. Suddenly, at around 16:00, write latency spikes to 200ms–500ms/operation, idle time races to zero, and the volume is effectively blocked from reading and writing data. To the application running on top of this database: this is a failure. To the user, this is a 500 response on a webpage after a 10 second wait. To you, this is an incident. At PlanetScale, we consider this full failure because our customers do.

The volume is "up" in every AWS-reported sense. The AWS SLO is not breached — gp3 promises "at least 90 percent of provisioned IOPS 99 percent of the time", and this is well inside the 1% allowance. The customer counts it as full failure because the customer's customers do.
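The size of that 1% allowance is worth making explicit. A minimal arithmetic sketch (the helper name is mine; the 99% figure is from the gp3 SLO quoted above):

```python
# gp3 promises at least 90% of provisioned IOPS 99% of the time, so a
# volume can deliver degraded throughput for 1% of any window and still
# be inside the SLO.
SLO_TIME_FRACTION = 0.99

def allowed_degraded_seconds(window_seconds: float) -> float:
    """Seconds of sub-SLO throughput permitted inside `window_seconds`."""
    return (1 - SLO_TIME_FRACTION) * window_seconds

per_day = allowed_degraded_seconds(24 * 3600)        # 864 s, about 14.4 min
per_month = allowed_degraded_seconds(30 * 24 * 3600)  # 25,920 s, 7.2 hours
```

Roughly 14 minutes of blocked I/O per day, every day, is SLO-compliant behavior, which is why the event in the worked example pages the customer and not the provider.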

The framing choice

"Slow is failure" is a customer-centric framing — it collapses the "degraded" category into "failed" for the purpose of operational response. This is structurally different from a provider-centric framing where degraded performance is a lesser-grade event with its own SLO budget.

The two framings disagree on:

  • What counts as an incident (customer: latency spike; provider: SLO breach).
  • Whether the event should page someone (customer: yes; provider: often no, because it's inside the SLO).
  • Whether mitigation is a bug or a feature (customer: bug; provider: expected, hence the SLO wording).

"Slow is failure" is the correct framing for the layer that talks to end users; the upstream-SLO framing is correct for the provider. Two layers, two framings.

Design implications

  • SLOs have to include latency quantiles, not just error rate. p99 write latency > 100 ms for 60s is the kind of condition that detects "slow is failure".
  • Automatic mitigation over human paging. Because these events are frequent at scale, paging a human for each one doesn't scale. The canonical wiki patterns are patterns/automated-volume-health-monitoring + patterns/zero-downtime-reparent-on-degradation.
  • Over-provisioning doesn't close the gap. "When there are no guarantees, even overprovisioning doesn't solve the problem." If the substrate can deliver arbitrarily low throughput for an arbitrarily long duration, no static headroom figure is enough.
  • Structural fix beats operational mitigation. The load-bearing architectural move is to pick a substrate with a bounded lower floor — for OLTP databases, this typically means direct-attached-NVMe + cluster replication instead of network-attached block storage. See patterns/direct-attached-nvme-with-replication + patterns/shared-nothing-storage-topology.
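The first bullet's detection condition (p99 write latency > 100 ms sustained for 60 s) can be sketched as a sliding-window check. The class name, sampling shape, and thresholds below are illustrative assumptions:

```python
from collections import deque

P99_THRESHOLD_MS = 100.0  # latency quantile threshold from the bullet above
HOLD_SECONDS = 60         # how long the breach must persist before firing

class LatencyAlert:
    """Fire when p99 latency stays above threshold for HOLD_SECONDS."""

    def __init__(self):
        self.samples = deque()   # (timestamp_s, latency_ms)
        self.breach_since = None

    def record(self, now_s: float, latency_ms: float) -> bool:
        """Add one sample; return True if the alert condition is met."""
        self.samples.append((now_s, latency_ms))
        # Keep only the trailing HOLD_SECONDS of samples.
        while self.samples and self.samples[0][0] < now_s - HOLD_SECONDS:
            self.samples.popleft()
        latencies = sorted(l for _, l in self.samples)
        p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
        if p99 > P99_THRESHOLD_MS:
            if self.breach_since is None:
                self.breach_since = now_s  # start of the sustained breach
            return now_s - self.breach_since >= HOLD_SECONDS
        self.breach_since = None  # breach ended; reset the hold timer
        return False
```

An error-rate-only SLO never trips on the EBS example above, since every write eventually succeeds; the quantile condition is what turns "slow" into a detectable failure.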
