Grey failure

Grey failure names a component that is neither fully broken nor fully healthy — degraded partially, intermittently, or below specification. It is the failure mode that health-check booleans and up/down monitoring are structurally unable to catch: the node answers liveness probes, emits its usual metrics, reports its usual status — and silently drags user-visible behavior down.

The term captures two facts:

  1. Binary failure monitoring is insufficient at scale. Most failures on a fleet of thousands of GPUs, drives, or pods are not "node crashed." They are "node is a little slower / lossier / hotter than nominal."
  2. Grey failures compound through fanout. A job that depends on N components tolerates a whole node going down (restart, reroute) far better than one node going slow — because slow propagates through synchronous dependencies and poisons the whole operation. This is the same math as concepts/tail-latency-at-scale.
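The fanout arithmetic behind point 2 can be made concrete. A minimal sketch (the rate `p` and fleet sizes are illustrative, not from the source): even a tiny per-component grey-failure probability becomes a near-certainty once a job spans enough components.

```python
# Probability that a job spanning n components touches at least one
# grey-failing component, given per-component grey-failure rate p.
# Same math as tail-latency-at-scale: fanout amplifies rare events.
def p_job_affected(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# A 0.1% per-node grey-failure rate is negligible on one node...
print(f"n=1:    {p_job_affected(0.001, 1):.3%}")     # 0.100%
# ...but near-certain across a 1,024-GPU training job.
print(f"n=1024: {p_job_affected(0.001, 1024):.1%}")  # 64.1%
```

This is why the note says binary monitoring is insufficient *at scale* specifically: the same component-level rate that is ignorable on one machine dominates job reliability at fleet size.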

Detection strategies

  • High-cardinality correlation. Look for outliers in continuous metrics (per-GPU per-step time, per-NIC packet-drop rate, per-drive p99 IO latency) across a fleet and flag statistical deviations, not threshold crossings.
  • Proactive alerting on degradation, not failure. Alert when throughput or latency trends in the wrong direction, not only when a hard threshold is crossed.
  • Structural isolation. Where individual grey-failing nodes cannot be reliably detected in time, the system is built so any given node's misbehavior is bounded in blast radius (S3's patterns/data-placement-spreading + patterns/redundancy-for-heat; partial-restart recovery at the training-job level — patterns/partial-restart-fault-recovery).
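The first strategy — flagging statistical deviations rather than threshold crossings — can be sketched with a robust z-score over a fleet metric. This is an illustrative implementation, not the source's; the function name, the MAD-based scoring, and the sample data are assumptions.

```python
import statistics

def grey_outliers(samples: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Flag fleet members whose metric deviates statistically from their
    peers, using a robust (median/MAD) z-score instead of a fixed cutoff."""
    values = list(samples.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all: nothing stands out
    # 0.6745 scales MAD so the score is comparable to a standard z-score.
    return [name for name, v in samples.items()
            if abs(v - med) * 0.6745 / mad > threshold]

# Per-GPU step times (seconds): gpu7 is not "down", just ~15% slower.
step_times = {f"gpu{i}": 1.00 + 0.01 * (i % 3) for i in range(8)}
step_times["gpu7"] = 1.15
print(grey_outliers(step_times))  # ['gpu7']
```

Note that no absolute threshold would catch gpu7 without also firing on normal jitter: 1.15 s is "healthy" by any static SLO, but it is a clear statistical outlier against its peers.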

Why it matters for ML / GPU workloads specifically

A distributed training job often runs synchronous all-reduce / gradient exchanges; any slow rank stalls the collective. One grey-failing GPU throttles the entire job to its pace. The customary response — checkpoint and restart — wastes hours of expensive compute when the grey fault is actually "this one GPU is 10 °C hotter than nominal." The SageMaker HyperPod observability capability is explicitly built around detecting grey failures before they cascade. (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)
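The stall dynamic described above follows from the synchronous collective being a `max`, not a mean, over rank times. A minimal simulation under assumed numbers (rank count, jitter, and the 30% slowdown are illustrative):

```python
import random

random.seed(0)

def step_time(ranks: int, slow_ranks: dict[int, float]) -> float:
    """One synchronous all-reduce step: every rank must finish before the
    collective completes, so the step runs at the pace of the slowest rank."""
    times = [random.gauss(1.0, 0.02) for _ in range(ranks)]
    for rank, slowdown in slow_ranks.items():
        times[rank] *= slowdown  # e.g. a thermally throttled GPU
    return max(times)

healthy = sum(step_time(256, {}) for _ in range(100)) / 100
# One grey-failing rank, 30% slower — not down, just degraded.
grey = sum(step_time(256, {17: 1.3}) for _ in range(100)) / 100
print(f"healthy: {healthy:.3f}s   one grey rank: {grey:.3f}s")
```

A single degraded rank out of 256 moves the whole job's step time by roughly its full slowdown factor, which is why restarting around a crashed node is cheap but a slow node is not.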
