Grey failure

Grey failure names a component that is neither fully broken nor fully healthy — degraded partially, intermittently, or below specification. It is the failure mode that health-check booleans and up/down monitoring are structurally unable to catch: the node answers liveness probes, emits its usual metrics, reports its usual status — and silently drags user-visible behavior down.

The term captures two facts:

  1. Binary failure monitoring is insufficient at scale. Most failures on a fleet of thousands of GPUs, drives, or pods are not "node crashed." They are "node is a little slower / lossier / hotter than nominal."
  2. Grey failures compound through fanout. A job that depends on N components tolerates a whole node going down (restart, reroute) far better than one node going slow — because slow propagates through synchronous dependencies and poisons the whole operation. This is the same math as concepts/tail-latency-at-scale.
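The fanout arithmetic behind point 2 can be made concrete. A minimal sketch (the rate `p` and fleet sizes are illustrative, not from the source): even a tiny per-component grey-failure probability becomes a near-certainty once a job spans enough components.

```python
# Probability that a job spanning n components touches at least one
# grey-failing component, given per-component grey-failure rate p.
# Same math as tail-latency-at-scale: fanout amplifies rare events.
def p_job_affected(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# A 0.1% per-node grey-failure rate is negligible on one node...
print(f"n=1:    {p_job_affected(0.001, 1):.3%}")     # 0.100%
# ...but near-certain across a 1,024-GPU training job.
print(f"n=1024: {p_job_affected(0.001, 1024):.1%}")  # 64.1%
```

This is why the note says binary monitoring is insufficient *at scale* specifically: the same component-level rate that is ignorable on one machine dominates job reliability at fleet size.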

Detection strategies

  • High-cardinality correlation. Look for outliers in continuous metrics (per-GPU per-step time, per-NIC packet-drop rate, per-drive p99 IO latency) across a fleet and flag statistical deviations, not threshold crossings.
  • Proactive alerting on degradation, not failure. Alert when throughput or latency trends in the wrong direction, not only when a hard threshold is crossed.
  • Structural isolation. Where individual grey-failing nodes cannot be reliably detected in time, the system is built so any given node's misbehavior is bounded in blast radius (S3's patterns/data-placement-spreading + patterns/redundancy-for-heat; partial-restart recovery at the training-job level — patterns/partial-restart-fault-recovery).
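The first strategy — flagging statistical deviations rather than threshold crossings — can be sketched with a robust z-score over a fleet metric. This is an illustrative implementation, not the source's; the function name, the MAD-based scoring, and the sample data are assumptions.

```python
import statistics

def grey_outliers(samples: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Flag fleet members whose metric deviates statistically from their
    peers, using a robust (median/MAD) z-score instead of a fixed cutoff."""
    values = list(samples.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all: nothing stands out
    # 0.6745 scales MAD so the score is comparable to a standard z-score.
    return [name for name, v in samples.items()
            if abs(v - med) * 0.6745 / mad > threshold]

# Per-GPU step times (seconds): gpu7 is not "down", just ~15% slower.
step_times = {f"gpu{i}": 1.00 + 0.01 * (i % 3) for i in range(8)}
step_times["gpu7"] = 1.15
print(grey_outliers(step_times))  # ['gpu7']
```

Note that no absolute threshold would catch gpu7 without also firing on normal jitter: 1.15 s is "healthy" by any static SLO, but it is a clear statistical outlier against its peers.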

Why it matters for ML / GPU workloads specifically

A distributed training job often runs synchronous all-reduce / gradient exchanges; any slow rank stalls the collective. One grey-failing GPU throttles the entire job to its pace. The customary response — checkpoint and restart — wastes hours of expensive compute when the grey fault is actually "this one GPU is 10 °C hotter than nominal." The SageMaker HyperPod observability capability is explicitly built around detecting grey failures before they cascade. (Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)
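The stall dynamic described above follows from the synchronous collective being a `max`, not a mean, over rank times. A minimal simulation under assumed numbers (rank count, jitter, and the 30% slowdown are illustrative):

```python
import random

random.seed(0)

def step_time(ranks: int, slow_ranks: dict[int, float]) -> float:
    """One synchronous all-reduce step: every rank must finish before the
    collective completes, so the step runs at the pace of the slowest rank."""
    times = [random.gauss(1.0, 0.02) for _ in range(ranks)]
    for rank, slowdown in slow_ranks.items():
        times[rank] *= slowdown  # e.g. a thermally throttled GPU
    return max(times)

healthy = sum(step_time(256, {}) for _ in range(100)) / 100
# One grey-failing rank, 30% slower — not down, just degraded.
grey = sum(step_time(256, {17: 1.3}) for _ in range(100)) / 100
print(f"healthy: {healthy:.3f}s   one grey rank: {grey:.3f}s")
```

A single degraded rank out of 256 moves the whole job's step time by roughly its full slowdown factor, which is why restarting around a crashed node is cheap but a slow node is not.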
