CONCEPT Cited by 1 source
Cumulative downtime vs flap count¶
Definition¶
Cumulative downtime vs flap count is an operational insight for monitoring network link health in GPU training fabrics: the signal that predicts training-run impact is the cumulative downtime of a port or link, not the number of flap events.
Why flap count fails¶
A single InfiniBand port-down event that lasts longer than NCCL_IB_TIMEOUT (~7s) is enough to crash a multi-day training run, regardless of whether the port flapped before or after. Conversely, many short flaps (each <7s) may not crash the run at all, provided each one recovers within the timeout window.
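The rule above can be sketched in a few lines of Python. This is a minimal illustration, not NCCL's actual logic: `PortDownEvent`, `crashes_run`, and `TIMEOUT_S` are hypothetical names, and the 7-second value is simply the approximate figure quoted in the text.

```python
from dataclasses import dataclass

# Assumption: an effective timeout of about 7 seconds, as quoted in the text.
TIMEOUT_S = 7.0


@dataclass(frozen=True)
class PortDownEvent:
    """One port-down interval (hypothetical record type for this sketch)."""
    start_s: float  # time the port went down, in seconds
    end_s: float    # time the port came back up, in seconds

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def crashes_run(events, timeout_s=TIMEOUT_S):
    """Return True if any single outage outlasts the timeout."""
    return any(e.duration_s > timeout_s for e in events)


# One 10-second outage versus ten 1-second flaps spaced 10 seconds apart.
single_long = [PortDownEvent(100.0, 110.0)]
many_short = [PortDownEvent(t, t + 1.0) for t in range(0, 100, 10)]

print(crashes_run(single_long))  # the lone long outage exceeds the timeout
print(crashes_run(many_short))   # each short flap recovers within the timeout
```

The point of the sketch is that the outcome depends only on how long each outage lasts, never on how many outages there were.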
Flap-count thresholds:
- May trigger on harmless short-duration flaps (false positives)
- May miss a single long-duration flap (false negatives, the dangerous case)
Cumulative-downtime thresholds:
- Correctly identify that one 10s port-down event is more dangerous than ten 1s flaps
- Map directly to training-run lethality via the NCCL_IB_TIMEOUT boundary
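The contrast between the two threshold styles can be made concrete with a small comparison. The sketch below is illustrative only: the function names and both threshold values (`max_flaps=3`, `max_downtime_s=7.0`) are assumptions chosen for the example, since the source does not state the thresholds that were actually used.

```python
def flap_count_alert(durations_s, max_flaps=3):
    """Count-based check: alert once the number of flaps reaches max_flaps."""
    return len(durations_s) >= max_flaps


def cumulative_downtime_alert(durations_s, max_downtime_s=7.0):
    """Duration-based check: alert once total downtime exceeds the budget."""
    return sum(durations_s) > max_downtime_s


# Outage durations in seconds for two hypothetical ports.
one_long = [8.0]             # a single outage longer than the ~7s timeout
few_short = [1.0, 1.0, 1.0]  # three brief flaps, each well under the timeout

for name, durations in [("one_long", one_long), ("few_short", few_short)]:
    print(
        name,
        "flap_count_alert:", flap_count_alert(durations),
        "cumulative_downtime_alert:", cumulative_downtime_alert(durations),
    )
```

Under these assumed thresholds the count-based check fires on the harmless port and stays silent on the dangerous one, while the downtime-based check does the opposite. A long enough burst of short flaps would also exhaust the downtime budget; how the sum is windowed or reset is a design choice the source does not specify.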
Discovery context¶
Databricks discovered this insight while investigating a training run that crashed 7 hours in. A single IB port used for RDMA NCCL collectives went down once and recovered. Their continuous health checks monitored flap count, but a single isolated flap didn't trip the threshold. The crash happened because that one flap's duration exceeded NCCL_IB_TIMEOUT. They subsequently shifted monitoring from flap-count thresholds to cumulative-downtime thresholds.
(Source: sources/2026-07-01-databricks-gpu-reliability)
Generalisation¶
This insight generalises beyond InfiniBand: for any system where a timeout fires on duration (not count), the monitoring signal must also be duration-based. Count-based alerting creates a semantic mismatch with the failure mechanism.
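One way to apply this to an arbitrary component is to track outage intervals and sum the downtime that falls inside a sliding window. The sketch below is a generic illustration under stated assumptions: the `DowntimeWindow` class, its window length, and its budget are hypothetical and are not taken from the source or from any particular monitoring system. It also assumes outages are recorded in chronological order.

```python
from collections import deque


class DowntimeWindow:
    """Cumulative downtime over a sliding time window (hypothetical helper)."""

    def __init__(self, window_s: float, budget_s: float):
        self.window_s = window_s   # how far back to look, in seconds
        self.budget_s = budget_s   # downtime allowed inside the window
        self._outages = deque()    # (start_s, end_s) pairs, oldest first

    def record_outage(self, start_s: float, end_s: float) -> None:
        """Store one completed outage interval."""
        self._outages.append((start_s, end_s))

    def downtime(self, now_s: float) -> float:
        """Total downtime that overlaps the window ending at now_s."""
        cutoff = now_s - self.window_s
        # Discard outages that ended before the window began.
        while self._outages and self._outages[0][1] <= cutoff:
            self._outages.popleft()
        # Clip each remaining outage to the window before summing.
        return sum(
            max(0.0, min(end, now_s) - max(start, cutoff))
            for start, end in self._outages
        )

    def over_budget(self, now_s: float) -> bool:
        return self.downtime(now_s) > self.budget_s


monitor = DowntimeWindow(window_s=60.0, budget_s=7.0)
monitor.record_outage(0.0, 2.0)
monitor.record_outage(50.0, 58.0)
print(monitor.downtime(60.0), monitor.over_budget(60.0))
```

Because the alert is computed from durations, it measures the same quantity the timeout measures, which is the alignment the paragraph above argues for.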
Seen in¶
- sources/2026-07-01-databricks-gpu-reliability — the post-mortem that motivated the shift.
Related¶
- concepts/nccl-ib-timeout — the timeout mechanism that makes duration the lethal metric
- systems/gpu-monitor — the system that implements cumulative-downtime monitoring
- patterns/multi-stage-health-check — the broader check architecture this insight feeds into