CONCEPT Cited by 1 source

Cumulative downtime vs flap count¶

Definition¶

Cumulative downtime vs flap count is an operational insight for monitoring network link health in GPU training fabrics: the signal that predicts training-run impact is how long a port or link has been down in total, not how many flap events it has had.

Why flap count fails¶

A single InfiniBand port-down event that lasts longer than NCCL_IB_TIMEOUT (~7s) is enough to crash a multi-day training run, regardless of whether the port ever flapped before or after. Conversely, many short flaps (each <7s) may not crash the run at all if each recovers within the timeout window.
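The asymmetry can be shown with a deliberately simplified model. This is a sketch, not NCCL's actual behaviour: `run_survives` and `ASSUMED_TIMEOUT_S` are hypothetical names, and the 7-second value is simply the approximate figure quoted above.

```python
from typing import Iterable

# Assumption: ~7 s is the approximate figure quoted in the text,
# not a value read from NCCL or from any real cluster.
ASSUMED_TIMEOUT_S = 7.0


def run_survives(outage_durations_s: Iterable[float],
                 timeout_s: float = ASSUMED_TIMEOUT_S) -> bool:
    """Simplified model: the run survives if every single outage
    recovers within the timeout window."""
    return all(d <= timeout_s for d in outage_durations_s)


many_short_flaps = [1.0] * 10   # ten flaps, 1 s each
one_long_outage = [10.0]        # one flap, 10 s

print(run_survives(many_short_flaps))  # True
print(run_survives(one_long_outage))   # False
```

In this model the number of events never enters the decision; only the duration of each event does.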

Flap-count thresholds:

- May trigger on harmless short-duration flaps (false positives)
- May miss a single long-duration flap (false negatives, the dangerous case)

Cumulative-downtime thresholds:

- Catch a single 10s port-down event even though it registers as only one flap
- Map directly to training-run lethality via the NCCL_IB_TIMEOUT boundary
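The two alerting policies can be compared side by side on the same outage records. The sketch below is illustrative only: the function names, the `(down_ts, up_ts)` record shape, and both threshold values (`max_flaps=3`, `max_downtime_s=5.0`) are assumptions, not values from the source.

```python
from typing import List, Tuple

Outage = Tuple[float, float]  # (down_ts, up_ts), both in seconds


def flap_count_alert(outages: List[Outage], max_flaps: int = 3) -> bool:
    """Count-based policy: alert when the number of flaps exceeds a limit."""
    return len(outages) > max_flaps


def cumulative_downtime_alert(outages: List[Outage],
                              max_downtime_s: float = 5.0) -> bool:
    """Duration-based policy: alert when total time spent down exceeds a limit."""
    total_down_s = sum(up - down for down, up in outages)
    return total_down_s > max_downtime_s


# One isolated 10 s outage: a single flap, but 10 s of downtime.
one_long_outage = [(0.0, 10.0)]
# Five 0.5 s flaps, one per minute: 2.5 s of downtime in total.
short_flaps = [(i * 60.0, i * 60.0 + 0.5) for i in range(5)]

print(flap_count_alert(one_long_outage))           # False (missed)
print(cumulative_downtime_alert(one_long_outage))  # True
print(flap_count_alert(short_flaps))               # True (false positive)
print(cumulative_downtime_alert(short_flaps))      # False
```

A plain sum treats ten 1s flaps the same as one 10s outage, since both add up to 10s. If that distinction matters, tracking the longest single outage next to the sum is one possible refinement.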

Discovery context¶

Databricks arrived at this insight while investigating a training run that crashed 7 hours in. A single IB port used for RDMA NCCL collectives went down once and recovered. Their continuous health checks monitored flap count, so a single isolated flap did not trip the threshold. The crash happened because that one flap's duration exceeded NCCL_IB_TIMEOUT. They subsequently shifted monitoring from flap-count thresholds to cumulative-downtime thresholds.

(Source: sources/2026-07-01-databricks-gpu-reliability)

Generalisation¶

This insight generalises beyond InfiniBand: for any system where a timeout fires on duration (not count), the monitoring signal must also be duration-based. Count-based alerting creates a semantic mismatch with the failure mechanism.
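For any component that reports up/down transitions, a duration-based signal can be derived from the transition stream itself. The following is a minimal sketch under that assumption; `DowntimeTracker` and its methods are hypothetical and do not describe any existing monitoring system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DowntimeTracker:
    """Accumulates down intervals from up/down state transitions."""
    intervals: List[Tuple[float, float]] = field(default_factory=list)  # closed (down, up) pairs
    down_since: Optional[float] = None  # start of the ongoing outage, if any

    def record(self, ts: float, is_up: bool) -> None:
        """Feed one state observation; repeated identical states are ignored."""
        if not is_up and self.down_since is None:
            self.down_since = ts                           # outage starts
        elif is_up and self.down_since is not None:
            self.intervals.append((self.down_since, ts))   # outage ends
            self.down_since = None

    def downtime_in_window(self, now: float, window_s: float) -> float:
        """Total seconds spent down within [now - window_s, now]."""
        start = now - window_s
        spans = list(self.intervals)
        if self.down_since is not None:
            spans.append((self.down_since, now))           # include the ongoing outage
        # Clip each span to the window; spans outside it contribute 0.
        return sum(max(0.0, min(up, now) - max(down, start)) for down, up in spans)


tracker = DowntimeTracker()
tracker.record(100.0, is_up=False)
tracker.record(110.0, is_up=True)    # 10 s outage
tracker.record(200.0, is_up=False)   # still down at t=205

print(tracker.downtime_in_window(now=205.0, window_s=300.0))  # 15.0
print(tracker.downtime_in_window(now=205.0, window_s=50.0))   # 5.0
```

Counting the ongoing outage matters here: a link that is currently down contributes to the signal before it recovers, which is when a duration-based timeout would be closest to firing.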

Seen in¶

sources/2026-07-01-databricks-gpu-reliability — the post-mortem that motivated the shift.

concepts/nccl-ib-timeout — the timeout mechanism that makes duration the lethal metric
systems/gpu-monitor — the system that implements cumulative-downtime monitoring
patterns/multi-stage-health-check — the broader check architecture this insight feeds into