
PATTERN · Cited by 1 source

Partial-restart fault recovery

On a failure in a distributed job, restart only the affected resources — the specific pod, rank, or shard that failed — rather than restarting the whole job. Preserves in-progress work on all unaffected participants.

Why it matters for GPU training

The default shape for distributed training on Kubernetes (MPI jobs, Kubeflow training jobs, plain StatefulSets with collective libraries) is effectively all-or-nothing: if any rank dies, the collective communicator fails, the framework tears down, and the standard recovery path is "restart from last checkpoint." That wastes:

  • all compute done since the last checkpoint on every surviving rank,
  • the warm-up time to rebuild the communicator,
  • the GPU-memory setup cost on surviving ranks that had nothing wrong.

At hundreds-to-thousands of GPUs per job, with hours between checkpoints, every whole-job restart burns a meaningful fraction of a day of extremely expensive compute. Multiply by the baseline grey-failure rate (concepts/grey-failure) of a GPU fleet and the cost is cumulative.
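To make the cost concrete, a back-of-envelope sketch. All numbers are illustrative assumptions, not figures from the source:

```python
# Illustrative estimate of compute wasted by one whole-job restart.
# Every surviving rank redoes its work since the last checkpoint and
# pays the warm-up cost to rebuild the communicator and reload state.

def wasted_gpu_hours(num_gpus: int,
                     hours_since_checkpoint: float,
                     restart_overhead_hours: float) -> float:
    """GPU-hours burned across the whole job by a single full restart."""
    return num_gpus * (hours_since_checkpoint + restart_overhead_hours)

# Assumed: 1,024 GPUs, failure lands ~1h after the last checkpoint,
# ~15 minutes of communicator/setup warm-up on restart.
print(wasted_gpu_hours(1024, 1.0, 0.25))  # 1280.0 GPU-hours per restart
```

At a realistic grey-failure rate, several such restarts a week add up to entire GPU-days of pure waste — which is the arithmetic behind restarting only the failed rank instead.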

Structure

  1. Classify the failure domain. The orchestrator must distinguish "this one pod's GPU threw an ECC / stalled / OOMed" from "a collective invariant was violated cluster-wide."
  2. Restart the minimum set. Replace only the failed pod/node; surviving peers keep model state, optimizer state, dataloader position.
  3. Rejoin the collective. The new rank registers with the communicator and catches up — either via state broadcast from a peer (see patterns/state-transfer-on-reshard) or by re-reading the latest checkpoint for just its own shard.
  4. Health signals include grey failures. Monitor for stalled batches (no progress for K steps) and non-numeric loss (NaN / Inf — the canonical "training went sideways" signal). These don't show up as pod crashes; the orchestrator has to watch for them explicitly.
  5. Recovery policy is declarative. Teams shouldn't edit operator code to change recovery behavior — expose it as a config (restart-threshold-per-rank, give-up-after-N-restarts, escalate-to-human, etc.).
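Step 4's health signals can be sketched as a small monitor the orchestrator runs alongside the job. This is a minimal illustration — the class, its thresholds, and the report protocol are invented for the sketch, not any real operator's API:

```python
import math
from typing import Optional

class GreyFailureMonitor:
    """Watches per-rank progress reports for the two signals that never
    show up as pod crashes: stalled batches and NaN/Inf loss."""

    def __init__(self, stall_after_steps: int = 50):
        self.stall_after_steps = stall_after_steps  # K in "no progress for K checks"
        self.last_step = {}     # rank -> last step that made progress
        self.stalled_for = {}   # rank -> consecutive no-progress checks

    def report(self, rank: int, step: int, loss: float) -> Optional[str]:
        """Classify this rank's report; None means healthy."""
        if math.isnan(loss) or math.isinf(loss):
            return "non-numeric-loss"   # canonical "training went sideways"
        if step <= self.last_step.get(rank, -1):
            self.stalled_for[rank] = self.stalled_for.get(rank, 0) + 1
            if self.stalled_for[rank] >= self.stall_after_steps:
                return "stalled"
        else:
            self.last_step[rank] = step
            self.stalled_for[rank] = 0
        return None
```

A non-None classification feeds step 1 (classify the failure domain), which then decides whether step 2's minimal restart applies to just that rank.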

Canonical example: SageMaker HyperPod training operator

For teams using Kubernetes, we've added a HyperPod training operator that brings significant improvements to fault recovery. When failures occur, it restarts only the affected resources rather than the entire job. The operator also monitors for common training issues such as stalled batches and non-numeric loss values. Teams can define custom recovery policies through straightforward YAML configurations. These capabilities dramatically reduce both resource waste and operational overhead.

(Source: sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development)

The operator sits on systems/kubernetes and improves on the default Job / collective-training-operator contract by breaking the "everything goes down together" coupling.
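The declarative recovery policy from step 5 might look like the following sketch — the field names and decision rules here are invented for illustration, not the HyperPod operator's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RecoveryPolicy:
    """Hypothetical config mirroring the knobs step 5 describes.
    Teams change these values, never the operator's code."""
    max_restarts_per_rank: int = 3    # restart-threshold-per-rank
    give_up_after_restarts: int = 10  # give-up-after-N-restarts, job-wide
    escalate_to_human: bool = True    # page someone instead of looping forever

def decide(policy: RecoveryPolicy, rank_restarts: int, total_restarts: int) -> str:
    """Pure decision function over the policy and observed restart counts."""
    if total_restarts >= policy.give_up_after_restarts:
        return "escalate" if policy.escalate_to_human else "fail-job"
    if rank_restarts >= policy.max_restarts_per_rank:
        # Same rank keeps dying: assume the node, not the pod, is bad.
        return "replace-node"
    return "restart-pod"

print(decide(RecoveryPolicy(), rank_restarts=0, total_restarts=0))  # restart-pod
```

Keeping the decision a pure function of config plus observed counts is what makes the policy auditable and safe to tune per team.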

Trade-offs

  • Requires framework cooperation. The training framework (PyTorch DDP / FSDP, etc.) has to support dynamic membership in the collective. Some collectives tear down on any peer loss; elastic variants (e.g. torchelastic) are the prerequisite.
  • Grey-failure detection must have low false-positive rate. Thrashing a rank because it's one step behind the collective is worse than leaving it alone.
  • Checkpoint strategy still matters. Partial restart helps with transient faults; persistent corruption (bad data shard, wrong weights) still needs whole-job rollback.
  • Declarative recovery policy risks misuse. A too-aggressive give-up-after-N can kill a training run on transient infrastructure churn; a too-permissive one masks the underlying fault.

Adjacent patterns

  • patterns/state-transfer-on-reshard — Dicer's pattern of transferring per-slice state between pods on resharding is the same shape in a different domain (sharded in-memory caches vs. distributed training). Both preserve work across pod replacement; both require the framework to understand "my state came from somewhere else."
  • Circuit breakers and bulkheads apply the same idea at a finer grain: isolate the failing component at the request or dependency level instead of tearing the whole system down.
