

Training checkpoint

Definition

A training checkpoint is a periodic dump of the full state required to resume a distributed training job after a failure: model weights, optimizer state (Adam m/v buffers), learning-rate-scheduler state, data-loader position, random-number-generator seeds, and (often) parallelism-aware partitioning metadata. Restoring from a checkpoint lets training resume at or near the last committed step rather than restarting from scratch.
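The components listed above can be sketched as a plain Python structure. This is a minimal illustration under assumed field names (`make_checkpoint`, the `m`/`v` keys, and the dict layout are hypothetical, not from Meta's undisclosed implementation); a real system serializes sharded tensors rather than pickling one dict:

```python
import io
import pickle
import random

def make_checkpoint(step, weights, adam_m, adam_v, lr, dataloader_cursor):
    # Hypothetical layout: every field needed to resume exactly at `step`.
    return {
        "step": step,                                 # last committed optimizer step
        "model": weights,                             # model parameters
        "optimizer": {"m": adam_m, "v": adam_v},      # Adam first/second moments
        "lr_scheduler": {"last_lr": lr},              # LR-scheduler state
        "dataloader": {"cursor": dataloader_cursor},  # position in the data stream
        "rng": random.getstate(),                     # RNG state for exact resume
    }

def save_checkpoint(ckpt):
    buf = io.BytesIO()
    pickle.dump(ckpt, buf)
    return buf.getvalue()

def load_checkpoint(blob):
    ckpt = pickle.load(io.BytesIO(blob))
    random.setstate(ckpt["rng"])  # restore RNG so data order replays exactly
    return ckpt
```

Restoring the RNG state is what makes "resume at the last committed step" exact rather than approximate: the random stream after a restore continues identically to the run that wrote the checkpoint.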

Meta's 2024-06-12 GenAI infrastructure post names "efficient preservation of the training state" as one of four simultaneously-stressed properties at GPU-cluster scale:

"In the event of a failure, we need to be able to pick up where we left off. This means we need to regularly checkpoint our training state and efficiently store and retrieve training data." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)

Why it's a scaling problem, not just a feature

At small scale, checkpointing is a side task: dump the model occasionally and absorb the performance cost. At the GenAI scale Meta describes (24K GPUs), three assumptions stop holding:

  1. Failures are certain, not exceptional. More GPUs ⇒ more failures per unit time ⇒ the mean time to interruption (MTTI) shrinks below the job duration. Without frequent checkpoints, work is lost at a rate the cluster cannot afford.
  2. Checkpoint size grows with parameter count and optimizer state. A 70B-parameter model in bf16 is ~140 GB of weights alone; the Adam m/v buffers in fp32 add another ~560 GB. At 405B (Llama 3.1-class), those numbers grow by nearly 6×. Written naively, checkpointing would saturate both the GPU→host PCIe links and the host→storage fabric.
  3. Checkpoint cadence becomes a throughput / recovery-time tradeoff. Too frequent → training throughput suffers from stop-the-world checkpoint cost. Too infrequent → a failure costs hours of recomputation.
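The cadence tradeoff in point 3 has a classic first-order answer: the Young/Daly formula, a standard HPC result (not something Meta's post discloses). The interval that minimizes total overhead is τ ≈ √(2·C·M), where C is the cost of taking one checkpoint and M the mean time between failures:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young/Daly first-order approximation: tau = sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative numbers (not Meta's): a 5-minute checkpoint write and a
# cluster-wide MTBF of 3 hours.
tau = optimal_checkpoint_interval(300, 3 * 3600)  # ≈ 2545.6 s, about 42 minutes
```

The formula captures both failure modes in the prose: shrinking M (more GPUs) or growing C (bigger state) both pull the optimal interval around, and checkpointing too far from τ on either side wastes cluster time.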

Design axes a production training checkpoint must address

  • What to checkpoint — weights, optimizer, RNG, data-loader cursor, parallel-partitioning metadata.
  • How often — set by balancing expected lost work (failure rate × recomputation per failure) against per-checkpoint overhead.
  • Where to store — local NVMe for speed; parallel filesystem / object storage for durability; often both, tiered.
  • Sharded vs replicated — under concepts/3d-parallelism, each DP×TP×PP partition holds a shard of the state; sharded checkpoints map 1:1 to training partitions, but restoring at a different parallelism shape requires resharding.
  • Async vs sync — async checkpointing overlaps the write with the next training step; sync is simpler but lower throughput.
  • Resume fidelity — exact vs approximate resume; exact (bitwise) resume requires deterministic RNG state, the data-loader position, and a stable reduction order.
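The async-vs-sync axis can be sketched in a few lines, under the assumption (common in practice, though the post does not say what Meta does) that the stop-the-world part is reduced to an in-memory snapshot and the slow storage write runs on a background thread; `async_checkpoint` is a hypothetical helper:

```python
import copy
import os
import pickle
import tempfile
import threading

def async_checkpoint(state, path):
    # Stop-the-world part: an in-memory snapshot, so later training steps
    # cannot mutate what gets written.
    snapshot = copy.deepcopy(state)

    def _write():
        # Slow part: the storage write overlaps the next training steps.
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    writer = threading.Thread(target=_write)
    writer.start()
    return writer  # join before taking the *next* checkpoint

# Usage: training mutates state while the previous snapshot is still writing.
path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
state = {"step": 7, "weights": [1.0, 2.0]}
writer = async_checkpoint(state, path)
state["step"] += 1          # the next step proceeds concurrently
writer.join()               # the file holds step 7, not step 8
```

The design choice this illustrates: the snapshot copy still costs memory and a brief pause, but the expensive part (the write) no longer blocks training, which is why async checkpointing trades implementation complexity for throughput.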

Meta's post names the requirement but does not disclose implementation choices along these axes.

Distinct from other "checkpoint" concepts on this wiki
