CONCEPT Cited by 1 source
Training checkpoint¶
Definition¶
A training checkpoint is a periodic dump of the full state required to resume a distributed training job after a failure: model weights, optimizer state (Adam m/v buffers), learning-rate-scheduler state, data-loader position, random-number-generator seeds, and (often) parallelism-aware partitioning metadata. Restoring from a checkpoint lets training resume at or near the last committed step rather than restarting from scratch.
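The state enumerated above can be sketched as a single serializable record. This is an illustrative stdlib-only sketch, not any framework's actual API: real jobs checkpoint framework objects (e.g. PyTorch `state_dict`s) and per-rank shards, but the ingredients are the same.

```python
import io
import pickle
import random

# Hypothetical minimal training state covering the components named above:
# weights, Adam m/v buffers, LR-scheduler state, data-loader cursor, RNG.
def make_checkpoint(step, weights, adam_m, adam_v, lr_sched, loader_pos):
    return {
        "step": step,                         # last committed optimizer step
        "weights": weights,                   # model parameters
        "optim": {"m": adam_m, "v": adam_v},  # Adam first/second moments
        "lr_scheduler": lr_sched,             # e.g. warmup/decay counters
        "loader_pos": loader_pos,             # data-loader cursor for exact resume
        "rng": random.getstate(),             # RNG state for deterministic resume
    }

def save(ckpt, buf):
    pickle.dump(ckpt, buf)

def load(buf):
    ckpt = pickle.loads(buf.getvalue())
    random.setstate(ckpt["rng"])              # restore RNG before resuming
    return ckpt

buf = io.BytesIO()
ckpt = make_checkpoint(
    step=1000, weights=[0.1, 0.2], adam_m=[0.0, 0.0], adam_v=[0.0, 0.0],
    lr_sched={"warmup": 500}, loader_pos=42000,
)
save(ckpt, buf)
restored = load(buf)
assert restored["step"] == 1000   # training resumes at the committed step
```

Omitting any one component silently degrades resume fidelity: dropping the RNG or loader cursor, for instance, turns an exact resume into an approximate one.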
Meta's 2024-06-12 GenAI infrastructure post names "efficient preservation of the training state" as one of four simultaneously-stressed properties at GPU-cluster scale:
"In the event of a failure, we need to be able to pick up where we left off. This means we need to regularly checkpoint our training state and efficiently store and retrieve training data." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
Why it's a scaling problem, not just a feature¶
At small scale, checkpointing is a side task: dump the model occasionally and ignore the performance cost. At the GenAI scale Meta describes (24K GPUs), three assumptions stop holding:
- Failures are certain, not exceptional. More GPUs ⇒ more failures per unit time ⇒ mean time to interruption (MTTI) shrinks below the job's duration. Without frequent checkpoints, work is lost at a rate the cluster cannot afford.
- Checkpoint size grows with parameter count and optimizer state. A 70B model in bf16 is ~140 GB of weights alone; full Adam optimizer state in fp32 (m and v buffers) is another ~560 GB. At 405B (Llama 3.1-class), those figures grow nearly sixfold. Written naively, checkpointing would saturate both the GPU→host PCIe links and the host→storage fabric.
- Checkpoint cadence becomes a throughput / recovery-time tradeoff. Too frequent → training throughput suffers from stop-the-world checkpoint cost. Too infrequent → a failure costs hours of recomputation.
Design axes a production training checkpoint must address¶
- What to checkpoint — weights, optimizer, RNG, data-loader cursor, parallel-partitioning metadata.
- How often — set by balancing expected lost work (failure rate × interval) against per-checkpoint overhead.
- Where to store — local NVMe for speed; parallel filesystem / object storage for durability; often both, tiered.
- Sharded vs replicated — under concepts/3d-parallelism, each DP×TP×PP partition holds a shard of the state; sharded checkpoints map 1:1 to training partitions, but restoring at a different parallelism shape requires resharding the saved state.
- Async vs sync — async checkpointing overlaps the write with the next training step; sync is simpler but lower throughput.
- Resume fidelity — exact vs approximate resume; exact resume requires deterministic RNG state, an exact data-loader position, and a deterministic reduction order.
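The async-vs-sync axis can be sketched with a background writer thread. This is an assumed design for illustration, not Meta's disclosed implementation: take a cheap synchronous snapshot at the step boundary, then persist it asynchronously so the slow write overlaps the next training step (real systems snapshot GPU state to host memory, then stream it to storage).

```python
import copy
import os
import pickle
import tempfile
import threading

def snapshot(state):
    return copy.deepcopy(state)       # stand-in for the GPU -> host copy

def persist_async(snap, path):
    def _write():
        with open(path, "wb") as f:
            pickle.dump(snap, f)      # stand-in for the host -> storage write
    t = threading.Thread(target=_write)
    t.start()
    return t                          # caller drains writers before exit

ckpt_dir = tempfile.mkdtemp()
state = {"step": 0, "weights": [0.0]}
writers = []
for _ in range(3):
    state["step"] += 1                # one "training step"
    state["weights"][0] += 0.1
    snap = snapshot(state)            # the only stop-the-world portion
    path = os.path.join(ckpt_dir, f"step{snap['step']}.ckpt")
    writers.append(persist_async(snap, path))  # write overlaps next step
for t in writers:                     # drain outstanding writes
    t.join()
with open(os.path.join(ckpt_dir, "step3.ckpt"), "rb") as f:
    restored = pickle.load(f)
assert restored["step"] == 3
```

The snapshot is what keeps the checkpoint consistent: the training loop may mutate `state` freely while the previous snapshot is still being written.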
Meta's post names the requirement but does not disclose implementation choices along these axes.
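The resharding cost hiding in the "sharded vs replicated" axis can be made concrete with a toy flat parameter vector. Everything here is hypothetical scaffolding for illustration; real resharding must also remap optimizer shards and honor TP/PP layout metadata, not just re-split a flat array:

```python
# Hypothetical sharded-checkpoint resharding sketch: each training rank
# saves its contiguous slice of a flat parameter vector; restoring at a
# different data-parallel width gathers the shards and re-partitions them.
def save_shards(flat_params, world_size):
    n = len(flat_params)
    per = (n + world_size - 1) // world_size        # ceil-divide per rank
    return [flat_params[r * per:(r + 1) * per] for r in range(world_size)]

def load_resharded(shards, new_world_size):
    flat = [p for shard in shards for p in shard]   # gather old layout
    per = (len(flat) + new_world_size - 1) // new_world_size
    return [flat[r * per:(r + 1) * per] for r in range(new_world_size)]

params = list(range(8))                     # toy "flat" parameter tensor
old = save_shards(params, world_size=4)     # saved from a 4-rank job
new = load_resharded(old, new_world_size=2) # restored onto a 2-rank job
assert new == [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The gather step is why resharding is expensive at scale: restoring at a new shape forces an all-to-all style redistribution of checkpoint data rather than a 1:1 rank-to-file read.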
Distinct from other "checkpoint" concepts on this wiki¶
- concepts/first-class-checkpoint-restore — a sandbox-VM pattern (e.g. Fly Sprites) where checkpoint/restore is an ordinary-course user primitive. Different primitive, different purpose.
- concepts/fast-checkpoint-via-metadata-shuffle — an object-store-backed VFS checkpoint pattern; again different primitive.
- This concept is the distributed-training checkpoint — specific to multi-GPU deep-learning workloads.
Seen in¶
- sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — Meta names checkpoint/recovery as one of four stressed properties at 24K-GPU scale. No implementation disclosure.
Related¶
- concepts/hardware-reliability-at-scale — the failure-rate driver that makes checkpointing load-bearing.
- concepts/gpu-training-failure-modes — the specific failure modes checkpoint/recovery must handle.
- concepts/3d-parallelism / concepts/data-parallelism — the parallelism structure that shapes sharded checkpoint layout.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband — the deployments at which the problem manifests.