

Training checkpoint

Definition

A training checkpoint is a periodic dump of the full state required to resume a distributed training job after a failure: model weights, optimizer state (Adam m/v buffers), learning-rate-scheduler state, data-loader position, random-number-generator seeds, and (often) parallelism-aware partitioning metadata. Restoring from a checkpoint lets training resume at or near the last committed step rather than restarting from scratch.
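The components listed above can be sketched as a plain Python structure. This is a minimal illustration under assumed field names (`make_checkpoint`, the `m`/`v` keys, and the dict layout are hypothetical, not from Meta's undisclosed implementation); a real system serializes sharded tensors rather than pickling one dict:

```python
import io
import pickle
import random

def make_checkpoint(step, weights, adam_m, adam_v, lr, dataloader_cursor):
    # Hypothetical layout: every field needed to resume exactly at `step`.
    return {
        "step": step,                                 # last committed optimizer step
        "model": weights,                             # model parameters
        "optimizer": {"m": adam_m, "v": adam_v},      # Adam first/second moments
        "lr_scheduler": {"last_lr": lr},              # LR-scheduler state
        "dataloader": {"cursor": dataloader_cursor},  # position in the data stream
        "rng": random.getstate(),                     # RNG state for exact resume
    }

def save_checkpoint(ckpt):
    buf = io.BytesIO()
    pickle.dump(ckpt, buf)
    return buf.getvalue()

def load_checkpoint(blob):
    ckpt = pickle.load(io.BytesIO(blob))
    random.setstate(ckpt["rng"])  # restore RNG so data order replays exactly
    return ckpt
```

Restoring the RNG state is what makes "resume at the last committed step" exact rather than approximate: the random stream after a restore continues identically to the run that wrote the checkpoint.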

Meta's 2024-06-12 GenAI infrastructure post names "efficient preservation of the training state" as one of four simultaneously-stressed properties at GPU-cluster scale:

"In the event of a failure, we need to be able to pick up where we left off. This means we need to regularly checkpoint our training state and efficiently store and retrieve training data." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)

Why it's a scaling problem, not just a feature

At small scale, checkpointing is a side task: dump the model occasionally and absorb the performance cost. At the GenAI scale Meta describes (24K GPUs), three assumptions stop holding:

  1. Failures are certain, not exceptional. More GPUs ⇒ more failures per unit time ⇒ the mean time to interruption (MTTI) shrinks below the job duration. Without frequent checkpoints, work is lost at a rate the cluster cannot afford.
  2. Checkpoint size grows with parameter count and optimizer state. A 70B-parameter model in bf16 is ~140 GB of weights alone; the Adam m/v buffers in fp32 add another ~560 GB. At 405B (Llama 3.1-class), those numbers grow by nearly 6×. Written naively, checkpointing would saturate both the GPU→host PCIe links and the host→storage fabric.
  3. Checkpoint cadence becomes a throughput / recovery-time tradeoff. Too frequent → training throughput suffers from stop-the-world checkpoint cost. Too infrequent → a failure costs hours of recomputation.
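The cadence tradeoff in point 3 has a classic first-order answer: the Young/Daly formula, a standard HPC result (not something Meta's post discloses). The interval that minimizes total overhead is τ ≈ √(2·C·M), where C is the cost of taking one checkpoint and M the mean time between failures:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young/Daly first-order approximation: tau = sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative numbers (not Meta's): a 5-minute checkpoint write and a
# cluster-wide MTBF of 3 hours.
tau = optimal_checkpoint_interval(300, 3 * 3600)  # ≈ 2545.6 s, about 42 minutes
```

The formula captures both failure modes in the prose: shrinking M (more GPUs) or growing C (bigger state) both pull the optimal interval around, and checkpointing too far from τ on either side wastes cluster time.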

Design axes a production training checkpoint must address

  • What to checkpoint — weights, optimizer, RNG, data-loader cursor, parallel-partitioning metadata.
  • How often — set by balancing expected lost work (failure rate × recomputation per failure) against per-checkpoint overhead.
  • Where to store — local NVMe for speed; parallel filesystem / object storage for durability; often both, tiered.
  • Sharded vs replicated — under concepts/3d-parallelism, each DP×TP×PP partition holds a shard of the state; sharded checkpoints map 1:1 to training partitions, but restoring at a different parallelism shape requires resharding.
  • Async vs sync — async checkpointing overlaps the write with the next training step; sync is simpler but lower throughput.
  • Resume fidelity — exact vs approximate resume; exact (bitwise) resume requires deterministic RNG state, the data-loader position, and a stable reduction order.
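The async-vs-sync axis can be sketched in a few lines, under the assumption (common in practice, though the post does not say what Meta does) that the stop-the-world part is reduced to an in-memory snapshot and the slow storage write runs on a background thread; `async_checkpoint` is a hypothetical helper:

```python
import copy
import os
import pickle
import tempfile
import threading

def async_checkpoint(state, path):
    # Stop-the-world part: an in-memory snapshot, so later training steps
    # cannot mutate what gets written.
    snapshot = copy.deepcopy(state)

    def _write():
        # Slow part: the storage write overlaps the next training steps.
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    writer = threading.Thread(target=_write)
    writer.start()
    return writer  # join before taking the *next* checkpoint

# Usage: training mutates state while the previous snapshot is still writing.
path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
state = {"step": 7, "weights": [1.0, 2.0]}
writer = async_checkpoint(state, path)
state["step"] += 1          # the next step proceeds concurrently
writer.join()               # the file holds step 7, not step 8
```

The design choice this illustrates: the snapshot copy still costs memory and a brief pause, but the expensive part (the write) no longer blocks training, which is why async checkpointing trades implementation complexity for throughput.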

Meta's post names the requirement but does not disclose implementation choices along these axes.

Distinct from other "checkpoint" concepts on this wiki
