
PATTERN

On-the-fly async sequence packing

Intent

Eliminate FSDP straggler stalls from long-tail sequence-length distributions in LLM training — without paying the offline-preprocessing + dataset-staleness cost of offline bin-packing — by streaming samples from storage, packing them in memory into fixed-length sequences with document masks, and running the packing asynchronously on CPU so it overlaps GPU compute.

First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix (reports up to 4.7× effective token throughput on the most sequence-length-skewed internal dataset).

Problem

Two approaches to variable-length training that this pattern rejects:

Naive: pad to longest in batch

  • Cheap to implement.
  • Wastes compute on padding tokens.
  • Straggler stalls under FSDP: faster workers block at sync points waiting on the worker that drew the longest sample.

Offline bin-packing

  • Pre-compute packed sequences before training starts.
  • Addresses padding waste + straggler stalls.
  • At Netflix-scale datasets, "adds substantial preprocessing latency and makes it harder to keep datasets fresh" (source).

Neither is acceptable for a production post-training framework that serves sequence-length-skewed datasets at scale.
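
As a toy back-of-the-envelope illustration (the batch lengths below are invented, not from the source), compare the useful-token fraction under pad-to-longest with packing into fixed 32K sequences:

# Toy numbers, not from the source: useful-token fraction under
# pad-to-longest vs. packing into fixed-length 32K sequences.
lengths = [512, 768, 1024, 2048, 30_000]        # hypothetical skewed batch

# Pad to longest: every sample is padded to the longest sample in the batch.
padded_total = len(lengths) * max(lengths)
print(f"pad-to-longest useful fraction: {sum(lengths) / padded_total:.2f}")   # ~0.23

# First-fit-decreasing packing into 32K targets: short samples share a sequence.
TARGET = 32_768
bins: list[int] = []
for n in sorted(lengths, reverse=True):
    for i, used in enumerate(bins):
        if used + n <= TARGET:
            bins[i] += n
            break
    else:
        bins.append(n)
print(f"packed useful fraction: {sum(lengths) / (len(bins) * TARGET):.2f}")   # ~0.52 here; approaches 1.0 as a continuous stream keeps filling bins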

Solution

On-the-fly async packing:

Storage (S3/local)
      │ stream samples
┌────────────────────────────────────┐
│ Dataloader worker (CPU)            │
│   - consumes stream                │
│   - bin-packs samples into         │
│     fixed-length target (e.g. 32K) │
│   - builds document mask           │
│     (block-diagonal attention)     │
└────────────────────────────────────┘
      │ pre-packed batches
┌────────────────────────────────────┐
│ GPU training step                  │
│   - forward/backward               │
│   - FlashAttention-varlen reads    │
│     document mask as cu_seqlens    │
└────────────────────────────────────┘

Timeline:
  GPU:  [ step k ][ step k+1 ][ step k+2 ]
  CPU:  [ pack k+1 ][ pack k+2 ][ pack k+3 ]
          ^ CPU packing overlaps GPU compute

Core design commitments:

  1. Stream, don't pre-process. Samples are read from storage on demand. No offline preprocessing job; dataset freshness is preserved.
  2. Pack in memory. Multiple short samples → one fixed-length sequence. The heuristic is typically greedy / first-fit-decreasing bin-packing by token count.
  3. Document mask. Attention is block-diagonal across packed samples, so there is no cross-sample information leakage. FlashAttention-family varlen kernels accept a cu_seqlens array that expresses exactly this (see the sketch after this list).
  4. Async on CPU. Packing runs in dataloader workers while GPU consumes the previously-packed batch. CPU time ≤ GPU time per batch → zero GPU idle.
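
A minimal sketch of commitments 1-3, assuming samples arrive as already-tokenized lists of token ids (the names pack_stream, TARGET_LEN, and PAD_ID are illustrative, not the framework's actual API):

from typing import Iterable, Iterator

import torch

TARGET_LEN = 32_768   # fixed-length packing target (32K, as in the diagram above)
PAD_ID = 0            # assumed pad token id

def pack_stream(samples: Iterable[list[int]]) -> Iterator[dict[str, torch.Tensor]]:
    """Greedily pack streamed token-id lists into fixed-length sequences.

    Each yielded dict holds:
      input_ids   (TARGET_LEN,)      token ids, padded at the tail
      cu_seqlens  (num_docs + 1,)    int32 cumulative document boundaries --
                                     the document mask in the form varlen
                                     attention kernels consume
    """
    buf: list[int] = []
    boundaries: list[int] = [0]

    def flush() -> dict[str, torch.Tensor]:
        # Note: the padded tail is not covered by cu_seqlens; a real
        # implementation either appends it as a dummy document or slices
        # it off before attention (the handling here is an assumption).
        ids = buf + [PAD_ID] * (TARGET_LEN - len(buf))
        return {
            "input_ids": torch.tensor(ids, dtype=torch.long),
            "cu_seqlens": torch.tensor(boundaries, dtype=torch.int32),
        }

    for tokens in samples:
        tokens = tokens[:TARGET_LEN]              # truncation policy is an assumption
        if len(buf) + len(tokens) > TARGET_LEN:   # greedy: start a new pack when the sample won't fit
            yield flush()
            buf, boundaries = [], [0]
        buf.extend(tokens)
        boundaries.append(len(buf))
    if buf:
        yield flush()

Because every emitted pack has the same TARGET_LEN shape, each FSDP worker does the same amount of work per step, and the cu_seqlens boundaries are what keep attention block-diagonal within a pack.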

Where the 4.7× comes from

Figure 5 in the Netflix post reports the win on the most skewed internal dataset, on both A100 and H200 GPUs. Three compounding effects:

  • No padding waste — useful tokens ≈ total tokens in each batch.
  • No straggler stalls — all workers see fixed-length sequences; FSDP syncs don't block on one unlucky long sample.
  • No GPU idle — CPU packing overlaps GPU compute; the dataloader never gatekeeps.

Throughput gain is dataset-dependent; uniformly long datasets see smaller wins. The pattern is specifically targeted at the long-tail-distribution case.
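
One way to realize the CPU/GPU overlap in the timeline above, sketched with standard PyTorch primitives (the source does not describe its exact mechanism; pack_stream is the function from the earlier sketch, and my_sample_source is a stand-in for the streaming reader):

import torch
from torch.utils.data import DataLoader, IterableDataset

def my_sample_source():
    # Stand-in for the streaming reader (S3/local); yields token-id lists.
    for n in (512, 768, 2048, 4096):
        yield list(range(n))

class PackedStream(IterableDataset):
    """Runs pack_stream (sketched earlier) inside dataloader worker processes."""

    def __init__(self, sample_source):
        self.sample_source = sample_source

    def __iter__(self):
        # Executed in a CPU worker process: packs the next sequences while
        # the GPU is busy with the current step. (Sharding the stream across
        # workers via torch.utils.data.get_worker_info() is omitted here.)
        yield from pack_stream(self.sample_source())

loader = DataLoader(
    PackedStream(my_sample_source),
    batch_size=None,      # one packed sequence per item; batching several packs
                          # would need a custom collate_fn, since cu_seqlens lengths vary
    num_workers=2,        # CPU packing parallelism; keep pack time <= GPU step time
    prefetch_factor=2,    # packed items each worker keeps ready ahead of the GPU
    pin_memory=True,
)

for packed in loader:
    input_ids = packed["input_ids"].cuda(non_blocking=True)
    cu_seqlens = packed["cu_seqlens"].cuda(non_blocking=True)
    # ... forward/backward on the packed sequence ...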

Applicability

  • ✅ Datasets with skewed sequence-length distributions (common for chat, CoT, mixed-source corpora).
  • ✅ Frameworks using FSDP or similar SPMD training where worker sync cost is load-bearing.
  • ✅ Datasets too large for offline preprocessing to be economical.
  • ❌ Datasets that are uniformly long or small enough that offline pre-packing is cheap.
  • ❌ Non-SPMD training loops where straggler stalls aren't a dominant cost.

Trade-offs

Benefit                                                     Cost
Up to 4.7× effective token throughput on skewed datasets    More complex dataloader code path
Dataset freshness preserved (no offline preprocessing)      CPU workers compete with other CPU work for cores
Compatible with streaming from cloud storage                Requires FlashAttention-varlen or equivalent attention kernel
Fixed-length shapes across workers → no stragglers          Document mask must be correctly wired through attention
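
Wiring the document mask through attention, as the last two rows above require: a sketch against the flash-attn package's varlen entry point (the packed shapes and the example cu_seqlens values here are assumptions):

import torch
from flash_attn import flash_attn_varlen_func  # requires the flash-attn package

# Packed projections: (total_tokens, n_heads, head_dim), no batch dimension --
# all documents in the pack are concatenated along the token axis.
total_tokens, n_heads, head_dim = 32_768, 32, 128
q = torch.randn(total_tokens, n_heads, head_dim, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# cu_seqlens from the packing step: int32 cumulative document boundaries on GPU.
cu_seqlens = torch.tensor([0, 512, 1280, 3328, 32_768], dtype=torch.int32, device="cuda")
max_seqlen = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())

# Each [cu_seqlens[i], cu_seqlens[i+1]) span is treated as its own sequence,
# i.e. block-diagonal attention: no cross-document leakage within the pack.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
    causal=True,
)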

Known uses

  • Netflix's LLM post-training framework (see the source above): up to 4.7× effective token throughput on its most sequence-length-skewed internal dataset, on both A100 and H200 GPUs.
