# On-the-fly async sequence packing
## Intent
Eliminate FSDP straggler stalls from long-tail sequence-length distributions in LLM training — without paying the offline-preprocessing + dataset-staleness cost of offline bin-packing — by streaming samples from storage, packing them in memory into fixed-length sequences with document masks, and running the packing asynchronously on CPU so it overlaps GPU compute.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — up to 4.7× effective token throughput on the most sequence-length-skewed internal dataset.
## Problem
Two approaches to variable-length training that this pattern rejects:
### Naive: pad to longest in batch
- Cheap to implement.
- Wastes compute on padding tokens (toy calculation after this list).
- Straggler stalls under FSDP: faster workers block at sync points waiting on the worker that drew the longest sample.
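To make the waste concrete, a toy calculation with one long-tail sample in the batch (the lengths are illustrative, not from the source):

```python
# Pad-to-longest: every sample in the batch is padded to the longest one.
lens = [512, 640, 480, 560, 8192]   # illustrative: one long-tail outlier
useful = sum(lens)                  # real tokens
padded = len(lens) * max(lens)      # tokens actually pushed through the GPU
print(f"utilization: {useful / padded:.1%}")  # ~25%: 3 of 4 FLOPs hit padding
```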
### Offline bin-packing
- Pre-compute packed sequences before training starts.
- Addresses padding waste + straggler stalls.
- At Netflix-scale datasets, "adds substantial preprocessing latency and makes it harder to keep datasets fresh" (source).
Neither is acceptable for a production post-training framework that serves sequence-length-skewed datasets at scale.
## Solution
On-the-fly async packing:
```
Storage (S3/local)
        │  stream samples
        ▼
┌────────────────────────────────────┐
│ Dataloader worker (CPU)            │
│  - consumes stream                 │
│  - bin-packs samples into          │
│    fixed-length target (e.g. 32K)  │
│  - builds document mask            │
│    (block-diagonal attention)      │
└────────────────────────────────────┘
        │  pre-packed batches
        ▼
┌────────────────────────────────────┐
│ GPU training step                  │
│  - forward/backward                │
│  - FlashAttention-varlen reads     │
│    document mask as cu_seqlens     │
└────────────────────────────────────┘
```
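On the consuming side, a sketch of the GPU step's attention call against the flash-attn library's `flash_attn_varlen_func` (a real API; the shapes and boundary values here are illustrative, not Netflix's code):

```python
import torch
from flash_attn import flash_attn_varlen_func

# One packed 32K sequence holding several documents; cu_seqlens marks the
# cumulative document boundaries, e.g. [0, 1500, 4200, 32768].
total_tokens, n_heads, head_dim = 32768, 32, 128
q = torch.randn(total_tokens, n_heads, head_dim,
                device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
cu_seqlens = torch.tensor([0, 1500, 4200, 32768],
                          device="cuda", dtype=torch.int32)
max_seqlen = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())

# causal=True plus per-document cu_seqlens yields block-diagonal causal
# attention: tokens attend only within their own document, never across
# the pack.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
    causal=True,
)
```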
Timeline:

```
GPU:  [ step k   ][ step k+1 ][ step k+2 ]
CPU:  [ pack k+1 ][ pack k+2 ][ pack k+3 ]
              ^ CPU packing overlaps GPU compute
```
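The overlap needs no bespoke machinery in PyTorch: dataloader workers are separate CPU processes, so the packing stays off the training process as long as it lives in the worker's data path. A minimal sketch, assuming the `pack_stream` generator defined after the list below (`my_s3_stream` is an illustrative stand-in for the storage reader):

```python
from torch.utils.data import DataLoader, IterableDataset

class PackedDataset(IterableDataset):
    """Streams samples and yields fixed-length packed batches.

    __iter__ runs inside each dataloader worker process, so the CPU
    packing of batch k+1 overlaps the GPU's training step on batch k.
    NB: real code also shards the stream per worker via get_worker_info().
    """
    def __init__(self, sample_stream, target_len=32768):
        self.sample_stream = sample_stream
        self.target_len = target_len

    def __iter__(self):
        # pack_stream: greedy packer sketched after the list below
        yield from pack_stream(self.sample_stream(), self.target_len)

loader = DataLoader(
    PackedDataset(my_s3_stream),  # my_s3_stream: illustrative storage reader
    batch_size=None,              # packing already fixed the shape
    num_workers=4,                # packing processes run ahead of the GPU
    prefetch_factor=2,            # keep packed batches queued
    pin_memory=True,
)
```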
Core design commitments:
- Stream, don't pre-process. Samples are read from storage on demand. No offline preprocessing job; dataset freshness is preserved.
- Pack in memory. Multiple short samples go into one fixed-length sequence; the heuristic is typically greedy or first-fit-decreasing bin-packing by token count (see the sketch after this list).
- Document mask. Attention is block-diagonal across packed samples, so there is no cross-sample information leakage. FlashAttention-family varlen kernels accept a `cu_seqlens` array that expresses exactly this.
- Async on CPU. Packing runs in dataloader workers (sketch above) while the GPU consumes the previously packed batch. As long as CPU packing time per batch stays at or below GPU step time, the GPU never idles.
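A minimal sketch of a packer honoring these commitments, using the simplest greedy variant (next-fit; the source only names the heuristic family, and `PAD_ID` plus the tail-padding policy are assumptions):

```python
import torch

PAD_ID = 0  # illustrative pad token id

def pack_stream(samples, target_len=32768):
    """Greedily pack streamed token lists into fixed-length sequences.

    Yields (tokens, cu_seqlens): a (target_len,) token tensor plus the
    int32 cumulative document boundaries a varlen attention kernel reads.
    """
    buf, bounds = [], [0]
    for sample in samples:                       # streamed from S3/local
        sample = sample[:target_len]             # clip pathological outliers
        if len(buf) + len(sample) > target_len:  # bin is full: flush it
            if len(buf) < target_len:            # pad tail as its own "doc"
                buf.extend([PAD_ID] * (target_len - len(buf)))
                bounds.append(target_len)
            yield torch.tensor(buf), torch.tensor(bounds, dtype=torch.int32)
            buf, bounds = [], [0]
        buf.extend(sample)
        bounds.append(len(buf))
    if buf:                                      # flush the final partial bin
        if len(buf) < target_len:
            buf.extend([PAD_ID] * (target_len - len(buf)))
            bounds.append(target_len)
        yield torch.tensor(buf), torch.tensor(bounds, dtype=torch.int32)
```

Emitting the pad region as its own document keeps the mask from letting real tokens attend into padding; in practice those positions are also excluded from the loss. Production packers typically hold several open bins (first-fit) for tighter packing than this single-bin version.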
## Where the 4.7× comes from
Figure 5 in the Netflix post reports the win on the most skewed internal dataset, on both A100 and H200 GPUs. Three compounding effects:
- No padding waste — useful tokens ≈ total tokens in each batch.
- No straggler stalls — all workers see fixed-length sequences; FSDP syncs don't block on one unlucky long sample.
- No GPU idle — CPU packing overlaps GPU compute; the dataloader is never the bottleneck.
Throughput gain is dataset-dependent — uniformly-long datasets see smaller wins. The pattern is specifically targeted at the long-tail-distribution case.
## Applicability
- ✅ Datasets with skewed sequence-length distributions (common for chat, CoT, mixed-source corpora).
- ✅ Frameworks using FSDP or similar SPMD training where worker sync cost is load-bearing.
- ✅ Datasets too large for offline preprocessing to be economical.
- ❌ Datasets that are uniformly long or small enough that offline pre-packing is cheap.
- ❌ Non-SPMD training loops where straggler stalls aren't a dominant cost.
## Trade-offs
| Benefit | Cost |
|---|---|
| Up to 4.7× effective token throughput on skewed datasets | More complex dataloader code path |
| Dataset freshness preserved (no offline preprocessing) | CPU workers compete with other CPU work for cores |
| Compatible with streaming from cloud storage | Requires FlashAttention-varlen or equivalent attention kernel |
| Fixed-length shapes across workers → no stragglers | Document mask must be correctly wired through attention |
## Known uses
- Netflix Post-Training Framework (2026-02) — canonical instance. Reported up to 4.7× throughput on the most skewed internal dataset on A100 + H200.