# On-the-fly async sequence packing
## Intent
Eliminate FSDP straggler stalls from long-tail sequence-length distributions in LLM training — without paying the offline-preprocessing + dataset-staleness cost of offline bin-packing — by streaming samples from storage, packing them in memory into fixed-length sequences with document masks, and running the packing asynchronously on CPU so it overlaps GPU compute.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — up to 4.7× effective token throughput on the most sequence-length-skewed internal dataset.
## Problem
Two approaches to variable-length training that this pattern rejects:
### Naive: pad to longest in batch
- Cheap to implement.
- Wastes compute on padding tokens (toy calculation after this list).
- Straggler stalls under FSDP: faster workers block at sync points waiting on the worker that drew the longest sample.
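To make the waste concrete, a toy calculation with one long-tail sample in the batch (the lengths are illustrative, not from the source):

```python
# Pad-to-longest: every sample in the batch is padded to the longest one.
lens = [512, 640, 480, 560, 8192]   # illustrative: one long-tail outlier
useful = sum(lens)                  # real tokens
padded = len(lens) * max(lens)      # tokens actually pushed through the GPU
print(f"utilization: {useful / padded:.1%}")  # ~25%: 3 of 4 FLOPs hit padding
```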
### Offline bin-packing
- Pre-compute packed sequences before training starts.
- Addresses padding waste + straggler stalls.
- At Netflix-scale datasets, "adds substantial preprocessing latency and makes it harder to keep datasets fresh" (source).
Neither is acceptable for a production post-training framework that serves sequence-length-skewed datasets at scale.
## Solution
On-the-fly async packing:
```
Storage (S3/local)
        │  stream samples
        ▼
┌────────────────────────────────────┐
│ Dataloader worker (CPU)            │
│  - consumes stream                 │
│  - bin-packs samples into          │
│    fixed-length target (e.g. 32K)  │
│  - builds document mask            │
│    (block-diagonal attention)      │
└────────────────────────────────────┘
        │  pre-packed batches
        ▼
┌────────────────────────────────────┐
│ GPU training step                  │
│  - forward/backward                │
│  - FlashAttention-varlen reads     │
│    document mask as cu_seqlens     │
└────────────────────────────────────┘
```
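On the consuming side, a sketch of the GPU step's attention call against the flash-attn library's `flash_attn_varlen_func` (a real API; the shapes and boundary values here are illustrative, not Netflix's code):

```python
import torch
from flash_attn import flash_attn_varlen_func

# One packed 32K sequence holding several documents; cu_seqlens marks the
# cumulative document boundaries, e.g. [0, 1500, 4200, 32768].
total_tokens, n_heads, head_dim = 32768, 32, 128
q = torch.randn(total_tokens, n_heads, head_dim,
                device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
cu_seqlens = torch.tensor([0, 1500, 4200, 32768],
                          device="cuda", dtype=torch.int32)
max_seqlen = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())

# causal=True plus per-document cu_seqlens yields block-diagonal causal
# attention: tokens attend only within their own document, never across
# the pack.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
    causal=True,
)
```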
Timeline:

```
GPU:  [ step k   ][ step k+1 ][ step k+2 ]
CPU:  [ pack k+1 ][ pack k+2 ][ pack k+3 ]
              ^ CPU packing overlaps GPU compute
```
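The overlap needs no bespoke machinery in PyTorch: dataloader workers are separate CPU processes, so the packing stays off the training process as long as it lives in the worker's data path. A minimal sketch, assuming the `pack_stream` generator defined after the list below (`my_s3_stream` is an illustrative stand-in for the storage reader):

```python
from torch.utils.data import DataLoader, IterableDataset

class PackedDataset(IterableDataset):
    """Streams samples and yields fixed-length packed batches.

    __iter__ runs inside each dataloader worker process, so the CPU
    packing of batch k+1 overlaps the GPU's training step on batch k.
    NB: real code also shards the stream per worker via get_worker_info().
    """
    def __init__(self, sample_stream, target_len=32768):
        self.sample_stream = sample_stream
        self.target_len = target_len

    def __iter__(self):
        # pack_stream: greedy packer sketched after the list below
        yield from pack_stream(self.sample_stream(), self.target_len)

loader = DataLoader(
    PackedDataset(my_s3_stream),  # my_s3_stream: illustrative storage reader
    batch_size=None,              # packing already fixed the shape
    num_workers=4,                # packing processes run ahead of the GPU
    prefetch_factor=2,            # keep packed batches queued
    pin_memory=True,
)
```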
Core design commitments:
- Stream, don't pre-process. Samples are read from storage on demand. No offline preprocessing job; dataset freshness is preserved.
- Pack in memory. Multiple short samples go into one fixed-length sequence; the heuristic is typically greedy or first-fit-decreasing bin-packing by token count (see the sketch after this list).
- Document mask. Attention is block-diagonal across packed samples, so there is no cross-sample information leakage. FlashAttention-family varlen kernels accept a `cu_seqlens` array that expresses exactly this.
- Async on CPU. Packing runs in dataloader workers (sketch above) while the GPU consumes the previously packed batch. As long as CPU packing time per batch stays at or below GPU step time, the GPU never idles.
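A minimal sketch of a packer honoring these commitments, using the simplest greedy variant (next-fit; the source only names the heuristic family, and `PAD_ID` plus the tail-padding policy are assumptions):

```python
import torch

PAD_ID = 0  # illustrative pad token id

def pack_stream(samples, target_len=32768):
    """Greedily pack streamed token lists into fixed-length sequences.

    Yields (tokens, cu_seqlens): a (target_len,) token tensor plus the
    int32 cumulative document boundaries a varlen attention kernel reads.
    """
    buf, bounds = [], [0]
    for sample in samples:                       # streamed from S3/local
        sample = sample[:target_len]             # clip pathological outliers
        if len(buf) + len(sample) > target_len:  # bin is full: flush it
            if len(buf) < target_len:            # pad tail as its own "doc"
                buf.extend([PAD_ID] * (target_len - len(buf)))
                bounds.append(target_len)
            yield torch.tensor(buf), torch.tensor(bounds, dtype=torch.int32)
            buf, bounds = [], [0]
        buf.extend(sample)
        bounds.append(len(buf))
    if buf:                                      # flush the final partial bin
        if len(buf) < target_len:
            buf.extend([PAD_ID] * (target_len - len(buf)))
            bounds.append(target_len)
        yield torch.tensor(buf), torch.tensor(bounds, dtype=torch.int32)
```

Emitting the pad region as its own document keeps the mask from letting real tokens attend into padding; in practice those positions are also excluded from the loss. Production packers typically hold several open bins (first-fit) for tighter packing than this single-bin version.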
## Where the 4.7× comes from
Figure 5 in the Netflix post reports the win on the most skewed internal dataset, on both A100 and H200 GPUs. Three compounding effects:
- No padding waste — useful tokens ≈ total tokens in each batch.
- No straggler stalls — all workers see fixed-length sequences; FSDP syncs don't block on one unlucky long sample.
- No GPU idle — CPU packing overlaps GPU compute; the dataloader is never the bottleneck.
Throughput gain is dataset-dependent — uniformly-long datasets see smaller wins. The pattern is specifically targeted at the long-tail-distribution case.
## Applicability
- ✅ Datasets with skewed sequence-length distributions (common for chat, CoT, mixed-source corpora).
- ✅ Frameworks using FSDP or similar SPMD training where worker sync cost is load-bearing.
- ✅ Datasets too large for offline preprocessing to be economical.
- ❌ Datasets that are uniformly long or small enough that offline pre-packing is cheap.
- ❌ Non-SPMD training loops where straggler stalls aren't a dominant cost.
## Trade-offs
| Benefit | Cost |
|---|---|
| Up to 4.7× effective token throughput on skewed datasets | More complex dataloader code path |
| Dataset freshness preserved (no offline preprocessing) | CPU workers compete with other CPU work for cores |
| Compatible with streaming from cloud storage | Requires FlashAttention-varlen or equivalent attention kernel |
| Fixed-length shapes across workers → no stragglers | Document mask must be correctly wired through attention |
## Known uses
- Netflix Post-Training Framework (2026-02) — canonical instance. Reported up to 4.7× throughput on the most skewed internal dataset on A100 + H200.