Asynchronous sequence packing¶
Definition¶
Asynchronous sequence packing is a dataloader pattern for variable-length LLM training: pack multiple short samples into fixed-length sequences with a document mask that prevents attention across sample boundaries, and run the packing step asynchronously on CPU so that packing overlaps with GPU compute rather than serialising with it.
The technique tackles the straggler problem that FSDP-style training has with long-tail sequence length distributions: faster workers block at sync points waiting for the worker that happened to get the longest sample.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix, where Netflix reports up to a 4.7× gain in effective token throughput on its most sequence-length-skewed internal dataset.
The problem it solves¶
Per Netflix:
"Padding within a batch can waste compute, and uneven shapes across FSDP workers can cause GPU synchronization overhead. [...] In FSDP-style training, long-tail sequences create stragglers: faster workers end up waiting at synchronization points for the slowest batch, lowering utilization. Standard bin-packing approaches help, but doing them offline at our data scale can add substantial preprocessing latency and make it harder to keep datasets fresh." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
Two naive approaches and their failure modes:
- Pad to the longest sample in the batch: wastes compute on padding tokens.
- Pre-pack offline: at Netflix's data scale, the added preprocessing latency is substantial and makes it harder to keep datasets fresh.
How async packing works¶
- Stream samples from cloud/disk storage (no offline preprocessing required).
- Pack multiple short samples in memory up to a fixed target length (e.g. 8K or 32K tokens, or the model's context length). A typical heuristic is greedy or first-fit-decreasing bin packing, with each sample's token count as its weight (see the first sketch below).
- A document mask prevents attention between packed samples: conceptually the attention mask is block-diagonal, with one block per source sample. FlashAttention-family varlen kernels support this natively via cumulative-seqlen arrays (see the second sketch below).
- Async on CPU: packing runs on CPU workers while the GPU consumes the previously packed batch, so CPU and GPU time overlap rather than serialise (see the third sketch below).
```
GPU: [ step k   ][ step k+1 ][ step k+2 ]
CPU: [ pack k+1 ][ pack k+2 ][ pack k+3 ]
       ^ packing runs during step k's compute
```
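A minimal sketch of the first-fit-decreasing packing heuristic, assuming samples arrive as lists of token ids (function name and signature are illustrative, not Netflix's implementation):

```python
def pack_ffd(samples: list[list[int]], target_len: int) -> list[list[list[int]]]:
    """First-fit-decreasing bin packing: visit samples longest-first and place
    each into the first pack with enough room, opening a new pack if none fits."""
    packs: list[list[list[int]]] = []   # each pack is a list of samples
    free: list[int] = []                # remaining capacity of each pack
    for sample in sorted(samples, key=len, reverse=True):
        if len(sample) > target_len:
            raise ValueError("sample longer than target length")
        for i, room in enumerate(free):
            if len(sample) <= room:
                packs[i].append(sample)
                free[i] -= len(sample)
                break
        else:                           # no existing pack fits: open a new one
            packs.append([sample])
            free.append(target_len - len(sample))
    return packs
```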
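A second sketch: the block-diagonal document mask, plus the cumulative-seqlen array that varlen attention kernels consume in place of a dense mask (PyTorch, illustrative helper names):

```python
import torch

def document_mask(doc_lens: list[int]) -> torch.Tensor:
    """Boolean mask where position i may attend to position j only when both
    belong to the same packed document (a decoder would intersect this with
    a causal mask)."""
    doc_ids = torch.repeat_interleave(torch.arange(len(doc_lens)),
                                      torch.tensor(doc_lens))
    return doc_ids[:, None] == doc_ids[None, :]

def cu_seqlens(doc_lens: list[int]) -> torch.Tensor:
    """Cumulative document boundaries [0, l0, l0+l1, ...] in the int32 form
    that FlashAttention-style varlen kernels take instead of a dense mask."""
    return torch.tensor([0] + doc_lens).cumsum(0, dtype=torch.int32)

# document_mask([3, 2]) -> 5x5 mask with a 3x3 block and a 2x2 block;
# cu_seqlens([3, 2])    -> tensor([0, 3, 5], dtype=torch.int32)
```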
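And a third sketch of the async overlap itself, using a CPU producer thread and a bounded queue (greedy single-pass packing for brevity; all names are illustrative, not the source's API):

```python
import queue
import threading
from typing import Iterator

def async_packed_batches(sample_stream: Iterator[list[int]],
                         target_len: int,
                         prefetch: int = 2) -> Iterator[list[list[int]]]:
    """Pack on a CPU thread and hand finished packs to the training loop through
    a bounded queue, so packing for step k+1 overlaps with GPU compute for step k."""
    out: queue.Queue = queue.Queue(maxsize=prefetch)
    DONE = object()

    def producer() -> None:
        pack: list[list[int]] = []
        used = 0
        for sample in sample_stream:
            if used + len(sample) > target_len and pack:
                out.put(pack)        # blocks when the GPU side falls behind
                pack, used = [], 0
            pack.append(sample)
            used += len(sample)
        if pack:
            out.put(pack)
        out.put(DONE)

    threading.Thread(target=producer, daemon=True).start()
    while (pack := out.get()) is not DONE:
        yield pack                   # consumer runs the GPU step on this pack
```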
Why it's 4.7× (on the skewed dataset)¶
The source's Figure 5 reports up to a 4.7× gain in effective token throughput on the most sequence-length-skewed dataset, on both A100 and H200 GPUs. The win decomposes into:
- Elimination of padding waste: compute time is dominated by useful tokens, not pad tokens.
- Shape consistency across workers: packed sequences are fixed-length, so no straggler-induced sync stalls under FSDP.
- CPU/GPU overlap: little or no GPU idle time waiting for the next packed batch.
The throughput improvement depends on dataset skew: distributions with long tails and many short samples benefit most, while uniformly long datasets see a smaller win.
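A back-of-envelope illustration of the padding-waste component alone (hypothetical lengths, not figures from the source):

```python
lengths = [512, 600, 700, 8192]        # one long-tail straggler in the batch
computed = len(lengths) * max(lengths) # pad-to-longest: tokens actually computed
useful = sum(lengths)
print(useful / computed)               # ~0.31, i.e. ~3.3x headroom from padding alone
```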