Loss masking assistant tokens¶
Definition¶
Loss masking in SFT (supervised fine-tuning) is the practice of explicitly marking which tokens in a training example contribute to the loss and which don't. For instruction-following, chat, multi-turn dialogue, and Chain-of-Thought training, only the assistant tokens — the target completions — should be optimised. Prompt tokens, user turns, system messages, and chat-template scaffolding should be ignored by the loss function.
Without explicit loss masking, the model is trained on the prompt as well as the completion, which degrades quality by teaching the model to reproduce arbitrary user-provided text rather than to respond to it.
First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix.
The trap HF chat templates don't close¶
Netflix's framing:
"Hugging Face chat templates serialize conversations, but don't specify what to train on versus ignore. The pipeline must apply explicit loss masking so only assistant tokens are optimized; otherwise the model learns from prompts and other non-target text, degrading quality." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)
HF chat templates give you the serialised byte sequence — e.g. `<|user|>What is X?<|assistant|>X is ...<|end|>` — but the template itself has no opinion on which of those token positions should contribute to the loss. That's the training pipeline's job.
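To make the gap concrete, a minimal sketch (the checkpoint name is illustrative; any chat model's tokenizer behaves the same way):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any model with a chat template works the same way.
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "user", "content": "What is X?"},
    {"role": "assistant", "content": "X is ..."},
]

# The template hands back one flat string (or, with tokenize=True, one flat
# list of token IDs). Nothing in the output marks targets versus context.
print(tok.apply_chat_template(messages, tokenize=False))
```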
How the mask is constructed¶
Conceptually, every non-assistant position gets an ignore label, so the loss is computed only over assistant tokens; in PyTorch (and hence in HF Trainer) the convention is label -100, which CrossEntropyLoss skips by default.
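A minimal sketch of that convention, assuming the assistant spans are already known as (start, end) token offsets (the function name is illustrative; finding the spans is the hard part, covered next):

```python
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips targets with this value

def mask_labels(input_ids: list[int],
                assistant_spans: list[tuple[int, int]]) -> list[int]:
    """Labels equal input_ids inside assistant spans, IGNORE_INDEX elsewhere."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        # Only these positions contribute to the cross-entropy loss.
        labels[start:end] = input_ids[start:end]
    return labels
```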
Practically, the pipeline needs to know where the assistant span begins and ends in the tokenised sequence. This typically requires either:
- Template-aware tokenisation: the chat template and the tokenisation code collaborate to record assistant-span offsets (see the sketch after this list).
- Generation-marker tokens: sentinel token IDs injected at template-application time that let the training pipeline recover assistant boundaries after tokenisation.
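Recent versions of Hugging Face transformers ship a native form of the first mechanism: a chat template may wrap assistant content in `{% generation %} ... {% endgeneration %}`, and `apply_chat_template(..., return_assistant_tokens_mask=True)` uses those annotations to return a per-token 0/1 mask. A sketch with a deliberately toy template (the template string and role tags are made up for illustration; stock templates that lack the markers return an all-zero mask):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative base tokenizer

# Toy template: {% generation %} brackets the span the loss should cover.
tok.chat_template = (
    "{% for m in messages %}"
    "{% if m['role'] == 'assistant' %}"
    "<|assistant|>{% generation %}{{ m['content'] }}{% endgeneration %}"
    "{% else %}<|{{ m['role'] }}|>{{ m['content'] }}{% endif %}"
    "{% endfor %}"
)

messages = [
    {"role": "user", "content": "What is X?"},
    {"role": "assistant", "content": "X is ..."},
]

out = tok.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)

# assistant_masks is 1 on assistant tokens, 0 elsewhere; combine it with
# the -100 ignore-index convention from the sketch above.
labels = [
    tid if m else -100
    for tid, m in zip(out["input_ids"], out["assistant_masks"])
]
```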
Netflix's framework uses a thin compatibility layer on top of Hugging Face `AutoTokenizer` — `BaseHFModelTokenizer` — that "handles post-training needs — setting padding tokens, injecting generation markers to support loss masking, and managing special tokens / semantic IDs — while ensuring the byte-level tokenization path matches production." This closes both the loss-masking and the tokenizer-skew gaps at the same layer.
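`BaseHFModelTokenizer` itself isn't public, so the following is only a guess at the shape of such a layer, stitched from the responsibilities the quote lists; the class name, method, and behaviour are all assumptions:

```python
from transformers import AutoTokenizer

class SFTTokenizer:
    """Hypothetical compatibility layer (not Netflix's BaseHFModelTokenizer).

    One object owns padding setup and assistant-mask recovery, while
    delegating byte-level tokenisation to the same HF checkpoint that
    production serves, so training and inference cannot drift apart.
    """

    def __init__(self, model_name: str):
        self.tok = AutoTokenizer.from_pretrained(model_name)
        # Post-training need: many base checkpoints ship without a pad token.
        if self.tok.pad_token is None:
            self.tok.pad_token = self.tok.eos_token

    def encode_for_sft(self, messages: list[dict]) -> dict:
        # Assumes the chat template carries {% generation %} markers
        # (see the previous sketch), so assistant spans survive tokenisation.
        out = self.tok.apply_chat_template(
            messages,
            tokenize=True,
            return_dict=True,
            return_assistant_tokens_mask=True,
        )
        labels = [
            tid if m else -100  # -100: position ignored by the loss
            for tid, m in zip(out["input_ids"], out["assistant_masks"])
        ]
        return {"input_ids": out["input_ids"], "labels": labels}
```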
Why it matters more than it looks¶
For production-grade LLM adaptation — instruction following, multi-turn dialogue, Chain-of-Thought — precisely controlling which tokens contribute to the loss is load-bearing. It's one of the three "getting the data right" pitfalls Netflix specifically enumerates:
- Loss masking (this page).
- Variable sequence length → async sequence packing.
- Document masking across packed samples → async sequence packing.
All three live in the dataloader, all three are easy to get wrong by omission, and all three silently degrade model quality if unaddressed.