Netflix Post-Training Framework

The Netflix Post-Training Framework is an internal Netflix AI Platform library that hides the distributed-systems complexity of adapting open-weight LLMs (Qwen3, Gemma3, Qwen3 MoE, GPT-OSS) to Netflix-specific post-training objectives — member personalisation, recommendations, search. It sits above Mako (Netflix's ML compute platform, which provisions AWS GPUs) and wraps PyTorch, Ray, and vLLM "largely out of the box." First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix.

Role in Netflix's ML stack

┌──────────────────────────────────────────────┐
│  Post-Training Framework (this page)         │  ← library: Data/Model/Compute/Workflow
├──────────────────────────────────────────────┤
│  PyTorch   +   Ray   +   vLLM   +   Verl     │  ← OSS, largely out-of-the-box
├──────────────────────────────────────────────┤
│  Mako (Netflix ML compute platform)          │  ← GPU provisioning on AWS
├──────────────────────────────────────────────┤
│  AWS GPU instances                           │
└──────────────────────────────────────────────┘

Users express jobs as configuration files that select one of the four standardised recipes that ship with the framework and plug in task-specific components.

Four-pillar component model

The framework's surface area is deliberately factored along four dimensions — the classical three ML-systems pillars (Data / Model / Compute) plus a fourth (Workflow) added specifically to handle multi-stage on-policy RL execution.

Data

  • Dataset abstractions for SFT, reward modeling, and RL.
  • High-throughput streaming from cloud + disk for datasets that exceed local storage.
  • Asynchronous on-the-fly sequence packing that overlaps CPU packing with GPU execution — up to 4.7× effective token throughput on the most sequence-length-skewed internal dataset (see patterns/on-the-fly-async-sequence-packing).
  • Explicit loss masking so only assistant tokens contribute to the loss — something HF chat templates don't encode (a sketch follows this list).
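
A minimal sketch of the loss-masking mechanics, assuming the data pipeline can mark which token positions belong to assistant turns (the `assistant_mask` input and all names here are illustrative, not the framework's API):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # PyTorch cross_entropy's default ignore_index

def mask_non_assistant_tokens(input_ids, assistant_mask):
    """Return labels where only assistant tokens contribute to the loss.

    input_ids:      LongTensor [batch, seq_len]
    assistant_mask: BoolTensor [batch, seq_len], True on assistant tokens
    """
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX
    return labels

# Toy example: a 1-sample batch where only the last 3 tokens are the
# assistant's reply; system/user tokens are masked out of the loss.
input_ids = torch.tensor([[101, 7592, 2088, 102, 42, 43, 44]])
assistant_mask = torch.tensor([[False, False, False, False, True, True, True]])
labels = mask_non_assistant_tokens(input_ids, assistant_mask)

# Next-token prediction: shift so position t predicts token t+1; ignore_index
# then drops every non-assistant position from the loss.
logits = torch.randn(1, 7, 32_000)  # [batch, seq_len, vocab]
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    labels[:, 1:].reshape(-1),
    ignore_index=IGNORE_INDEX,
)
```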

Model

  • Support for Qwen3, Gemma3, Qwen3 MoE, GPT-OSS (current supported families).
  • LoRA integrated into model definitions, not bolted on.
  • High-level sharding APIs so developers distribute across device meshes without writing low-level FSDP or tensor parallel code.
  • Internal optimised model definitions (not direct transformers classes) that load/save checkpoints in Hugging Face format. This is what enables FlexAttention, memory-efficient chunked cross-entropy, consistent MFU accounting, and uniform LoRA extensibility across families. (patterns/huggingface-checkpoint-compat-for-internal-optimized-model)
  • A unified module naming convention so components (attention, MLP, output heads) can be programmatically located and swapped across architectures (sketched below).
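
The post does not publish the convention itself; as an illustration of the idea, the sketch below assumes every architecture exposes its attention blocks under a shared trailing name, so tooling can locate and swap them (e.g. for LoRA wrapping) without per-family code:

```python
import torch.nn as nn

def find_modules(model: nn.Module, suffix: str):
    """Locate components by a shared naming convention, e.g. every
    attention block living at '<...>.attention' across model families."""
    return [(name, mod) for name, mod in model.named_modules()
            if name.endswith(suffix)]

def swap_module(model: nn.Module, name: str, new_module: nn.Module):
    """Replace the module at dotted path `name` with `new_module`."""
    parent_path, _, child = name.rpartition(".")
    parent = model.get_submodule(parent_path) if parent_path else model
    setattr(parent, child, new_module)

# Toy demo: blocks expose their attention at '<block>.attention'.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = nn.MultiheadAttention(64, 4)
        self.mlp = nn.Linear(64, 64)

model = nn.Sequential(Block(), Block())
for name, attn in find_modules(model, "attention"):
    swap_module(model, name, nn.Identity())  # stand-in for a LoRA wrapper
```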

Compute

  • Unified job submission interface: single node → hundreds of GPUs.
  • MFU monitoring that remains accurate under custom architectures and LoRA.
  • Comprehensive checkpointing — trained parameters, optimizer state, dataloader position, data-mixer state — for exact resumption after interruption (a minimal sketch follows this list).
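
A minimal sketch of that resumption contract for a single-process trainer (the dictionary schema and field names are assumptions, not the framework's format):

```python
import torch

def save_checkpoint(path, model, optimizer, dataloader_state, mixer_state, step):
    """Persist everything needed to resume exactly where training stopped:
    trained parameters, optimizer state, dataloader position, data-mixer state."""
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "dataloader": dataloader_state,  # e.g. shard index + offset
            "data_mixer": mixer_state,       # e.g. per-source sampling counters
        },
        path,
    )

def load_checkpoint(path, model, optimizer):
    """Restore all four pieces of state and return the resume point."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"], ckpt["dataloader"], ckpt["data_mixer"]
```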

Workflow

  • Supports SPMD SFT workflows unchanged (a sketch of the SPMD shape follows this list).
  • Extends to on-policy RL (GRPO-style) via a hybrid single-controller + SPMD execution model, integrating Verl's Ray-actor-lifecycle and GPU-resource-allocation backend rather than reinventing RL orchestration.
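
A minimal sketch of the pure-SPMD shape described in design decision 1 below: a thin Ray driver launches N identical actors, each running the complete training loop. Actor and method names are illustrative, and GPU resource annotations are elided for runnability:

```python
import ray

@ray.remote  # in practice @ray.remote(num_gpus=1) on a GPU cluster
class TrainWorker:
    """One of N identical actors; each runs the full SFT loop."""
    def __init__(self, rank: int, world_size: int):
        self.rank, self.world_size = rank, world_size

    def run(self, steps: int) -> int:
        for step in range(steps):
            # load shard -> forward/backward -> all-reduce grads -> step
            pass
        return self.rank

ray.init()
world_size = 4
workers = [TrainWorker.remote(r, world_size) for r in range(world_size)]
ray.get([w.run.remote(1000) for w in workers])  # thin driver: launch and wait
```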

Design decisions captured in the post

1. SFT-SPMD → RL-hybrid execution shift

When the framework was first built, the team chose a pure SPMD model: a "thin" Ray-actor driver launched N identical actors, each running the full training loop, and scaling meant launching more identical actors. With DeepSeek-R1 and GRPO-style on-policy RL in 2025, that assumption broke: the learning signal is sparse and delayed, and it depends on data generated by the current policy. RL therefore requires decomposing the job into distinct roles (Policy, Rollout Workers, Reward Model, Reference Model) and evolving the driver into an active controller that encodes the control plane — when to roll out, how to batch and score, when to trigger optimisation, and how to manage cluster resources across phases (a skeleton follows). (concepts/single-controller-rl-orchestration)
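The post names the roles but not their interfaces; the skeleton below sketches the control plane under those role names, with stubbed method bodies and resource annotations elided (everything beyond the role names is an assumption):

```python
import ray

@ray.remote
class RolloutWorker:
    """Generates samples from the current policy (in production, e.g. vLLM)."""
    def generate(self, prompts):
        return [p + " <completion>" for p in prompts]  # stub

@ray.remote
class RewardModel:
    """Scores completed rollouts."""
    def score(self, rollouts):
        return [float(len(r)) for r in rollouts]  # stub reward

@ray.remote
class ReferenceModel:
    """Frozen reference policy for the KL penalty term."""
    def logprobs(self, rollouts):
        return [0.0 for _ in rollouts]  # stub

@ray.remote
class Policy:
    """Owns the trainable weights; runs the optimisation step."""
    def optimize(self, rollouts, rewards, ref_logprobs):
        return 0.0  # stub loss

    def export_weights(self):
        return {}  # stub weight snapshot

def grpo_style_loop(prompts, num_iters=3):
    policy = Policy.remote()
    rollers = [RolloutWorker.remote() for _ in range(2)]
    rm, ref = RewardModel.remote(), ReferenceModel.remote()
    for _ in range(num_iters):
        # 1. Roll out: sample from the *current* policy (on-policy requirement).
        rollouts = [r for f in ray.get(
            [w.generate.remote(prompts) for w in rollers]) for r in f]
        # 2. Batch + score against reward and reference models.
        rewards = ray.get(rm.score.remote(rollouts))
        ref_lp = ray.get(ref.logprobs.remote(rollouts))
        # 3. Trigger optimisation, then refresh rollout-worker weights
        #    (the broadcast step is elided in this sketch).
        ray.get(policy.optimize.remote(rollouts, rewards, ref_lp))
        weights = ray.get(policy.export_weights.remote())  # noqa: F841

ray.init()
grpo_style_loop(["prompt one", "prompt two"])
```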

2. Hugging Face-centric, strategically

  • AutoTokenizer as the single source of truth. Early attempts to bind directly to SentencePiece/tiktoken produced silent training-serving tokenizer skew, because vLLM defaults to the HF AutoTokenizer — the differing token boundaries surfaced as "inexplicable quality regressions." The fix: BaseHFModelTokenizer, a thin compatibility layer on top of AutoTokenizer that injects padding tokens, generation markers for loss masking, and special tokens / semantic IDs (sketched after this list).
  • HF-format checkpoints even with internal optimised model definitions — avoids walled-garden friction, lets teams pull in new architectures quickly.
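
BaseHFModelTokenizer's interface is not public; the sketch below illustrates the thin-compat-layer idea over AutoTokenizer (the class name aside, the injected token strings and constructor arguments are hypothetical):

```python
from transformers import AutoTokenizer

class BaseHFModelTokenizerSketch:
    """Thin layer over HF AutoTokenizer: identical tokenization to vLLM
    serving, plus training-side extras injected on top."""
    def __init__(self, model_name: str, semantic_ids=()):
        self.tok = AutoTokenizer.from_pretrained(model_name)
        # Inject a padding token if the base tokenizer lacks one.
        if self.tok.pad_token is None:
            self.tok.add_special_tokens({"pad_token": "<|pad|>"})
        # Special tokens / semantic IDs (e.g. catalogue entity IDs), plus
        # any generation markers used downstream for loss masking.
        if semantic_ids:
            self.tok.add_tokens(list(semantic_ids), special_tokens=True)

    def encode(self, text: str):
        return self.tok(text)["input_ids"]
```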

3. Own the model impl, scale the porting with agents

Each new model family needs a bridge from the HF reference implementation to Netflix's internal definition. The scaling mechanism is a logit verifier: given random inputs, the internal model must match the HF logits within tolerance. Because the acceptance criterion is mechanically checkable, AI coding agents can iterate autonomously until the implementation is correct (a sketch follows). (patterns/logit-equivalence-as-agent-automation-gate)
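Because the gate is mechanical, it is easy to sketch (tolerances, shapes, and the internal model's call signature are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM

@torch.no_grad()
def verify_logit_equivalence(internal_model, hf_name: str, vocab_size: int,
                             atol=1e-3, rtol=1e-3,
                             batch=2, seq_len=128, trials=8):
    """Acceptance gate for a ported model: on random inputs, the internal
    implementation must match the HF reference logits within tolerance."""
    reference = AutoModelForCausalLM.from_pretrained(hf_name).eval()
    internal_model.eval()
    for _ in range(trials):
        input_ids = torch.randint(0, vocab_size, (batch, seq_len))
        ref_logits = reference(input_ids).logits
        got_logits = internal_model(input_ids)  # assumed to return raw logits
        torch.testing.assert_close(got_logits, ref_logits, atol=atol, rtol=rtol)
    return True
```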

4. Differential value via workload-specific perf wins

  • Vocab padding to multiples of 64: avoids the cuBLAS → CUTLASS kernel fallback for the LM head (non-multiple-of-64 vocab sizes triggered roughly 3× the layer execution time). The framework auto-pads. (patterns/vocab-pad-to-kernel-boundary)
  • >128K-vocabulary memory trap: logits have shape [batch, seq_len, vocab], which spikes peak memory. In-framework mitigations: drop ignored tokens before the LM-head projection, and chunk the logits/loss along the sequence dimension (a sketch follows this list).
  • Precision correctness for RL: rollout precision and policy precision must align — the post specifically calls this out as "subtle" for RL workloads.
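
Both mitigations are straightforward to sketch. The helper below rounds the vocabulary up to the next multiple of 64; the chunked loss never materialises the full [batch, seq_len, vocab] logits tensor — it drops ignored tokens first, then projects and computes cross-entropy chunk by chunk (here over the flattened token dimension for brevity; chunk size and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def pad_vocab(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab up so the LM-head GEMM stays on fast kernel paths."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

def chunked_lm_loss(hidden, lm_head_weight, labels, chunk=1024,
                    ignore_index=-100):
    """Cross-entropy without materialising [batch, seq, vocab] at once:
    drop ignored tokens before the projection, then chunk the rest."""
    hidden = hidden.reshape(-1, hidden.size(-1))  # [B*S, d]
    labels = labels.reshape(-1)                   # [B*S]
    keep = labels != ignore_index                 # drop ignored tokens early
    hidden, labels = hidden[keep], labels[keep]
    total, count = hidden.new_zeros(()), 0
    for i in range(0, hidden.size(0), chunk):
        logits = hidden[i:i + chunk] @ lm_head_weight.T  # [chunk, vocab]
        total = total + F.cross_entropy(logits, labels[i:i + chunk],
                                        reduction="sum")
        count += labels[i:i + chunk].numel()
    return total / max(count, 1)

assert pad_vocab(50_257) == 50_304  # e.g. GPT-2's vocab, rounded up
```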

5. Non-standard transformer workloads are first-class

Some internal Netflix models train on member-interaction event sequences rather than natural language, with bespoke RL loops that integrate with custom inference engines and optimise business-defined metrics. The framework accommodates these without fragmenting into one-off pipelines, preserving its performance, tracking, and fault-tolerance guarantees.

Scope / limits (per the post)

  • Only trains architectures Netflix has explicitly ported. A fallback Hugging Face backend is planned (similar to vLLM/SGLang/torchtitan patterns) so users can train directly on native transformers models for rapid exploration — at the cost of losing some framework optimisations.
  • No public benchmark of RL scaling numbers beyond the sequence-packing throughput figure.
  • No disclosure of which Netflix products have shipped on this framework.

Design lineage (credited)

"We're especially grateful to the teams and contributors behind Torchtune, Torchtitan, and Verl, whose reference implementations and design patterns informed many of our training framework choices."

The framework is thus an opinionated composition of:

  • Torchtune (post-training recipes)
  • Torchtitan (scalable training patterns)
  • Verl (RL-oriented distributed execution)
  • Ray (actor-based orchestration)
  • vLLM (inference and serving contract)
  • Mako (Netflix-owned compute substrate)

— plus Netflix's owned Data/Model/Compute/Workflow surface area with internal optimisations.

Source

  • sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix