
PATTERN Cited by 1 source

Thin library on top of OSS compute platform

Intent

Deliver ML-platform capabilities by building a thin, opinionated library on top of an off-the-shelf stack of open-source components (PyTorch + Ray + vLLM + Verl) sitting on a generic internal compute substrate — rather than building a bespoke internal-only ML platform that reinvents orchestration, storage, or serving. Concentrate engineering investment on differential-value surfaces (workload-specific performance, business-requirement integration) rather than platform plumbing.

First canonical wiki reference: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix — the explicit design philosophy Netflix articulates for its Post-Training Framework.

Problem

ML-platform teams face a recurring decision: how much of the stack to own end-to-end.

  • Own everything: bespoke cluster manager, bespoke job scheduler, bespoke training library, bespoke inference stack, bespoke tokenizer. Maximum control; maximum headcount; drifts from the OSS community's progress.
  • Own nothing: adopt a vendor platform (SageMaker, Vertex AI, or a similar managed IaaS/PaaS offering). Minimum control; no differential value possible.

Neither extreme suits a team whose value-add is specialised adaptation (e.g. Netflix's member-interaction-sequence training workloads), which needs both framework-level optimisations and the ability to adopt new models at community velocity.

Solution

A three-layer stack:

┌───────────────────────────────────────────────────┐
│ Domain library (this team's value-add)            │
│  - Data/Model/Compute/Workflow abstractions       │
│  - Internal performance optimisations             │
│  - Business-requirement integration               │
├───────────────────────────────────────────────────┤
│ OSS compute stack (unmodified or lightly wrapped) │
│  - PyTorch + Ray + vLLM + Verl + HuggingFace      │
├───────────────────────────────────────────────────┤
│ Internal compute substrate (generic)              │
│  - GPU provisioning                               │
│  - AWS / DC networking                            │
└───────────────────────────────────────────────────┘

Netflix's framing:

"At the base is Mako, Netflix's internal ML compute platform, which provisions GPUs on AWS. On top of Mako, we run robust open-source components — PyTorch, Ray, and vLLM — largely out of the box. Our post-training framework sits above these foundations as a library: it provides reusable utilities and standardized training recipes for common workflows such as Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Reinforcement Learning (RL), and Knowledge Distillation. Users typically express jobs as configuration files that select a recipe and plug in task-specific components." (Source: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix)

Differential value surfaces

The library concentrates engineering on surfaces where off-the-shelf is weakest — explicitly stated by Netflix:

"A post-training framework is only worth owning if it delivers clear value beyond assembling OSS components. We build on open source for velocity, but we invest heavily where off-the-shelf tools tend to be weakest: performance tuned to our workload characteristics, and integration with Netflix-specific model and business requirements."

Concrete examples from the source:

  • Performance wins tuned to workload: async sequence packing (up to 4.7× throughput on the most skewed dataset) and vocab padding to kernel boundaries (avoiding a 3× LM-head slowdown); see the sketch after this list.
  • Non-standard transformer support: member-interaction-sequence models, custom output projection heads, and bespoke RL loops integrated with custom inference engines.
  • Consistent cross-cutting abstractions: MFU (model FLOPs utilisation) reporting that stays accurate under custom architectures and LoRA, and uniform LoRA extensibility across model families.
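
Below is an illustrative sketch of the two workload-tuned optimisations from the first bullet. The kernel boundary (a multiple of 128) and the greedy first-fit packing strategy are assumptions for illustration; the source only states the observed speedups.

```python
# Illustrative sketches only: boundary size and packing strategy are assumptions,
# not the framework's actual values or algorithm.
import torch


def pad_vocab(weight: torch.Tensor, multiple: int = 128) -> torch.Tensor:
    """Pad LM-head/embedding rows up to a kernel-friendly multiple."""
    vocab, hidden = weight.shape
    padded = -(-vocab // multiple) * multiple      # ceil to the next multiple
    if padded == vocab:
        return weight
    return torch.cat([weight, weight.new_zeros(padded - vocab, hidden)], dim=0)


def pack_sequences(lengths: list[int], budget: int = 4096) -> list[list[int]]:
    """Greedily pack variable-length sequences into bins of <= budget tokens,
    so skewed length distributions waste less compute on padding."""
    bins: list[list[int]] = []
    room: list[int] = []
    for idx, n in sorted(enumerate(lengths), key=lambda x: -x[1]):
        for b, space in enumerate(room):
            if n <= space:
                bins[b].append(idx)
                room[b] -= n
                break
        else:
            bins.append([idx])
            room.append(budget - n)
    return bins


print(pad_vocab(torch.randn(50257, 4096)).shape)    # torch.Size([50304, 4096])
print(pack_sequences([4000, 100, 2000, 1500, 50]))  # [[0, 4], [2, 3, 1]]
```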

Delegation pattern: use OSS for commodity layers

Where the OSS community has converged on good-enough abstractions, use them unchanged:

  • Ray for distributed workflow orchestration / actor lifecycle.
  • PyTorch for model definition and distributed collectives.
  • vLLM for inference (and as the tokenizer contract).
  • Verl for RL-specific distributed orchestration, layered on top of Ray.
  • Hugging Face AutoTokenizer as the single source of truth for tokenization.
  • Hugging Face checkpoint format for interchange (patterns/huggingface-checkpoint-compat-for-internal-optimized-model).

Integrate rather than rewrite. When Verl's abstractions fit the RL-orchestration problem, use them; don't invent Netflix's own.
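
As a hedged illustration of that delegation, the sketch below leans on Ray for actor lifecycle and on Hugging Face AutoTokenizer as the tokenization contract; TrainerWorker and the checkpoint name are placeholders, not the framework's API.

```python
# Sketch: delegate commodity layers to OSS. Ray owns actor lifecycle,
# AutoTokenizer is the single tokenization contract shared with serving.
import ray
from transformers import AutoTokenizer


@ray.remote  # in practice this would request GPUs, e.g. num_gpus=1
class TrainerWorker:
    def __init__(self, checkpoint: str):
        # The same tokenizer the serving stack (vLLM) loads from the
        # checkpoint, so training and inference never diverge.
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def tokenize(self, text: str) -> list[int]:
        return self.tokenizer(text)["input_ids"]


if __name__ == "__main__":
    ray.init()
    worker = TrainerWorker.remote("my-org/base-model")  # placeholder checkpoint
    print(ray.get(worker.tokenize.remote("hello")))
```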

Applicability

  • ✅ Internal platform teams with differentiated workloads requiring framework-level performance tuning.
  • ✅ Teams whose value-add is in workload-specific optimisation or business-requirement integration, not in generic orchestration.
  • ✅ Ecosystems where OSS has converged on production-grade abstractions (Ray, PyTorch, vLLM) and is moving faster than any internal-only equivalent.
  • ❌ Teams whose value-add IS the orchestration layer (cloud vendors building ML PaaS products).
  • ❌ Use cases so thin there's no differential value to extract beyond the OSS baseline — use the OSS directly.

Trade-offs

  • Benefit: engineering is concentrated on the differential-value surface. Cost: the library depends on OSS API stability.
  • Benefit: the team moves with community velocity on new models and features. Cost: it must keep up with OSS version churn.
  • Benefit: library users get both OSS-ecosystem portability and internal optimisations. Cost: users must understand both the library API and, sometimes, the OSS layer beneath.
  • Benefit: a small framework team can maintain it. Cost: bug fixes may require upstream OSS contributions.

Known uses

  • Netflix Post-Training Framework: a thin library over Mako (internal GPU provisioning on AWS) and PyTorch + Ray + vLLM + Verl (sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix).