
Netflix Titus

Titus is Netflix's centralized container-management platform and its canonical compute backend, including for Metaflow jobs (targeted with the @titus decorator). Titus was open-sourced in 2018 (netflix/titus).

Relationship to Kubernetes

"Under the hood, Titus is powered by Kubernetes, but it provides a thick layer of enhancements over off-the-shelf Kubernetes, to make it more observable, secure, scalable, and cost-efficient" (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix). The named enhancement axes (from post-linked Netflix Tech Blog content) are:

  • Observability — "Kubernetes and kernel panics" framing.
  • Security — container security with Linux user namespaces (see concepts/linux-namespaces).
  • Scalability — auto-scaling production services on Titus.
  • Cost-efficiency — predictive CPU isolation of containers.

How Metaflow uses it

Open-source Metaflow users target AWS Batch or Kubernetes as the compute backend. Inside Netflix the same role is played by Titus through the @titus decorator: "Metaflow tasks benefit from these battle-hardened features out of the box, with no in-depth technical knowledge or engineering required from the ML engineers or data scientist end."

Dependency management — packaging and reproducibly rehydrating each Metaflow step's execution environment in a remote pod — is handled by Metaflow's @conda / @pypi decorators, plus Netflix's portable-execution-environments extension, which fetches environments at execution time rather than deploy time (see concepts/portable-execution-environment). Developers therefore don't have to hand-manage Docker images.
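The fetch-at-execution-time idea can be illustrated with a toy sketch: a task resolves and caches its declared dependency set when it starts, rather than having that environment baked into an image at deploy time. All names here (`hydrate_environment`, the spec shape) are illustrative, not Metaflow's or Netflix's actual API.

```python
# Toy sketch: resolve a step's pinned environment on first use, assuming a
# spec of the form {"name": ..., "packages": {pkg: version, ...}}.
def hydrate_environment(step_spec, env_cache):
    """Return the concrete package set for a step, fetching it on first use."""
    key = (step_spec["name"], tuple(sorted(step_spec["packages"].items())))
    if key not in env_cache:
        # In the real system this would fetch a pre-solved environment from
        # shared storage at task start; here we just materialize the pin list.
        env_cache[key] = [
            f"{pkg}=={ver}" for pkg, ver in sorted(step_spec["packages"].items())
        ]
    return env_cache[key]
```

The cache models why execution-time fetching stays cheap: identical step environments hydrate once per host and are reused across task attempts.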

Relationship to Titus Gateway

The read-side API tier that fronts Titus for internal clients is Titus Gateway — its consistent-caching rebuild is covered separately.

Observability — per-container run-queue-latency monitoring

Titus hosts run Netflix's eBPF-based runq.latency monitor: attached to tp_btf/sched_wakeup + tp_btf/sched_switch, it derives per-task run-queue latency in-kernel, tags samples with cgroup ID, rate-limits per-cgroup-per-CPU, and emits per-container Atlas percentile timers + preempt-cause-tagged counters. The dual-metric output is explicitly designed to distinguish cross-cgroup noisy neighbors from a container's own CFS-quota throttling (see concepts/cpu-throttling-vs-noisy-neighbor). The team's eBPF-over-kernel-module rationale: "While implementing this with a kernel module was feasible, we leveraged eBPF for its safety and flexibility" (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf).
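The in-kernel derivation can be sketched in user-space pseudologic: record a timestamp at sched_wakeup, and at sched_switch compute how long the incoming task sat runnable, tagging the sample with its cgroup and noting which cgroup was previously on-CPU. This is a minimal sketch of the described logic, not Netflix's code; the class and event shapes are invented for illustration.

```python
from collections import defaultdict

class RunqMonitor:
    """Toy stand-in for the eBPF runq.latency programs."""

    def __init__(self, rate_limit_ns=10_000_000):
        self.wakeup_ts = {}                       # pid -> wakeup timestamp (ns)
        self.last_emit = {}                       # (cgroup, cpu) -> last sample ts
        self.rate_limit_ns = rate_limit_ns
        self.latency_samples = defaultdict(list)  # cgroup -> runq latencies (ns)
        self.preempt_counts = defaultdict(int)    # (cgroup, prev cgroup) -> count

    def on_sched_wakeup(self, pid, ts_ns):
        # Task became runnable: remember when it started waiting for a CPU.
        self.wakeup_ts[pid] = ts_ns

    def on_sched_switch(self, next_pid, next_cgroup, prev_cgroup, cpu, ts_ns):
        start = self.wakeup_ts.pop(next_pid, None)
        if start is None:
            return
        # Per-cgroup-per-CPU rate limiting, as the post describes.
        key = (next_cgroup, cpu)
        if ts_ns - self.last_emit.get(key, -self.rate_limit_ns) < self.rate_limit_ns:
            return
        self.last_emit[key] = ts_ns
        # Dual output: a latency sample (feeds the percentile timer) plus a
        # counter tagged with which cgroup was on-CPU while this task waited.
        self.latency_samples[next_cgroup].append(ts_ns - start)
        if prev_cgroup != next_cgroup:
            self.preempt_counts[(next_cgroup, prev_cgroup)] += 1
```

The cross-cgroup counter is what lets an operator separate "a neighbor held the CPU" from "this container's own quota throttled it".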

Per-workload TCP flow attribution — IPMan + FlowExporter

Titus hosts also run FlowExporter, Netflix's eBPF sidecar that attaches to TCP tracepoints and emits flow-log records on socket close (~5M records/sec fleet-wide). On Titus hosts, many container workloads share the host kernel, so FlowExporter needs per-socket workload identity resolution. IPManAgent, a per-host daemon, writes the container IP → workload-ID mapping into an eBPF map that FlowExporter's BPF programs read in-kernel during tracepoint handling (patterns/ebpf-map-for-local-attribution).
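The shared-map attribution pattern reduces to a key-value lookup at record time: a per-host daemon (IPManAgent's role) publishes container-IP → workload-ID entries, and the exporter resolves identity per socket. A minimal sketch, with a dict standing in for the eBPF map and all function names invented:

```python
# Stand-in for the eBPF map that IPManAgent writes and FlowExporter's BPF
# programs read in-kernel during tracepoint handling.
ip_to_workload = {}

def ipman_assign(container_ip, workload_id):
    # Daemon side: called when a container IP is allocated to a workload.
    ip_to_workload[container_ip] = workload_id

def ipman_release(container_ip):
    # Daemon side: called when the container goes away.
    ip_to_workload.pop(container_ip, None)

def attribute_flow(local_ip, remote_ip):
    # Exporter side: tag the flow record with the owning workload at
    # socket-close time, falling back to "unknown" for unmapped IPs.
    return {
        "src_workload": ip_to_workload.get(local_ip, "unknown"),
        "local_ip": local_ip,
        "remote_ip": remote_ip,
    }
```

Doing the lookup in-kernel (in the real system) avoids shipping raw per-flow events to user space for a join after the fact.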

A second eBPF map — written by Titus on each intercepted connect syscall — keys on (local IPv4 address, local port) to disambiguate sockets created by Netflix's IPv6-to-IPv4 translation mechanism without NAT64 overhead. That mechanism intercepts connect and replaces the IPv6 socket with one using a shared host IPv4, which otherwise makes multiple containers' sockets indistinguishable by local IP alone.
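The two-map resolution order can be sketched as: sockets on the shared host IPv4 are looked up by (local IP, local port) in the connect-time map, while everything else falls back to the per-IP map. The address constant and function names below are hypothetical, for illustration only:

```python
SHARED_HOST_IPV4 = "100.64.0.1"  # hypothetical shared translation address

ip_map = {}       # container IP -> workload ID (written by the IP daemon)
ip_port_map = {}  # (local IP, local port) -> workload ID (written on connect)

def on_connect(workload_id, local_ip, local_port):
    # Titus's connect interception records the owner of each shared-IP socket.
    ip_port_map[(local_ip, local_port)] = workload_id

def resolve(local_ip, local_port):
    if local_ip == SHARED_HOST_IPV4:
        # Shared IPv4: local IP alone is ambiguous, so key on (IP, port).
        return ip_port_map.get((local_ip, local_port), "unknown")
    return ip_map.get(local_ip, "unknown")
```

The port is what restores per-workload distinguishability once the translation mechanism has collapsed many containers onto one local IPv4 address.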

Source: sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs.
