SYSTEM
Netflix Titus¶
Titus is Netflix's centralized container-management platform and
its canonical compute backend, including for
Metaflow jobs (targeted with the @titus
decorator). Titus was open-sourced in 2018
(netflix/titus).
Relationship to Kubernetes¶
"Under the hood, Titus is powered by Kubernetes, but it provides a thick layer of enhancements over off-the-shelf Kubernetes, to make it more observable, secure, scalable, and cost-efficient" (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix). The enhancement axes, each backed by a Netflix Tech Blog post linked from that source, are:
- Observability — "Kubernetes and kernel panics" framing.
- Security — container security with Linux user namespaces (see concepts/linux-namespaces).
- Scalability — auto-scaling production services on Titus.
- Cost-efficiency — predictive CPU isolation of containers.
How Metaflow uses it¶
Open-source Metaflow users target AWS Batch or Kubernetes as the
compute backend. Inside Netflix the same role is played by Titus
through the @titus decorator: "Metaflow tasks benefit from these
battle-hardened features out of the box, with no in-depth technical
knowledge or engineering required from the ML engineers or data
scientist end."
Dependency management (packaging + rehydrating each Metaflow step's
execution environment reproducibly in a remote pod) is handled by
Metaflow's @conda / @pypi decorators (and Netflix's portable
execution environments extension for fetching envs at execution
time rather than deploy time — see
concepts/portable-execution-environment) so developers don't
have to hand-manage Docker images.
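The deploy-time vs. execution-time distinction can be sketched in plain Python. This is a hypothetical toy (not Metaflow's or Netflix's implementation): environments are resolved and pinned once, stored under a content-addressed ID, and a remote pod later rehydrates the same pinned spec by ID instead of rebuilding a Docker image.

```python
# Hypothetical sketch, not Netflix's implementation: pin an environment spec
# at deploy time, rehydrate the identical spec by ID at execution time.
import hashlib
import json

# Stands in for a remote store of resolved environment specs.
ENV_STORE: dict[str, dict] = {}

def publish_env(packages: dict[str, str]) -> str:
    """Deploy time: resolve and pin packages, store the spec, return its ID."""
    spec = {"packages": dict(sorted(packages.items()))}
    env_id = hashlib.sha256(
        json.dumps(spec, sort_keys=True).encode()
    ).hexdigest()[:12]
    ENV_STORE[env_id] = spec
    return env_id

def rehydrate_env(env_id: str) -> dict:
    """Execution time: a remote pod fetches the pinned spec by ID and
    materializes the same environment, with no image rebuild."""
    return ENV_STORE[env_id]

env_id = publish_env({"pandas": "2.2.0", "scikit-learn": "1.4.0"})
assert rehydrate_env(env_id)["packages"]["pandas"] == "2.2.0"
```

Content-addressing the spec is what makes the rehydrated environment reproducible: the same ID always resolves to the same pinned package set.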
Relationship to Titus Gateway¶
The read-side API tier that fronts Titus for internal clients is Titus Gateway — its consistent-caching rebuild is covered separately.
Observability — per-container run-queue-latency monitoring¶
Titus hosts run Netflix's eBPF-based
runq.latency monitor: attached to
tp_btf/sched_wakeup + tp_btf/sched_switch, it derives per-task
run-queue latency in-kernel, tags
samples with cgroup ID, rate-limits per-
cgroup-per-CPU, and emits per-container
Atlas percentile timers + preempt-cause-
tagged counters. The dual-metric output is explicitly designed to
distinguish cross-cgroup noisy neighbors from
a container's own CFS-quota throttling (see
concepts/cpu-throttling-vs-noisy-neighbor). The team's eBPF-
over-kernel-module rationale: "While implementing this with a kernel
module was feasible, we leveraged eBPF for its safety and
flexibility"
(Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf).
Per-workload TCP flow attribution — IPMan + FlowExporter¶
Titus hosts also run FlowExporter, Netflix's eBPF sidecar that attaches to TCP tracepoints and emits flow-log records on socket close (~5M records/sec fleet-wide). On Titus hosts, many container workloads share the host kernel, so FlowExporter needs per-socket workload identity resolution. IPManAgent, a per-host daemon, writes the container IP → workload-ID mapping into an eBPF map that FlowExporter's BPF programs read in-kernel during tracepoint handling (patterns/ebpf-map-for-local-attribution).
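The map-based attribution pattern can be sketched as follows. Names here are hypothetical stand-ins: a Python dict plays the role of the shared eBPF map, one function plays IPManAgent's writer role, and another plays FlowExporter's in-kernel lookup at socket close.

```python
# Illustrative sketch of patterns/ebpf-map-for-local-attribution: a per-host
# daemon writes container-IP -> workload-ID entries into a shared map; the
# flow logger reads that map inline when a socket closes.
ip_to_workload: dict[str, str] = {}   # stands in for the shared eBPF map

def ipman_assign(container_ip: str, workload_id: str) -> None:
    # IPManAgent's role: written when an IP is assigned to a container.
    ip_to_workload[container_ip] = workload_id

def on_socket_close(local_ip: str, remote_ip: str) -> dict:
    # FlowExporter's role: lookup happens inline in the tracepoint handler,
    # not via a round-trip to a userspace service.
    return {
        "src_workload": ip_to_workload.get(local_ip, "unknown"),
        "local_ip": local_ip,
        "remote_ip": remote_ip,
    }

ipman_assign("100.64.0.7", "recommendations-svc")   # example IP/workload
record = on_socket_close("100.64.0.7", "100.64.9.1")
assert record["src_workload"] == "recommendations-svc"
```

Keeping the mapping in a kernel-resident map is what lets identity resolution keep up with millions of flow records per second: no per-record userspace hop is needed.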
A second eBPF map — written by Titus on each intercepted connect
syscall — keys on (local IPv4 address, local port) to disambiguate
sockets created by Netflix's IPv6-to-IPv4 translation mechanism
without NAT64 overhead. That mechanism intercepts connect and
replaces the IPv6 socket with one using a shared host IPv4, which
otherwise makes multiple containers' sockets indistinguishable by
local IP alone.
Source: sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs.
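The second map's role can be sketched the same way. This is a hypothetical model, not Titus code: once translation rewrites a container's connect() to use a shared host IPv4 address, local IP alone no longer identifies the container, so a (local IPv4, local port) entry recorded at connect time restores the distinction.

```python
# Hypothetical sketch of the connect-time map: (local IPv4, local port) ->
# workload, written by the intercepted connect syscall, read at attribution.
SHARED_HOST_IP = "10.0.0.5"                    # example shared host IPv4
connect_map: dict[tuple[str, int], str] = {}   # stands in for the eBPF map

def on_connect(local_port: int, workload_id: str) -> None:
    # Written by Titus's connect hook when the IPv6 socket is replaced
    # with one using the shared host IPv4.
    connect_map[(SHARED_HOST_IP, local_port)] = workload_id

def attribute(local_ip: str, local_port: int) -> str:
    # The port is what disambiguates sockets that all share the host IPv4.
    return connect_map.get((local_ip, local_port), "unknown")

on_connect(40001, "workload-a")
on_connect(40002, "workload-b")
assert attribute(SHARED_HOST_IP, 40001) == "workload-a"
assert attribute(SHARED_HOST_IP, 40002) == "workload-b"
```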
Seen in¶
- sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix — Titus as the compute backend for Metaflow at Netflix.
- sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf — Titus as the fleet where the eBPF run-queue monitor is deployed.
- sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs
— Titus as the container-host substrate for per-workload TCP flow
log attribution; IPMan + eBPF-map interaction;
connect-hook (IPv4, port) → workload disambiguation for Netflix's NAT64-free IPv6-to-IPv4 mechanism.
- Cross-reference only: sources/2022-12-02-highscalability-stuff-the-internet-says-on-scalability-for-december-2nd-2022 (Titus Gateway rebuild)
Related¶
- companies/netflix
- systems/metaflow · systems/titus-gateway · systems/kubernetes
- systems/ebpf · systems/netflix-runq-monitor · systems/netflix-atlas
- systems/netflix-flowexporter · systems/netflix-ipman
- concepts/linux-namespaces · concepts/container-escape
- concepts/linux-cgroup · concepts/run-queue-latency · concepts/noisy-neighbor
- concepts/workload-identity · concepts/ip-attribution
- patterns/ebpf-map-for-local-attribution