
Netflix Titus

Titus is Netflix's centralized container-management platform and its canonical compute backend, including for Metaflow jobs (targeted with the @titus decorator). Titus was open-sourced in 2018 (netflix/titus).

Relationship to Kubernetes

"Under the hood, Titus is powered by Kubernetes, but it provides a thick layer of enhancements over off-the-shelf Kubernetes, to make it more observable, secure, scalable, and cost-efficient" (Source: sources/2024-07-22-netflix-supporting-diverse-ml-systems-at-netflix). The named enhancement axes (from post-linked Netflix Tech Blog content) are:

  • Observability — "Kubernetes and kernel panics" framing.
  • Security — container security with Linux user namespaces (see concepts/linux-namespaces).
  • Scalability — auto-scaling production services on Titus.
  • Cost-efficiency — predictive CPU isolation of containers.

How Metaflow uses it

Open-source Metaflow users target AWS Batch or Kubernetes as the compute backend. Inside Netflix the same role is played by Titus through the @titus decorator: "Metaflow tasks benefit from these battle-hardened features out of the box, with no in-depth technical knowledge or engineering required from the ML engineers or data scientist end."

Dependency management — packaging and reproducibly rehydrating each Metaflow step's execution environment in a remote pod — is handled by Metaflow's @conda / @pypi decorators, plus Netflix's portable-execution-environments extension, which fetches environments at execution time rather than deploy time (see concepts/portable-execution-environment). Developers therefore don't have to hand-manage Docker images.
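The fetch-at-execution-time idea can be illustrated with a toy sketch: a task resolves and caches its declared dependency set when it starts, rather than having that environment baked into an image at deploy time. All names here (`hydrate_environment`, the spec shape) are illustrative, not Metaflow's or Netflix's actual API.

```python
# Toy sketch: resolve a step's pinned environment on first use, assuming a
# spec of the form {"name": ..., "packages": {pkg: version, ...}}.
def hydrate_environment(step_spec, env_cache):
    """Return the concrete package set for a step, fetching it on first use."""
    key = (step_spec["name"], tuple(sorted(step_spec["packages"].items())))
    if key not in env_cache:
        # In the real system this would fetch a pre-solved environment from
        # shared storage at task start; here we just materialize the pin list.
        env_cache[key] = [
            f"{pkg}=={ver}" for pkg, ver in sorted(step_spec["packages"].items())
        ]
    return env_cache[key]
```

The cache models why execution-time fetching stays cheap: identical step environments hydrate once per host and are reused across task attempts.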

Relationship to Titus Gateway

The read-side API tier that fronts Titus for internal clients is Titus Gateway — its consistent-caching rebuild is covered separately.

Observability — per-container run-queue-latency monitoring

Titus hosts run Netflix's eBPF-based runq.latency monitor: attached to tp_btf/sched_wakeup + tp_btf/sched_switch, it derives per-task run-queue latency in-kernel, tags samples with cgroup ID, rate-limits per-cgroup-per-CPU, and emits per-container Atlas percentile timers + preempt-cause-tagged counters. The dual-metric output is explicitly designed to distinguish cross-cgroup noisy neighbors from a container's own CFS-quota throttling (see concepts/cpu-throttling-vs-noisy-neighbor). The team's eBPF-over-kernel-module rationale: "While implementing this with a kernel module was feasible, we leveraged eBPF for its safety and flexibility" (Source: sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf).
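The in-kernel derivation can be sketched in user-space pseudologic: record a timestamp at sched_wakeup, and at sched_switch compute how long the incoming task sat runnable, tagging the sample with its cgroup and noting which cgroup was previously on-CPU. This is a minimal sketch of the described logic, not Netflix's code; the class and event shapes are invented for illustration.

```python
from collections import defaultdict

class RunqMonitor:
    """Toy stand-in for the eBPF runq.latency programs."""

    def __init__(self, rate_limit_ns=10_000_000):
        self.wakeup_ts = {}                       # pid -> wakeup timestamp (ns)
        self.last_emit = {}                       # (cgroup, cpu) -> last sample ts
        self.rate_limit_ns = rate_limit_ns
        self.latency_samples = defaultdict(list)  # cgroup -> runq latencies (ns)
        self.preempt_counts = defaultdict(int)    # (cgroup, prev cgroup) -> count

    def on_sched_wakeup(self, pid, ts_ns):
        # Task became runnable: remember when it started waiting for a CPU.
        self.wakeup_ts[pid] = ts_ns

    def on_sched_switch(self, next_pid, next_cgroup, prev_cgroup, cpu, ts_ns):
        start = self.wakeup_ts.pop(next_pid, None)
        if start is None:
            return
        # Per-cgroup-per-CPU rate limiting, as the post describes.
        key = (next_cgroup, cpu)
        if ts_ns - self.last_emit.get(key, -self.rate_limit_ns) < self.rate_limit_ns:
            return
        self.last_emit[key] = ts_ns
        # Dual output: a latency sample (feeds the percentile timer) plus a
        # counter tagged with which cgroup was on-CPU while this task waited.
        self.latency_samples[next_cgroup].append(ts_ns - start)
        if prev_cgroup != next_cgroup:
            self.preempt_counts[(next_cgroup, prev_cgroup)] += 1
```

The cross-cgroup counter is what lets an operator separate "a neighbor held the CPU" from "this container's own quota throttled it".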

Per-workload TCP flow attribution — IPMan + FlowExporter

Titus hosts also run FlowExporter, Netflix's eBPF sidecar that attaches to TCP tracepoints and emits flow-log records on socket close (~5M records/sec fleet-wide). On Titus hosts, many container workloads share the host kernel, so FlowExporter needs per-socket workload identity resolution. IPManAgent, a per-host daemon, writes the container IP → workload-ID mapping into an eBPF map that FlowExporter's BPF programs read in-kernel during tracepoint handling (patterns/ebpf-map-for-local-attribution).
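The shared-map attribution pattern reduces to a key-value lookup at record time: a per-host daemon (IPManAgent's role) publishes container-IP → workload-ID entries, and the exporter resolves identity per socket. A minimal sketch, with a dict standing in for the eBPF map and all function names invented:

```python
# Stand-in for the eBPF map that IPManAgent writes and FlowExporter's BPF
# programs read in-kernel during tracepoint handling.
ip_to_workload = {}

def ipman_assign(container_ip, workload_id):
    # Daemon side: called when a container IP is allocated to a workload.
    ip_to_workload[container_ip] = workload_id

def ipman_release(container_ip):
    # Daemon side: called when the container goes away.
    ip_to_workload.pop(container_ip, None)

def attribute_flow(local_ip, remote_ip):
    # Exporter side: tag the flow record with the owning workload at
    # socket-close time, falling back to "unknown" for unmapped IPs.
    return {
        "src_workload": ip_to_workload.get(local_ip, "unknown"),
        "local_ip": local_ip,
        "remote_ip": remote_ip,
    }
```

Doing the lookup in-kernel (in the real system) avoids shipping raw per-flow events to user space for a join after the fact.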

A second eBPF map — written by Titus on each intercepted connect syscall — keys on (local IPv4 address, local port) to disambiguate sockets created by Netflix's IPv6-to-IPv4 translation mechanism without NAT64 overhead. That mechanism intercepts connect and replaces the IPv6 socket with one using a shared host IPv4, which otherwise makes multiple containers' sockets indistinguishable by local IP alone.
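The two-map resolution order can be sketched as: sockets on the shared host IPv4 are looked up by (local IP, local port) in the connect-time map, while everything else falls back to the per-IP map. The address constant and function names below are hypothetical, for illustration only:

```python
SHARED_HOST_IPV4 = "100.64.0.1"  # hypothetical shared translation address

ip_map = {}       # container IP -> workload ID (written by the IP daemon)
ip_port_map = {}  # (local IP, local port) -> workload ID (written on connect)

def on_connect(workload_id, local_ip, local_port):
    # Titus's connect interception records the owner of each shared-IP socket.
    ip_port_map[(local_ip, local_port)] = workload_id

def resolve(local_ip, local_port):
    if local_ip == SHARED_HOST_IPV4:
        # Shared IPv4: local IP alone is ambiguous, so key on (IP, port).
        return ip_port_map.get((local_ip, local_port), "unknown")
    return ip_map.get(local_ip, "unknown")
```

The port is what restores per-workload distinguishability once the translation mechanism has collapsed many containers onto one local IPv4 address.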

Source: sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs.
