NETFLIX 2025-04-08

Netflix — How Netflix Accurately Attributes eBPF Flow Logs

Summary

Netflix describes how FlowCollector, the backend that consumes ~5M TCP flow-log records per second from per-host FlowExporter sidecars, was re-architected to eliminate IP misattribution in cloud eBPF flow logs. The previous design attributed both local and remote IPs from a discrete stream of IP-address assignment/unassignment events produced by Sonar; delayed or out-of-order events left roughly 40% of Zuul's reported dependencies misattributed.

The new design splits the problem. Each FlowExporter resolves its local IP to a workload identity at capture time, from Metatron certs (EC2 path) or from an eBPF map populated by IPMan (container path), so every reported flow carries (local_ip, local_workload, start_ts, end_ts). FlowCollector accumulates these tuples into a per-IP list of ownership time ranges and broadcasts them to peer nodes via Kafka; remote IPs are resolved by time-range lookup against the broadcast map, keyed on the flow's start timestamp. The headline reframing is from "which workload owns this IP right now?" (event-based) to "which workload owned this IP in this time window?" (heartbeat-based), a pattern that "handles transient issues gracefully — a few delayed or lost heartbeats do not lead to misattribution."

Cross-regional flows are resolved by forwarding them to the peer region's FlowCollector, which is located via a CIDR trie built from all Netflix VPC CIDRs. Non-workload IPs (AWS ELBs) fall back to the Sonar event stream because ELB reassignment is rare enough that misattribution is not a concern. FlowCollector runs on 30 c7i.2xlarge instances processing 5M flows/sec with only in-memory state (no persistent storage). The result was validated by comparing reconstructed Zuul dependencies against routing-config ground truth over a two-week window: zero misattribution, versus ~40% before.
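The time-range lookup described in the summary can be illustrated with a minimal Python sketch. It is not Netflix's implementation: the `Heartbeat` and `OwnershipMap` names, the merge rule (only overlapping ranges of the same workload are coalesced), and the choice to return `None` when ranges from different workloads overlap are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Heartbeat:
    """One reported flow, read as evidence that `workload` held `ip` during [start, end]."""
    ip: str
    workload: str
    start: float  # seconds since epoch (hypothetical unit)
    end: float


class OwnershipMap:
    """Per-IP list of (start, end, workload) ranges, kept sorted by start time."""

    def __init__(self) -> None:
        self._ranges: dict[str, list[tuple[float, float, str]]] = defaultdict(list)

    def observe(self, hb: Heartbeat) -> None:
        """Record a heartbeat, coalescing overlapping ranges of the same workload."""
        ranges = self._ranges[hb.ip]
        ranges.append((hb.start, hb.end, hb.workload))
        ranges.sort()
        merged = [ranges[0]]
        for start, end, workload in ranges[1:]:
            last_start, last_end, last_workload = merged[-1]
            if workload == last_workload and start <= last_end:
                merged[-1] = (last_start, max(last_end, end), workload)
            else:
                merged.append((start, end, workload))
        self._ranges[hb.ip] = merged

    def lookup(self, ip: str, flow_start: float) -> Optional[str]:
        """Return the owner at `flow_start`, or None (unattributed) if unknown or ambiguous."""
        owners = {
            workload
            for start, end, workload in self._ranges.get(ip, [])
            if start <= flow_start <= end
        }
        return owners.pop() if len(owners) == 1 else None


ownership = OwnershipMap()
ownership.observe(Heartbeat("10.0.0.5", "workload-x", 100.0, 160.0))
ownership.observe(Heartbeat("10.0.0.5", "workload-x", 150.0, 220.0))
ownership.observe(Heartbeat("10.0.0.5", "workload-y", 400.0, 460.0))
print(ownership.lookup("10.0.0.5", 200.0))  # inside workload-x's merged range
print(ownership.lookup("10.0.0.5", 300.0))  # in the gap between owners
```

A start timestamp that falls in the gap between two owners yields no answer rather than a guess, which mirrors the post's stated preference for unattributed flows over misattributed ones.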

Key takeaways

  1. Event-based IP attribution is architecturally broken at scale — heartbeat-based attribution with time ranges is the fix. The original system relied on "Sonar, an internal IP address tracking service that emits an event whenever an IP address in Netflix's AWS VPCs is assigned or unassigned to a workload. FlowCollector consumes a stream of IP address change events from Sonar and uses this information to attribute flow IP addresses in real-time." But "delays and failures are inevitable in distributed systems, which may delay IP address change events from reaching FlowCollector. For instance, an IP address may initially be assigned to workload X but later reassigned to workload Y. However, if the change event for this reassignment is delayed, FlowCollector will continue to assume that the IP address belongs to workload X, resulting in misattributed flows." The attempted mitigation (a 15-minute holdback) "reduced misattribution, did not eliminate it" and degraded freshness. The new design replaces discrete events with continuous heartbeats: every reported flow is itself a heartbeat of (local_ip, workload_id, t_start, t_end) ownership. "This new method achieves accurate attribution thanks to the continuous heartbeats, each associated with a reliable time range of IP address ownership. It handles transient issues gracefully — a few delayed or lost heartbeats do not lead to misattribution." Canonical wiki instance of discrete-event vs. heartbeat attribution as a distributed-systems primitive.
  2. Split local vs. remote attribution — attribute local on the host at capture time, attribute remote at the collector from broadcast heartbeats. The architectural asymmetry is load-bearing: "the local IP address belongs to the instance where FlowExporter captures the socket. Therefore, FlowExporter should determine the local workload identity from its environment and attribute the local IP address before sending the flow to FlowCollector." For EC2 workloads, Metatron provisions identity certs at boot; FlowExporter reads them from disk. For Titus containers, where one host runs many workloads, FlowExporter's eBPF programs look up the socket's local IP in an eBPF map that IPManAgent populates at container launch. Canonical wiki instance of sidecar-eBPF-flow-exporter + eBPF-map-for-local-attribution.
  3. IPv6-to-IPv4 NAT translation breaks naive local attribution — disambiguated via (IP, port) key. "To facilitate IPv6 migration, Netflix developed a mechanism that enables IPv6-only containers to communicate with IPv4 destinations without incurring NAT64 overhead. This mechanism intercepts connect syscalls and replaces the underlying socket with one that uses a shared IPv4 address assigned to the container host. This confuses FlowExporter because the kernel reports the same local IPv4 address for sockets created by different container workloads. To disambiguate, local port information is additionally required. We modified Titus to write a mapping of (local IPv4 address, local port) to the workload ID into an eBPF map whenever a connect syscall is intercepted."
  4. Per-node broadcast of ownership time ranges via Kafka, not distributed consensus. "The FlowCollector service cluster consists of many nodes. Every node must be capable of attributing arbitrary remote IP addresses and, therefore, requires knowledge of all workload IP addresses and their recent ownership records." "Since each flow is only sent to one FlowCollector node, each node must share the time ranges it learned from received flows with other nodes. We implemented a broadcasting mechanism using Kafka, where each node publishes learned time ranges to all other nodes. Although more efficient broadcasting implementations exist, the Kafka-based approach is simple and has worked well for us." Canonical wiki instance of Kafka-broadcast-for-shared-state as a deliberate choice over gossip / consensus for eventually consistent cluster-shared state.
  5. Time-range lookup with timestamp — accept unattributed, never misattribute. "FlowCollector can attribute remote IP addresses by looking them up in the populated map, which returns a list of time ranges. It then uses the flow's start timestamp to determine the corresponding time range and associated workload identity. If the start time does not fall within any time range, FlowCollector will retry after a delay, eventually giving up if the retry fails. Such failures may occur when flows are lost or broadcast messages are delayed. For our use cases, it is acceptable to leave a small percentage of flows unattributed, but any misattribution is unacceptable." Canonical wiki instance of accept-unattributed-flows as a design posture: correctness-vs-coverage tradeoff resolved in favour of correctness. Timestamps rely on Amazon Time Sync across the fleet — sub-millisecond clock accuracy makes wall-clock time ranges reliable as attribution keys.
  6. 1-minute disk buffer replaces the 15-minute holdback. "When FlowCollector receives a flow, it cannot attribute its remote IP address right away because it requires the latest observed time ranges for the remote IP address. Since FlowExporter reports flows in batches every minute, FlowCollector must wait until it receives the flow batch from the remote workload FlowExporter for the last minute, which may not have arrived yet. To address this, FlowCollector temporarily stores received flows on disk for one minute before attributing their remote IP addresses. This introduces a 1-minute delay, but it is much shorter than the 15-minute delay with the previous approach."
  7. 30 c7i.2xlarge instances handle 5M flows/sec with only in-memory state. "In addition to producing accurate attribution, the new method is also cost-effective thanks to its simplicity and in-memory lookups. Because the in-memory state can be quickly rebuilt when a FlowCollector node starts up, no persistent storage is required. With 30 c7i.2xlarge instances, we can process 5 million flows per second for the entire Netflix fleet." The disposable-state property falls out of the heartbeat design — every new flow restores ownership history for its IP, so a node can cold-start and rebuild its map from incoming + broadcast traffic within minutes without durability infrastructure.
  8. Cross-regional attribution uses a CIDR trie + cross-region forwarding, not global broadcast. "For simplicity, we have so far glossed over one topic: regionalization. Netflix's cloud microservices operate across multiple AWS regions. To optimize flow reporting and minimize cross-regional traffic, a FlowCollector cluster runs in each major region, and FlowExporter agents send flows to their corresponding regional FlowCollector. When FlowCollector receives a flow, its local IP address is guaranteed to be within the region. To minimize cross-region traffic, the broadcasting mechanism is limited to FlowCollector nodes within the same region." Cross-regional flows are forwarded: "the receiving FlowCollector node forwards them to nodes in the corresponding region. FlowCollector determines the region for a remote IP address by looking up a trie built from all Netflix VPC CIDRs. This approach is more efficient than broadcasting IP address time range updates across all regions, as only 1% of Netflix flows are cross-regional." Canonical wiki instance of patterns/regional-forwarding-on-cidr-trie + concepts/cross-regional-attribution-trie.
  9. Non-workload IPs (ELBs) still use Sonar. "For these flows, their remote IP addresses are associated with the ELBs, where we cannot run FlowExporter. Consequently, FlowCollector cannot determine their identities by simply observing the received flows. To attribute these remote IP addresses, we continue to use IP address change events from Sonar, which crawls AWS resources to detect changes in IP address assignments. Although this data stream may contain inaccurate timestamps and be delayed, misattribution is not a main concern since ELB IP address reassignment occurs very infrequently." Honest hybrid posture — keep the old attribution mechanism for the subset of the address space where it still works.
  10. Validated by Zuul — 40% misattributed dependencies → zero. "Netflix's cloud gateway, Zuul, served this purpose perfectly due to its extensive footprint (handling all cloud ingress traffic), its large number of downstream dependencies, and our ability to derive its dependencies from its routing configurations as the source of truth for comparison with flow logs. We found no misattribution for flows through Zuul over a two-week window. This provided strong confidence that the new attribution method has eliminated misattribution. In the previous approach, approximately 40% of Zuul's dependencies reported by the flow logs were misattributed." The 40% datum on the old system + zero on the new system is the load-bearing before/after evidence.
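The region lookup behind cross-region forwarding can be sketched as a longest-prefix match over a binary trie. This is a rough Python illustration under stated assumptions, not the FlowCollector code: the `CidrTrie` class, the `route` helper, the IPv4-only restriction, and the example CIDR-to-region assignments are all hypothetical.

```python
import ipaddress
from typing import Optional


class CidrTrie:
    """Binary trie over IPv4 prefix bits; lookup returns the longest matching prefix's region."""

    def __init__(self) -> None:
        self._root: dict = {}

    def insert(self, cidr: str, region: str) -> None:
        net = ipaddress.ip_network(cidr)
        # Walk only the prefix bits of the network address, creating nodes as needed.
        bits = format(int(net.network_address), "032b")[: net.prefixlen]
        node = self._root
        for bit in bits:
            node = node.setdefault(bit, {})
        node["region"] = region

    def lookup(self, ip: str) -> Optional[str]:
        bits = format(int(ipaddress.ip_address(ip)), "032b")
        node = self._root
        best = node.get("region")
        for bit in bits:
            node = node.get(bit)
            if node is None:
                break
            # Remember the deepest (most specific) region seen so far.
            best = node.get("region", best)
        return best


def route(trie: CidrTrie, local_region: str, remote_ip: str) -> str:
    """Decide where a flow's remote IP should be attributed (labels are illustrative)."""
    region = trie.lookup(remote_ip)
    if region is None:
        return "non-workload"  # e.g. handled by a separate path such as the Sonar fallback
    if region == local_region:
        return "attribute-locally"
    return f"forward:{region}"


trie = CidrTrie()
trie.insert("10.0.0.0/16", "us-east-1")  # hypothetical VPC CIDRs
trie.insert("10.1.0.0/16", "eu-west-1")
print(route(trie, "us-east-1", "10.0.3.4"))
print(route(trie, "us-east-1", "10.1.200.9"))
```

Because only the small cross-regional share of flows takes the forwarding branch, the trie lookup stays a cheap in-memory step on every flow while broadcast traffic remains confined to one region.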

Systems extracted

  • systems/netflix-flowexporter — per-host sidecar; eBPF + TCP tracepoints monitor socket state changes; on close emits (local_ip, local_workload, remote_ip, ports, t_start, t_end, stats). ~5M records/sec fleet-wide. Reports in 1-minute batches.
  • systems/netflix-flowcollector — regional backend service on 30 c7i.2xlarge; maintains in-memory per-IP list of (workload_id, start, end) time ranges; time-range lookup on the flow's start_ts to attribute remote IPs; Kafka broadcast to peer nodes; cross-region forwarding by CIDR-trie; 1-minute disk buffer for remote attribution; falls back to Sonar for ELB IPs.
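  • systems/netflix-ipman — container IP assignment service. The IPManAgent daemon on every container host writes the IP-address-to-workload-ID mapping into an eBPF map that FlowExporter's eBPF programs read. For sockets rewritten to the host's shared IPv4 address, Titus additionally writes a (local IPv4, local port) → workload ID mapping into an eBPF map on each intercepted connect syscall.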
  • systems/netflix-metatron — Netflix's workload identity service; provisions identity certificates to EC2 instances at boot, read by FlowExporter to resolve local workload identity.
  • systems/netflix-sonar — IP address tracking service; emits discrete assignment/unassignment events by crawling AWS resources. Formerly the sole IP→workload source; now retained only for the ELB subset where heartbeat-based attribution is impossible.
  • systems/netflix-zuul — Netflix cloud gateway; used as the ground-truth validation target (dependencies derivable from routing config; large downstream fan-out; 40% misattribution baseline).
  • systems/netflix-data-mesh — Netflix's data-movement and stream/batch processing platform; downstream of FlowCollector.
  • systems/netflix-titus — Netflix container platform; host of FlowExporter for container workloads; the IPMan and shared-IPv4 (IPv6-to-IPv4) interactions are all Titus-specific.
  • systems/ebpf · systems/kafka — substrate primitives.
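The container-path local attribution described for FlowExporter, IPMan, and Titus can be pictured with the following Python sketch. Plain dictionaries stand in for the eBPF maps, and the map names, addresses, workload IDs, and lookup order (port-qualified key first, then IP-only key) are assumptions for illustration; the post does not spell out the exact lookup sequence.

```python
from typing import Optional

# Stand-in for the map IPManAgent populates at container launch (hypothetical contents).
ip_to_workload: dict[str, str] = {
    "10.0.1.7": "titus-task-a",
}

# Stand-in for the map Titus writes when a connect syscall is intercepted and the
# socket is replaced with one using the host's shared IPv4 address (hypothetical contents).
ip_port_to_workload: dict[tuple[str, int], str] = {
    ("10.0.0.1", 41000): "titus-task-b",
    ("10.0.0.1", 41001): "titus-task-c",
}


def resolve_local_workload(local_ip: str, local_port: int) -> Optional[str]:
    """Resolve the workload that owns a socket's local endpoint.

    The (IP, port) key is tried first because several workloads can share
    the same host IPv4 address; the IP-only key covers ordinary container IPs.
    """
    by_port = ip_port_to_workload.get((local_ip, local_port))
    if by_port is not None:
        return by_port
    return ip_to_workload.get(local_ip)


print(resolve_local_workload("10.0.1.7", 52000))  # ordinary container IP
print(resolve_local_workload("10.0.0.1", 41001))  # shared host IPv4, disambiguated by port
```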

Concepts extracted

Patterns extracted

Operational numbers

Metric | Value
Flow records produced | ~5M/sec fleet-wide
FlowCollector fleet | 30 × c7i.2xlarge
FlowExporter batch interval | 1 min
Remote attribution disk buffer | 1 min
Previous holdback (event-based) | 15 min
Zuul dependency misattribution (old system) | ~40%
Zuul dependency misattribution (new system, 2-week validation) | 0
Cross-regional fraction of flows | ~1%
Persistent storage | none (in-memory; rebuilt from heartbeats)
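A back-of-the-envelope reading of these numbers, as a small Python calculation. It assumes flows are spread evenly across all 30 instances, which the post does not claim (clusters are regional, so real per-node load will vary); the derived figures are therefore rough orders of magnitude, not reported measurements.

```python
# Figures taken from the table above.
FLEET_FLOWS_PER_SEC = 5_000_000
COLLECTOR_INSTANCES = 30
BUFFER_SECONDS = 60
CROSS_REGION_PERCENT = 1

# Derived values, assuming a perfectly even spread across instances (an assumption).
per_instance_flows_per_sec = FLEET_FLOWS_PER_SEC / COLLECTOR_INSTANCES
buffered_flows_fleet = FLEET_FLOWS_PER_SEC * BUFFER_SECONDS
buffered_flows_per_instance = buffered_flows_fleet // COLLECTOR_INSTANCES
cross_region_flows_per_sec = FLEET_FLOWS_PER_SEC * CROSS_REGION_PERCENT // 100

print(f"per-instance rate:        {per_instance_flows_per_sec:,.0f} flows/sec")
print(f"flows held in 1-min buffer: {buffered_flows_fleet:,} fleet-wide")
print(f"                            {buffered_flows_per_instance:,} per instance")
print(f"cross-region forwards:    {cross_region_flows_per_sec:,} flows/sec")
```

Under that even-spread assumption, each instance handles on the order of 167 thousand flows per second and holds about ten million flows in its one-minute buffer at any moment.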

Caveats

  • Architecture + design post, not a full retrospective. Kafka topic shape, broadcast rate, per-IP time-range retention, details of the IPv6-to-IPv4 connect-syscall interception, and IPManAgent failure recovery are not disclosed.
  • Validation method is narrow. Zuul is exemplary because its dependencies are derivable from routing config, but the post does not name other validation targets or fleet-wide statistical bounds on misattribution; "zero misattribution" is stated for Zuul over two weeks, not proven globally.
  • Cross-region mechanism is a forwarding hop. Cross-regional flows incur one extra forward to the FlowCollector nodes in the remote IP's region; the post does not disclose its latency or failure mode.
  • Non-workload attribution still depends on Sonar. ELB IPs fall back to the old discrete-event pipeline; any non-ELB endpoint where FlowExporter cannot run (AWS-managed endpoints, external dependencies) is implicitly unattributed or Sonar-attributed.
  • Low clock skew is a load-bearing assumption. Sub-ms Amazon Time Sync makes wall-clock time ranges reliable; the post does not quantify failure modes when a host's clock drifts significantly.
  • "Heartbeat" is the wiki framing. The post uses "continuous heartbeats" once in the concluding paragraph; the rest of the post uses "flow". The heartbeat-based framing captures the structural shift from event streams to continuous beacons as an attribution primitive.

Source
