Netflix — How Netflix Accurately Attributes eBPF Flow Logs
Summary
Netflix describes how FlowCollector,
the backend that consumes ~5M TCP flow-log records per second from
per-host FlowExporter sidecars,
was re-architected to eliminate IP misattribution in cloud
eBPF flow logs. The previous design attributed both
local and remote IPs from a discrete event stream of IP-address
assignment/unassignment events produced by
Sonar; delayed or
out-of-order events left roughly 40% of the dependencies reported for
Zuul misattributed. The new design splits the problem: each
FlowExporter resolves its local IP to a
workload identity at capture time from
Metatron certs (EC2 path) or from an
eBPF map populated by
IPMan (container path), so every reported
flow carries (local_ip, local_workload, start_ts, end_ts).
FlowCollector accumulates these tuples into a per-IP list of
ownership time ranges and
broadcasts them to peer
nodes via Kafka; remote IPs are resolved by
time-range lookup against the broadcast map, keyed on the flow's
start timestamp. The headline reframing is from "which workload owns
this IP right now?" (event-based) to "which workload owned this IP in
this time window?" (heartbeat-based), a pattern that "handles transient
issues gracefully — a few delayed or lost heartbeats do not lead to
misattribution." Cross-regional flows are resolved by forwarding to
the peer region's FlowCollector via a CIDR trie built from all
Netflix VPC CIDRs; non-workload IPs (AWS ELBs) fall back to the
Sonar event stream because ELB reassignment is rare enough that
misattribution is not a concern. FlowCollector runs on 30
c7i.2xlarge instances processing 5M flows/sec with only in-memory
state (no persistent storage). Validated by comparing reconstructed
Zuul dependencies against routing-config ground truth over a two-week
window — zero misattribution, vs ~40% before.
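The time-range lookup described above can be sketched as follows. This is a minimal illustration, not FlowCollector's actual code: the `OwnershipRange` type, the `lookup_owner` function, and the sample IP and workload names are all hypothetical, and the sketch assumes each IP's ranges are kept sorted and non-overlapping.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OwnershipRange:
    start: float    # epoch seconds, inclusive
    end: float      # epoch seconds, inclusive
    workload: str


def lookup_owner(ranges: list[OwnershipRange], flow_start: float) -> Optional[str]:
    """Return the workload whose range contains flow_start, else None.

    Assumes `ranges` is sorted by start time and non-overlapping.
    """
    starts = [r.start for r in ranges]
    # Index of the last range that starts at or before flow_start.
    i = bisect_right(starts, flow_start) - 1
    if i >= 0 and flow_start <= ranges[i].end:
        return ranges[i].workload
    return None  # no evidence of ownership at that instant


# Hypothetical history for one IP, reassigned from workload-x to workload-y.
ownership = {
    "10.0.0.5": [
        OwnershipRange(100.0, 160.0, "workload-x"),
        OwnershipRange(200.0, 260.0, "workload-y"),
    ]
}
```

A flow starting at 130 resolves to `workload-x`, one starting at 205 resolves to `workload-y`, and one starting at 180 falls in the gap and stays unresolved rather than being guessed.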
Key takeaways
- Event-based IP attribution is architecturally broken at scale
— heartbeat-based attribution with time ranges is the fix. The
original system relied on "Sonar, an internal IP address tracking
service that emits an event whenever an IP address in Netflix's
AWS VPCs is assigned or unassigned to a workload. FlowCollector
consumes a stream of IP address change events from Sonar and uses
this information to attribute flow IP addresses in real-time."
But "delays and failures are inevitable in distributed systems,
which may delay IP address change events from reaching
FlowCollector. For instance, an IP address may initially be
assigned to workload X but later reassigned to workload Y.
However, if the change event for this reassignment is delayed,
FlowCollector will continue to assume that the IP address belongs
to workload X, resulting in misattributed flows." The attempted
mitigation (a 15-minute holdback) "reduced misattribution, did
not eliminate it" and degraded freshness. The new design replaces
discrete events with continuous heartbeats: every reported
flow is itself a heartbeat of
(local_ip, workload_id, t_start, t_end) ownership. "This new method achieves accurate attribution thanks to the continuous heartbeats, each associated with a reliable time range of IP address ownership. It handles transient issues gracefully — a few delayed or lost heartbeats do not lead to misattribution." Canonical wiki instance of discrete-event vs. heartbeat attribution as a distributed-systems primitive.
- Split local vs. remote attribution: attribute local on the host at capture time, attribute remote at the collector from broadcast heartbeats. The architectural asymmetry is load-bearing: "the local IP address belongs to the instance where FlowExporter captures the socket. Therefore, FlowExporter should determine the local workload identity from its environment and attribute the local IP address before sending the flow to FlowCollector." For EC2 workloads, Metatron provisions identity certs at boot; FlowExporter reads them from disk. For Titus containers, where one host runs many workloads, FlowExporter's eBPF programs look up the socket's local IP in an eBPF map that IPManAgent populates at container launch. Canonical wiki instance of sidecar-eBPF-flow-exporter + eBPF-map-for-local-attribution.
- IPv6-to-IPv4 translation breaks naive local attribution; disambiguated via an (IP, port) key. "To facilitate IPv6 migration, Netflix developed a mechanism that enables IPv6-only containers to communicate with IPv4 destinations without incurring NAT64 overhead. This mechanism intercepts connect syscalls and replaces the underlying socket with one that uses a shared IPv4 address assigned to the container host. This confuses FlowExporter because the kernel reports the same local IPv4 address for sockets created by different container workloads. To disambiguate, local port information is additionally required. We modified Titus to write a mapping of (local IPv4 address, local port) to the workload ID into an eBPF map whenever a connect syscall is intercepted."
- Per-node broadcast of ownership time ranges via Kafka, not distributed consensus. "The FlowCollector service cluster consists of many nodes. Every node must be capable of attributing arbitrary remote IP addresses and, therefore, requires knowledge of all workload IP addresses and their recent ownership records." "Since each flow is only sent to one FlowCollector node, each node must share the time ranges it learned from received flows with other nodes. We implemented a broadcasting mechanism using Kafka, where each node publishes learned time ranges to all other nodes. Although more efficient broadcasting implementations exist, the Kafka-based approach is simple and has worked well for us." Canonical wiki instance of Kafka-broadcast-for-shared-state as a deliberate choice over gossip / consensus for eventually-consistent cluster-shared state.
- Time-range lookup with timestamp — accept unattributed, never misattribute. "FlowCollector can attribute remote IP addresses by looking them up in the populated map, which returns a list of time ranges. It then uses the flow's start timestamp to determine the corresponding time range and associated workload identity. If the start time does not fall within any time range, FlowCollector will retry after a delay, eventually giving up if the retry fails. Such failures may occur when flows are lost or broadcast messages are delayed. For our use cases, it is acceptable to leave a small percentage of flows unattributed, but any misattribution is unacceptable." Canonical wiki instance of accept-unattributed-flows as a design posture: correctness-vs-coverage tradeoff resolved in favour of correctness. Timestamps rely on Amazon Time Sync across the fleet — sub-millisecond clock accuracy makes wall-clock time ranges reliable as attribution keys.
- 1-minute disk buffer replaces the 15-minute holdback. "When FlowCollector receives a flow, it cannot attribute its remote IP address right away because it requires the latest observed time ranges for the remote IP address. Since FlowExporter reports flows in batches every minute, FlowCollector must wait until it receives the flow batch from the remote workload FlowExporter for the last minute, which may not have arrived yet. To address this, FlowCollector temporarily stores received flows on disk for one minute before attributing their remote IP addresses. This introduces a 1-minute delay, but it is much shorter than the 15-minute delay with the previous approach."
- 30 c7i.2xlarge instances handle 5M flows/sec with only in-memory state. "In addition to producing accurate attribution, the new method is also cost-effective thanks to its simplicity and in-memory lookups. Because the in-memory state can be quickly rebuilt when a FlowCollector node starts up, no persistent storage is required. With 30 c7i.2xlarge instances, we can process 5 million flows per second for the entire Netflix fleet." The disposable-state property falls out of the heartbeat design: every new flow refreshes the ownership history for its local IP, so a node can cold-start and quickly rebuild its map from incoming and broadcast traffic without durability infrastructure.
- Cross-regional attribution uses a CIDR trie + cross-region forwarding, not global broadcast. "For simplicity, we have so far glossed over one topic: regionalization. Netflix's cloud microservices operate across multiple AWS regions. To optimize flow reporting and minimize cross-regional traffic, a FlowCollector cluster runs in each major region, and FlowExporter agents send flows to their corresponding regional FlowCollector. When FlowCollector receives a flow, its local IP address is guaranteed to be within the region. To minimize cross-region traffic, the broadcasting mechanism is limited to FlowCollector nodes within the same region." Cross-regional flows are forwarded: "the receiving FlowCollector node forwards them to nodes in the corresponding region. FlowCollector determines the region for a remote IP address by looking up a trie built from all Netflix VPC CIDRs. This approach is more efficient than broadcasting IP address time range updates across all regions, as only 1% of Netflix flows are cross-regional." Canonical wiki instance of patterns/regional-forwarding-on-cidr-trie + concepts/cross-regional-attribution-trie.
- Non-workload IPs (ELBs) still use Sonar. "For these flows, their remote IP addresses are associated with the ELBs, where we cannot run FlowExporter. Consequently, FlowCollector cannot determine their identities by simply observing the received flows. To attribute these remote IP addresses, we continue to use IP address change events from Sonar, which crawls AWS resources to detect changes in IP address assignments. Although this data stream may contain inaccurate timestamps and be delayed, misattribution is not a main concern since ELB IP address reassignment occurs very infrequently." Honest hybrid posture — keep the old attribution mechanism for the subset of the address space where it still works.
- Validated by Zuul — 40% misattributed dependencies → zero. "Netflix's cloud gateway, Zuul, served this purpose perfectly due to its extensive footprint (handling all cloud ingress traffic), its large number of downstream dependencies, and our ability to derive its dependencies from its routing configurations as the source of truth for comparison with flow logs. We found no misattribution for flows through Zuul over a two-week window. This provided strong confidence that the new attribution method has eliminated misattribution. In the previous approach, approximately 40% of Zuul's dependencies reported by the flow logs were misattributed." The 40% datum on the old system + zero on the new system is the load-bearing before/after evidence.
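The region lookup behind cross-regional forwarding can be illustrated with a small binary trie doing longest-prefix match. This is a sketch under stated assumptions: the `CidrTrie` class, the CIDR blocks, and the region assignments are hypothetical, and the post does not describe how Netflix's trie is actually implemented.

```python
import ipaddress
from typing import Optional


class CidrTrie:
    """Binary trie keyed on address bits; one root per IP version."""

    def __init__(self) -> None:
        self.roots: dict[int, dict] = {4: {}, 6: {}}

    def insert(self, cidr: str, region: str) -> None:
        net = ipaddress.ip_network(cidr)
        # Keep only the prefix bits of the network address.
        bits = format(int(net.network_address), f"0{net.max_prefixlen}b")[: net.prefixlen]
        node = self.roots[net.version]
        for bit in bits:
            node = node.setdefault(bit, {})
        node["region"] = region

    def lookup(self, ip: str) -> Optional[str]:
        addr = ipaddress.ip_address(ip)
        bits = format(int(addr), f"0{addr.max_prefixlen}b")
        node = self.roots[addr.version]
        best = node.get("region")
        for bit in bits:
            node = node.get(bit)
            if node is None:
                break
            best = node.get("region", best)  # remember the longest match so far
        return best


# Hypothetical VPC CIDRs; real allocations are not given in the post.
trie = CidrTrie()
trie.insert("10.0.0.0/8", "us-east-1")
trie.insert("10.64.0.0/10", "eu-west-1")
```

A collector in `us-east-1` would attribute `10.1.2.3` locally and forward a flow whose remote IP is `10.65.0.1` to the `eu-west-1` cluster; an address outside every CIDR returns `None` and is not a regional workload IP.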
Systems extracted
- systems/netflix-flowexporter — per-host sidecar; eBPF + TCP tracepoints monitor socket state changes; on close emits (local_ip, local_workload, remote_ip, ports, t_start, t_end, stats). ~5M records/sec fleet-wide. Reports in 1-minute batches.
- systems/netflix-flowcollector — regional backend service on 30 c7i.2xlarge; maintains in-memory per-IP list of (workload_id, start, end) time ranges; time-range lookup on the flow's start_ts to attribute remote IPs; Kafka broadcast to peer nodes; cross-region forwarding by CIDR-trie; 1-minute disk buffer for remote attribution; falls back to Sonar for ELB IPs.
- systems/netflix-ipman — container IP assignment service. IPManAgent daemon on every container host writes the IP-address-to-workload-ID mapping into an eBPF map that FlowExporter programs read; the (local IPv4, local port) → workload mapping for translated sockets is written by Titus when it intercepts a connect syscall.
- systems/netflix-metatron — Netflix's workload identity service; provisions identity certificates to EC2 instances at boot, read by FlowExporter to resolve local workload identity.
- systems/netflix-sonar — IP address tracking service; emits discrete assignment/unassignment events by crawling AWS resources. Formerly the sole IP→workload source; now retained only for the ELB subset where heartbeat-based attribution is impossible.
- systems/netflix-zuul — Netflix cloud gateway; used as the ground-truth validation target (dependencies derivable from routing config; large downstream fan-out; 40% misattribution baseline).
- systems/netflix-data-mesh — Netflix's data-movement and stream/batch processing platform; downstream of FlowCollector.
- systems/netflix-titus — Netflix container platform; host of FlowExporter for container workloads; the IPMan and IPv6-to-IPv4 shared-IP interactions are all Titus-specific.
- systems/ebpf · systems/kafka — substrate primitives.
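How the container-path maps fit together can be modelled in userspace with two dictionaries standing in for the eBPF maps. This is only a sketch: the function name, the map contents, and the choice to check the (IP, port) entry before the plain IP entry are assumptions for illustration, not details taken from the post.

```python
from typing import Optional

# Stand-in for the map IPManAgent populates: container IP -> workload ID.
ip_to_workload: dict[str, str] = {
    "100.64.1.10": "workload-a",
    "100.64.1.11": "workload-b",
}

# Stand-in for the map Titus populates on intercepted connect calls:
# (shared host IPv4, local port) -> workload ID.
ip_port_to_workload: dict[tuple[str, int], str] = {
    ("10.0.0.1", 40001): "workload-a",
    ("10.0.0.1", 40002): "workload-b",
}


def resolve_local_workload(local_ip: str, local_port: int) -> Optional[str]:
    """Resolve the local side of a socket to a workload ID, or None."""
    # Assumed precedence: a shared host address is ambiguous by IP alone,
    # so the (IP, port) entry is consulted first.
    owner = ip_port_to_workload.get((local_ip, local_port))
    if owner is not None:
        return owner
    return ip_to_workload.get(local_ip)
```

Two sockets on the shared host address `10.0.0.1` resolve to different workloads because their local ports differ, while a socket on a container's own address resolves by IP alone.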
Concepts extracted
- concepts/ip-attribution — the problem: map IP + time → workload.
- concepts/discrete-event-vs-heartbeat-attribution — the core design lesson: event-based attribution assumes a complete + ordered event stream that distributed systems do not deliver; heartbeat-based attribution fails gracefully on transient loss or delay.
- concepts/heartbeat-based-ownership — represent ownership as a per-IP append-only list of non-overlapping (workload, start, end) time ranges, refreshed by every new flow.
- concepts/workload-identity — machine-level identity provisioned by Metatron / IPMan; resolved locally at capture time.
- concepts/tcp-tracepoint — the kernel hook points FlowExporter attaches to; socket close fires the flow-log record.
- concepts/amazon-time-sync-attribution — the reliability of wall-clock time ranges as attribution keys depends on sub-ms clock sync across the fleet; Amazon Time Sync is the load-bearing substrate.
- concepts/cross-regional-attribution-trie — VPC CIDR trie as the fast dispatch primitive for forwarding cross-regional flows to the right regional cluster.
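The heartbeat-based ownership concept can be made concrete by folding each flow's (start, end, workload) tuple into a per-IP list of time ranges. The merge policy below is an assumption, since the post does not disclose one: ranges for the same workload are merged only when they overlap or touch, and `max_gap` is a hypothetical knob for bridging short gaps. The sketch also assumes heartbeats for different workloads never overlap on the same IP.

```python
Heartbeat = tuple[float, float, str]  # (start, end, workload)


def add_heartbeat(
    ranges: list[Heartbeat], heartbeat: Heartbeat, max_gap: float = 0.0
) -> list[Heartbeat]:
    """Return a new sorted range list with `heartbeat` folded in."""
    merged: list[Heartbeat] = []
    # Sorting by start time makes arrival order irrelevant, so a delayed
    # heartbeat lands in the right place instead of overwriting newer state.
    for start, end, workload in sorted(ranges + [heartbeat]):
        if merged and merged[-1][2] == workload and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), workload)
        else:
            merged.append((start, end, workload))
    return merged


history: list[Heartbeat] = []
history = add_heartbeat(history, (0.0, 50.0, "workload-x"))
history = add_heartbeat(history, (40.0, 110.0, "workload-x"))
history = add_heartbeat(history, (300.0, 350.0, "workload-y"))
# A delayed heartbeat from workload-x arrives after workload-y's.
history = add_heartbeat(history, (100.0, 170.0, "workload-x"))
```

The late `workload-x` heartbeat extends the earlier range and leaves the `workload-y` range untouched, which is the property that an event stream with delayed updates lacks.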
Patterns extracted
- patterns/heartbeat-derived-ip-ownership-map — the canonical pattern: per-IP accumulating time-range map built from flow heartbeats; broadcast to peers; time-range lookup at attribution time; disposable in-memory state.
- patterns/sidecar-ebpf-flow-exporter — per-host sidecar attached to TCP tracepoints emitting flow records with local identity pre-resolved.
- patterns/ebpf-map-for-local-attribution — eBPF map populated from userspace (IPManAgent, Titus connect-hook) read in-kernel by the BPF program during the socket event to resolve workload identity without a syscall or RPC.
- patterns/kafka-broadcast-for-shared-state — Kafka as a simple cluster-broadcast bus for eventually-consistent shared state when consensus is overkill.
- patterns/regional-forwarding-on-cidr-trie — CIDR trie over all regional address space + forward cross-region requests rather than globally broadcast state.
- patterns/accept-unattributed-flows — design posture: correctness over coverage; a small percentage unattributed is acceptable, any misattribution is not.
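The accept-unattributed posture reduces to a small control-flow rule: look up, retry after a delay, then give up without guessing. The sketch below assumes a hypothetical `lookup` callable and a single retry; the post does not state the retry count or the delay.

```python
from typing import Callable, Optional


def attribute_remote(
    lookup: Callable[[str, float], Optional[str]],
    remote_ip: str,
    flow_start: float,
    retries: int = 1,
    wait: Callable[[], None] = lambda: None,
) -> Optional[str]:
    """Return the owning workload, or None if it cannot be established."""
    for attempt in range(retries + 1):
        owner = lookup(remote_ip, flow_start)
        if owner is not None:
            return owner
        if attempt < retries:
            wait()  # e.g. sleep until more broadcast ranges may have arrived
    # Giving up leaves the flow unattributed; it never falls back to the
    # most recent known owner, which is how misattribution would creep in.
    return None
```

In production the `wait` callable would be a real delay; it is injectable here so the logic can be exercised without sleeping.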
Operational numbers
| Metric | Value |
|---|---|
| Flow records produced | ~5M/sec fleet-wide |
| FlowCollector fleet | 30 × c7i.2xlarge |
| FlowExporter batch interval | 1 min |
| Remote attribution disk buffer | 1 min |
| Previous holdback (event-based) | 15 min |
| Zuul dependency misattribution (old system) | ~40% |
| Zuul dependency misattribution (new system, 2-week validation) | 0 |
| Cross-regional fraction of flows | ~1% |
| Persistent storage | none (in-memory; rebuilt from heartbeats) |
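A back-of-the-envelope reading of these figures, assuming load is spread evenly across nodes (the post does not say that it is):

```python
FLOWS_PER_SEC = 5_000_000     # fleet-wide, from the post
NODES = 30                    # c7i.2xlarge instances, from the post
CROSS_REGION_PERCENT = 1      # ~1% of flows, from the post
BUFFER_SECONDS = 60           # 1-minute disk buffer, from the post

# Average ingest per node under the even-load assumption.
per_node = FLOWS_PER_SEC / NODES

# Flows per second that need a cross-region forward.
cross_region = FLOWS_PER_SEC * CROSS_REGION_PERCENT // 100

# Records each node holds in its buffer at any moment, on average.
buffered_per_node = FLOWS_PER_SEC * BUFFER_SECONDS // NODES

print(f"{per_node:,.0f} flows/sec per node")
print(f"{cross_region:,} cross-regional flows/sec")
print(f"{buffered_per_node:,} buffered flows per node")
```

Under that assumption each node ingests about 167 thousand flows per second, holds about 10 million flows in its one-minute buffer, and the fleet forwards about 50 thousand flows per second across regions.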
Caveats
- Architecture + design post, not a full retrospective. Kafka topic shape, broadcast rate, per-IP time-range retention, internals of the IPv6-to-IPv4 connect hook, and IPManAgent failure recovery are not disclosed.
- Validation method is narrow. Zuul is exemplary because its dependencies are derivable from routing config, but the post does not name other validation targets or fleet-wide statistical bounds on misattribution; "zero misattribution" is stated for Zuul over two weeks, not proven globally.
- Cross-region mechanism is a forwarding hop. Cross-regional flows incur one extra forward to the FlowCollector cluster in the remote IP's region; the post does not disclose its latency or failure mode.
- Non-workload attribution still depends on Sonar. ELB IPs fall back to the old discrete-event pipeline; any non-ELB endpoint where FlowExporter cannot run (AWS-managed endpoints, external dependencies) is implicitly unattributed or Sonar-attributed.
- Clock synchronization is a load-bearing assumption. Sub-ms Amazon Time Sync makes wall-clock time ranges reliable; the post does not quantify failure modes when a host's clock drifts significantly.
- "Heartbeat" is the wiki framing. The post uses "continuous heartbeats" once in the concluding paragraph; the rest of the post uses "flow". The heartbeat-based framing captures the structural shift from event streams to continuous beacons as an attribution primitive.
Source
- Original: https://netflixtechblog.com/how-netflix-accurately-attributes-ebpf-flow-logs-afe6d644a3bc
- Raw markdown: raw/netflix/2025-04-08-how-netflix-accurately-attributes-ebpf-flow-logs-4d38347a.md
- HN discussion: news.ycombinator.com/item?id=43624888 (160 points)
Related
- companies/netflix — Netflix TechBlog is Tier-1 on this wiki.
- sources/2024-09-11-netflix-noisy-neighbor-detection-with-ebpf — prior Netflix eBPF-at-Titus ingest; canonical scheduler-layer noisy-neighbor instance; shares the Titus + eBPF-map + atlas-emission stack.
- systems/ebpf · systems/netflix-titus · systems/kafka — substrate primitives.
- systems/netflix-flowexporter · systems/netflix-flowcollector · systems/netflix-ipman · systems/netflix-metatron · systems/netflix-sonar · systems/netflix-zuul · systems/netflix-data-mesh — new system pages established by this ingest.
- concepts/discrete-event-vs-heartbeat-attribution · concepts/heartbeat-based-ownership · concepts/ip-attribution · concepts/workload-identity · concepts/tcp-tracepoint · concepts/amazon-time-sync-attribution · concepts/cross-regional-attribution-trie — new concept pages.
- patterns/heartbeat-derived-ip-ownership-map · patterns/sidecar-ebpf-flow-exporter · patterns/ebpf-map-for-local-attribution · patterns/kafka-broadcast-for-shared-state · patterns/regional-forwarding-on-cidr-trie · patterns/accept-unattributed-flows — new pattern pages.