PATTERN Cited by 1 source
Heartbeat-derived IP ownership map¶
Maintain a per-IP list of non-overlapping
(workload_id, t_start, t_end) time ranges populated entirely
from data-plane heartbeats. Attribution of a flow's remote IP is a
time-range lookup keyed on the flow's start timestamp. No
event-stream from a control plane is required for the workload-IP
axis.
Structure¶
class IPOwnershipMap {
    Map<IPAddr, SortedList<TimeRange>> map;  // in-memory, per-node

    struct TimeRange {
        WorkloadId owner;
        Timestamp t_start;
        Timestamp t_end;  // extended by each heartbeat
    }
}
Invariants:
- Ranges per IP are sorted ascending by t_start and non-overlapping (an IP can only belong to one workload at a time).
- Every arriving heartbeat either extends the trailing range (same owner) or appends a new range (different owner).
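The two invariants can be sketched as a single update routine. This is a minimal illustration, not FlowCollector's actual code: the names `IPOwnershipMap.apply_heartbeat` and `ranges` are hypothetical, timestamps are plain floats, heartbeats are assumed to arrive roughly in time order per IP, and clamping a new owner's start to the previous range's end is an assumed way of preserving non-overlap.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TimeRange:
    owner: str
    t_start: float
    t_end: float


class IPOwnershipMap:
    """Per-IP list of ownership ranges, kept ascending by t_start."""

    def __init__(self):
        self._ranges = defaultdict(list)  # ip -> list[TimeRange]

    def apply_heartbeat(self, ip, owner, t_start, t_end):
        ranges = self._ranges[ip]
        if ranges and ranges[-1].owner == owner:
            # Same owner: extend the trailing range, never shrink it.
            ranges[-1].t_end = max(ranges[-1].t_end, t_end)
            return
        if ranges:
            # Different owner (assumption): clamp the new start so the
            # ranges share at most a boundary instant with the previous one.
            t_start = max(t_start, ranges[-1].t_end)
        if t_start <= t_end:
            ranges.append(TimeRange(owner, t_start, t_end))
        # Otherwise the heartbeat lies entirely inside the previous
        # owner's range and is dropped in this sketch.

    def ranges(self, ip):
        # Return a copy so callers cannot break the invariants.
        return list(self._ranges.get(ip, []))
```

Under these assumptions, two heartbeats from the same workload collapse into one range, and a heartbeat from a different workload starts a new one.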
When to use¶
- You can observe or produce a steady stream of (resource, owner, t_start, t_end) tuples from the data plane, i.e. every business operation already implies ownership at its moment of capture.
- Misattribution is more costly than unattribution.
- You want the attribution service to be stateless enough to cold-start without persistent storage.
When not to use¶
- Heartbeat frequency is too low to produce useful ownership coverage (e.g. ownership changes faster than heartbeats).
- The resource cannot be observed at the endpoint that owns it (e.g. AWS ELBs from outside the ELB layer). For these, keep an event-based fallback.
Canonical example¶
systems/netflix-flowcollector in Netflix's 2025 eBPF flow-log
attribution redesign. Every TCP flow close emitted by
systems/netflix-flowexporter carries
(local_ip, local_workload_id, t_start, t_end) — each such record
is simultaneously a business flow log and a heartbeat extending the
local IP's current-owner time range in FlowCollector's map. A remote
IP is attributed by looking up its range list in that map and picking
the time range whose interval contains the flow's t_start. The fleet
processes 5M flows/sec on 30 c7i.2xlarge instances with no persistent
storage; a 2-week Zuul validation window showed zero misattribution,
versus ~40% under the prior event-based design.
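The remote-IP lookup described above amounts to a binary search over one IP's sorted ranges. The following is a rough sketch under stated assumptions: `lookup_owner` is a hypothetical name, timestamps are floats, and ranges are treated as closed intervals (the source does not say whether the end is inclusive).

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class TimeRange(NamedTuple):
    owner: str
    t_start: float
    t_end: float


def lookup_owner(ranges: List[TimeRange], flow_t_start: float) -> Optional[str]:
    """Return the owner whose range contains flow_t_start, else None.

    `ranges` must be sorted ascending by t_start and non-overlapping.
    """
    # Rebuilt per call for clarity; a real map would likely keep the
    # start times in a parallel sorted list instead.
    starts = [r.t_start for r in ranges]
    # Index of the last range that starts at or before the flow's start.
    i = bisect_right(starts, flow_t_start) - 1
    if i < 0:
        return None  # flow predates every known range
    candidate = ranges[i]
    # A gap between ranges means nobody is known to own the IP then.
    return candidate.owner if flow_t_start <= candidate.t_end else None
```

Returning `None` for a timestamp that falls in a gap, rather than guessing the nearest range, matches the pattern's preference for leaving a flow unattributed over misattributing it.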
Related primitives¶
- concepts/heartbeat-based-ownership — the underlying data structure + invariants.
- concepts/discrete-event-vs-heartbeat-attribution — the structural comparison vs. event-stream attribution.
- patterns/sidecar-ebpf-flow-exporter — the producer of heartbeats at the data plane.
- patterns/kafka-broadcast-for-shared-state — how nodes share their per-IP time ranges cluster-wide.
- patterns/accept-unattributed-flows — the design posture that makes the correctness-over-coverage tradeoff explicit.
Trade-offs¶
- Latency: attribution cannot happen until heartbeats covering the lookup window have arrived. Netflix buffers flows for 1 minute on disk to wait for the remote FlowExporter's next batch; the pre-redesign discrete-event system had a 15-minute holdback.
- Coverage gaps: an IP's very first flow heartbeat is unattributed on the receiving node until a peer broadcasts a time range that covers it. Netflix retries after a delay and, if the lookup still fails, gives up and leaves the flow unattributed.
- Cost: in-memory state scales linearly with active IPs × recent time window; Netflix processes 5M flows/sec on 30 c7i.2xlarge.
- Cold start: disposable — new node rebuilds its map from incoming flows + Kafka-broadcast backlog within minutes.
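The retry-then-give-up behaviour in the coverage-gap trade-off could look roughly like the sketch below. The function name `attribute_with_retry`, the attempt count, and the delay value are all illustrative assumptions; the source states only that a lookup is retried after a delay and then abandoned.

```python
import time
from typing import Callable, Optional


def attribute_with_retry(
    lookup: Callable[[str, float], Optional[str]],
    remote_ip: str,
    flow_t_start: float,
    max_attempts: int = 2,   # assumed: one initial try plus one retry
    delay_s: float = 1.0,    # assumed delay between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try to attribute a remote IP; return None if it stays unknown."""
    for attempt in range(max_attempts):
        owner = lookup(remote_ip, flow_t_start)
        if owner is not None:
            return owner
        if attempt + 1 < max_attempts:
            # Give peers time to broadcast a range covering this flow.
            sleep(delay_s)
    # Give up: the flow is left unattributed rather than guessed.
    return None
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; in a high-throughput collector the wait would more plausibly be a delayed re-enqueue than a blocking sleep.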
Seen in¶
- sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs — canonical wiki instance; the headline architectural move of the post.