
PATTERN

Heartbeat-derived IP ownership map

Maintain a per-IP list of non-overlapping (workload_id, t_start, t_end) time ranges populated entirely from data-plane heartbeats. Attribution of a flow's remote IP is a time-range lookup keyed on the flow's start timestamp. No event-stream from a control plane is required for the workload-IP axis.

Structure

class IPOwnershipMap {
    Map<IPAddr, SortedList<TimeRange>> map;   // in-memory, per-node
    struct TimeRange {
        WorkloadId owner;
        Timestamp t_start;
        Timestamp t_end;    // extended by each heartbeat
    }
}

Invariants:

  • Ranges per IP are sorted ascending by t_start and non-overlapping (an IP can only belong to one workload at a time).
  • Every arriving heartbeat either extends the trailing range (same owner) or appends a new range (different owner).
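The structure and invariants above can be sketched in a few lines of Python. The class and method names (`record_heartbeat`, `attribute`) are illustrative, not from any real implementation; the lookup is a binary search for the rightmost range whose t_start precedes the flow's timestamp.

```python
import bisect
from dataclasses import dataclass

@dataclass
class TimeRange:
    owner: str       # WorkloadId
    t_start: float
    t_end: float     # extended by each heartbeat

class IPOwnershipMap:
    """Per-IP sorted, non-overlapping ownership ranges fed by heartbeats."""

    def __init__(self):
        self._map: dict[str, list[TimeRange]] = {}

    def record_heartbeat(self, ip: str, owner: str, t_start: float, t_end: float) -> None:
        ranges = self._map.setdefault(ip, [])
        if ranges and ranges[-1].owner == owner:
            # Same owner: extend the trailing range.
            ranges[-1].t_end = max(ranges[-1].t_end, t_end)
        else:
            # Different owner: close the previous range, append a new one.
            if ranges:
                ranges[-1].t_end = min(ranges[-1].t_end, t_start)
            ranges.append(TimeRange(owner, t_start, t_end))

    def attribute(self, ip: str, ts: float):
        """Return the workload that owned `ip` at time `ts`, or None."""
        ranges = self._map.get(ip, [])
        # Rightmost range whose t_start <= ts.
        i = bisect.bisect_right([r.t_start for r in ranges], ts) - 1
        if i >= 0 and ranges[i].t_start <= ts <= ranges[i].t_end:
            return ranges[i].owner
        return None
```

A lookup against a gap between ranges (an interval no heartbeat covered) returns None rather than guessing, which is the point of the pattern: misattribution is the failure mode being designed out.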

When to use

  • You can observe or produce a steady stream of (resource, owner, t_start, t_end) tuples from the data plane — i.e. every business operation already implies ownership at its moment of capture.
  • Misattribution is more costly than leaving a flow unattributed.
  • You want the attribution service to be stateless enough to cold-start without persistent storage.

When not to use

  • Heartbeat frequency is too low to produce useful ownership coverage (e.g. ownership changes faster than heartbeats).
  • The resource cannot be observed at the endpoint that owns it (e.g. AWS ELBs from outside the ELB layer). For these, keep an event-based fallback.

Canonical example

systems/netflix-flowcollector in Netflix's 2025 eBPF flow-log attribution redesign. Every TCP flow close emitted by systems/netflix-flowexporter carries (local_ip, local_workload_id, t_start, t_end) — each such record is simultaneously a business flow log and a heartbeat extending the local IP's current-owner time range in FlowCollector's map. Remote IPs are attributed by looking up the map for the remote IP and selecting the time range whose interval contains the flow's t_start. 5M flows/sec processed on 30 c7i.2xlarge instances with no persistent storage; a 2-week Zuul validation window showed zero misattribution vs. ~40% under the prior event-based design.

Trade-offs

  • Latency: attribution cannot happen until heartbeats covering the lookup window have arrived. Netflix buffers flows for 1 minute on disk to wait for the remote FlowExporter's next batch; the pre-redesign discrete-event system had a 15-minute holdback.
  • Coverage gaps: an IP's very first flow heartbeat is unattributed on the receiving node until a peer broadcasts a time range that covers it. Netflix retries the lookup after a delay before giving up.
  • Cost: in-memory state scales linearly with active IPs × recent time window; Netflix processes 5M flows/sec on 30 c7i.2xlarge instances.
  • Cold start: disposable — new node rebuilds its map from incoming flows + Kafka-broadcast backlog within minutes.
