CONCEPT Cited by 1 source

Network intermediary resolution

Network intermediary resolution is the operation of collapsing multi-hop network flows that pass through intermediaries (load balancers, NAT gateways, API gateways, proxies) into the direct application-to-application relationships engineers actually want to see in a service dependency graph. Without it, a network-level capture system records App A → Load Balancer and Load Balancer → App B as two separate edges, and the dependency graph is polluted with intermediaries that appear as spurious nodes between every pair of services sharing a load balancer.

The problem statement

"Network flow logs only show individual network hops through intermediaries (App A → Load Balancer → App B, or App A → NAT Gateway → App B), not the true application-level connections we need (App A → App B)." (sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map)

The intermediaries are real network components doing useful work (routing, NAT, TLS termination, rate limiting), but in the engineer's view of the dependency graph they are noise. A "who calls App B?" query that returns "the load balancer" is unhelpful; the engineer needs "App A, App C, App F."

Why this is hard:

  • Same intermediary, many service pairs. A single load balancer can front hundreds of services and thousands of clients, so the same intermediary can appear in millions of flow records per second across many independent service pairs.
  • State to combine inbound and outbound flows. The mapping "this inbound flow on the LB came from App A; this outbound flow from the LB went to App B" requires correlating two separately-captured flows by some join key (typically the intermediary's connection state, which IP/port pair the intermediary used, or session affinity).
  • Hot spots. Intermediaries see far more traffic than any individual application — "specific applications or network intermediaries see 100x more traffic than others." A naive per-intermediary aggregator becomes a bottleneck.
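The correlation step described in the list above can be sketched as a join of inbound and outbound flows on a shared key. This is a minimal illustration, not Netflix's algorithm: the `Flow` record, the `join_key` field, and the `resolve` function are all hypothetical names, and the sketch assumes flows have already been attributed to named applications or intermediaries.

```python
from collections import defaultdict
from typing import NamedTuple


class Flow(NamedTuple):
    src: str       # attributed source (application or intermediary name)
    dst: str       # attributed destination
    join_key: str  # hypothetical correlation key, e.g. derived from the
                   # intermediary's connection state or IP/port pair


def resolve(flows, intermediaries):
    """Collapse src -> intermediary -> dst hop pairs into direct src -> dst edges."""
    inbound = defaultdict(set)   # (intermediary, join_key) -> callers
    outbound = defaultdict(set)  # (intermediary, join_key) -> callees
    edges = set()

    for f in flows:
        if f.dst in intermediaries:
            inbound[(f.dst, f.join_key)].add(f.src)
        elif f.src in intermediaries:
            outbound[(f.src, f.join_key)].add(f.dst)
        else:
            edges.add((f.src, f.dst))  # already a direct app-to-app flow

    # Join the two sides: only flows sharing the same intermediary AND the
    # same key are paired, so unrelated service pairs are not cross-linked.
    for key, callers in inbound.items():
        for caller in callers:
            for callee in outbound.get(key, ()):
                edges.add((caller, callee))
    return edges


flows = [
    Flow("app-a", "lb-1", "k1"),
    Flow("lb-1", "app-b", "k1"),
    Flow("app-c", "lb-1", "k2"),
    Flow("lb-1", "app-d", "k2"),
    Flow("app-a", "app-e", "k3"),
]
print(sorted(resolve(flows, {"lb-1"})))
```

Joining on the key matters: pairing every inbound flow on `lb-1` with every outbound flow would also produce edges such as app-a → app-d that never existed. The sketch does not handle chained intermediaries (for example, NAT gateway → load balancer), which would need repeated resolution.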

The Netflix mechanism

Resolution happens as Stage 2 of Netflix Service Topology's three-stage distributed aggregation pipeline:

"Stage 2 applies resolution logic — identifying network intermediaries (load balancers, NAT gateways, API gateways, proxies) and combining their incoming and outgoing flows to reconstruct direct application-to-application paths."

The key operational property: Stage 2 works in concert with Stages 1 and 3 (initial Kafka aggregation; final aggregation + health-status integration), so the hot-spot load on intermediaries doesn't sink the whole pipeline:

"This graduated approach also prevents hot spots by distributing load across multiple points even when specific applications or network intermediaries see 100x more traffic than others."

The post does not decompose Stage 2's join algorithm in detail (deferred to the engineering-deep-dive follow-up post).
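Because the post leaves the details out, the following is only a generic illustration of why aggregating in stages relieves hot spots; it is not a description of Netflix's pipeline. It assumes raw flow records are `(src, dst)` tuples spread across independent shards, and the function names are hypothetical.

```python
from collections import Counter


def partial_aggregate(shards):
    """Each shard independently collapses its raw (src, dst) records into counts."""
    return [Counter(shard) for shard in shards]


def merge_partials(partials):
    """A later stage merges compact per-shard counts instead of raw records."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# A hot intermediary ("lb-1") dominates the raw records on both shards.
shards = [
    [("app-a", "lb-1")] * 3 + [("lb-1", "app-b")] * 3,
    [("app-a", "lb-1")] * 2,
]
partials = partial_aggregate(shards)
merged = merge_partials(partials)
```

Here eight raw records shrink to three per-shard entries before anything is keyed on the intermediary, so the stage that handles `lb-1` receives work proportional to the number of shards and distinct pairs rather than to raw traffic volume.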

What counts as an intermediary

The post explicitly enumerates four classes:

  • Load balancers — AWS ELB / ALB / NLB, internal LBs, sidecar proxies acting as L7 LBs.
  • NAT gateways — both AWS NAT Gateways and any host-level NAT hops.
  • API gateways — service-mesh ingress controllers, dedicated API-gateway services, etc.
  • Proxies — Envoy sidecars, dedicated forward proxies.

The post doesn't describe the identification mechanism beyond "identifying network intermediaries"; the classification likely lives in a registry or lookup table keyed off IP, DNS, or metadata. Netflix's FlowCollector, the upstream substrate, already does heartbeat-based ownership attribution per IP, so the intermediary classification plausibly rides on the same registry.
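To make the registry idea concrete, here is a minimal sketch of an IP-keyed lookup table covering the four classes. Everything in it is an assumption for illustration: the `REGISTRY` contents, the `IntermediaryKind` enum, and the function names are hypothetical, and the addresses come from the reserved documentation ranges.

```python
from enum import Enum
from typing import Optional


class IntermediaryKind(Enum):
    LOAD_BALANCER = "load_balancer"
    NAT_GATEWAY = "nat_gateway"
    API_GATEWAY = "api_gateway"
    PROXY = "proxy"


# Hypothetical registry: IP -> (owner name, intermediary kind or None for apps).
REGISTRY = {
    "192.0.2.10": ("lb-1", IntermediaryKind.LOAD_BALANCER),
    "192.0.2.20": ("nat-1", IntermediaryKind.NAT_GATEWAY),
    "198.51.100.5": ("app-a", None),
}


def classify(ip: str) -> Optional[IntermediaryKind]:
    """Return the intermediary kind for an IP, or None for apps and unknown IPs."""
    entry = REGISTRY.get(ip)
    return entry[1] if entry else None


def is_intermediary(ip: str) -> bool:
    return classify(ip) is not None
```

A real registry would also have to deal with IP reuse over time, which a static dictionary like this one ignores.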

Sibling: ELB fallback in FlowCollector

The 2025-04-08 FlowExporter/FlowCollector post discloses a related-but-distinct intermediary problem at the attribution layer: ELB IPs cannot be heartbeat-attributed because FlowExporter can't run on an ELB, so FlowCollector falls back to a discrete-event source (Sonar) for ELB-IP-to-workload mapping. (sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs)

The two intermediary problems compose:

  1. Attribution layer (FlowCollector): "this IP belongs to ELB X serving workload W" — answered via Sonar fallback.
  2. Topology layer (Service Topology Stage 2): "this two-hop flow App A → ELB X → App B should be represented as a single edge App A → App B in the dependency graph" — answered via intermediary resolution.

The first decides what the intermediary is; the second decides how to erase it from the engineer-visible graph.

Why this matters for the graph's usefulness

Without intermediary resolution, the dependency graph becomes:

  • Cluttered: every load balancer appears as a "node" between most service pairs.
  • Misleading: blast-radius queries return huge fan-outs through the LB rather than the actual upstream callers.
  • Wrong on root-cause questions: "is something I depend on broken?" lights up the LB, not the service behind the LB that is actually failing.

A service dependency graph exists to answer engineer-level dependency questions, and without intermediary resolution it cannot.
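The difference shows up in the simplest possible query. The sketch below runs a "who calls App B?" lookup against a raw hop-level edge set and against a resolved one; the edge sets and the `callers_of` helper are made up for illustration.

```python
def callers_of(edges, target):
    """Return the sorted sources of all edges that end at `target`."""
    return sorted(src for src, dst in edges if dst == target)


# Raw hop-level edges: every caller terminates at the load balancer.
raw_edges = {
    ("app-a", "lb-1"),
    ("app-c", "lb-1"),
    ("app-f", "lb-1"),
    ("lb-1", "app-b"),
}

# The same traffic after intermediary resolution.
resolved_edges = {
    ("app-a", "app-b"),
    ("app-c", "app-b"),
    ("app-f", "app-b"),
}

print(callers_of(raw_edges, "app-b"))       # ['lb-1']
print(callers_of(resolved_edges, "app-b"))  # ['app-a', 'app-c', 'app-f']
```

On the raw graph, recovering the real callers requires a second hop through `lb-1`, and if that load balancer also fronted other services the second hop would return their callers too.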
