NETFLIX 2026-05-29 Tier 1

Netflix — From Silos to Service Topology: Why Netflix Built a Real-Time Service Map

Summary

Tier-1 Netflix TechBlog architecture-disclosure post and the first wiki canonicalisation of Service Topology, Netflix's real-time service-dependency graph of "thousands of microservices" delivering streaming, Live, and Ads-supported plans worldwide. The post is the first in a promised series; it stays at the architectural-shape layer (engineering challenges deferred to a follow-up). The novel framing is three parallel sources of truth: eBPF network flows, IPC metrics from gRPC/GraphQL/REST instrumentation, and aggregated end-to-end traces. Each produces a physically separate graph, queryable independently or merged in parallel at query time to produce a unified view (patterns/three-layer-graph-merge-on-query). Each layer compensates for the others' blind spots: eBPF gives ground-truth coverage but lacks application context; IPC metrics give rich endpoint/protocol detail but only for instrumented services; tracing gives actual runtime call paths but is sampled.

The post also discloses two structural mechanisms: (1) a three-stage distributed aggregation pipeline built on Apache Pekko Streams running over multi-region Kafka: initial Kafka aggregation → network-intermediary resolution (load balancers, NAT gateways, API gateways, proxies collapsed so the stored edge is direct application-to-application) → final aggregation with health-status integration before persistence (patterns/three-stage-flow-aggregation-pipeline); and (2) time-window aggregation over the temporal graph: instead of storing every time slice, layer-specific aggregators accumulate across windows so historical views can be reconstructed efficiently without exploding storage (concepts/temporal-topology-query, patterns/time-window-aggregator-for-temporal-graph).
Storage sits on top of Netflix's graph database for the eBPF and IPC layers, with columnar storage optimized for analytical queries for the tracing layer. Access is via a gRPC service supporting multi-hop traversal, filtering by availability tier and business domain, pagination, and "sub-second query response times", a target that holds "even when combining all three layers". The single concrete operational target named is sub-second multi-hop traversal; no QPS, fleet-wide flow-record rate, graph-size, or latency-distribution numbers are disclosed in this post (deferred to the follow-up).

Four years of support-ticket mining surfaced a consistent pattern of dependency questions that formed the design brief: "What are my upstream and downstream dependencies?", "Is this failure in my service, or is something I depend on broken?", "Which services will be impacted if I take this down for maintenance?", "Why is this service showing as 'Unknown' in my metrics?", and "What changed in my call path recently that could explain this behavior?"

The strategic motivation is named: Live programming and Ads-supported plans need faster troubleshooting, and Live events "can't wait for lengthy incident investigations". Service Topology is one of the substrates that makes faster MTTR possible.

Sibling to the 2025-04-08 FlowExporter / FlowCollector ingest post (sources/2025-04-08-netflix-how-netflix-accurately-attributes-ebpf-flow-logs): that post canonicalised the flow-attribution layer (resolving every flow to a workload identity); this post canonicalises what's built on top (the multi-source service-dependency graph + the gRPC API surface + programmatic blast-radius computation). Together the two posts span the bottom-up arc from kernel TCP tracepoint → workload identity → flow record → aggregated graph → engineer-facing UI + automated systems API.
The post explicitly forward-references automated root-cause analysis: "an intelligent agent that continuously crawls the topology graph, correlates failures across dependencies, understands historical patterns, and surfaces likely root causes automatically. Service topology provides the knowledge graph foundation that makes this kind of intelligent automation possible."

Key takeaways

  1. Three parallel sources of truth, not one — Service Topology builds three physically separate graphs, each from a different capture mechanism, and merges them at query time only when the user asks for a unified view. (concepts/multi-source-topology-fusion)
     • eBPF network flows. Captured at the kernel level on every host via FlowExporter sidecars. "Every service shows up here because we're capturing actual network traffic, regardless of whether applications are instrumented." Provides cluster-level and app-level topology but lacks application context: "we know Service A connected to Service B's IP address using a specific protocol, but not which specific API endpoint or path was called."
     • IPC (Inter-Process Communication) metrics. Emitted by applications when they call other services via gRPC, GraphQL, REST, or other protocols. "We can see which specific endpoints were called, error rates, latency distributions, protocol details, and request/response characteristics." Limited to instrumented services. (concepts/ipc-metrics)
     • End-to-end tracing. Aggregated traces produce graph edges; individual traces can also be overlaid on the topology to show specific request flows. "Not just 'Service A can call Service B,' but 'Service A did call Service B as part of serving this specific member request.'" Sampled, so it "may miss rarely-used code paths in the aggregated view."

The structural insight: each layer's limitation is another layer's strength. "Network flows ensure completeness — we don't miss anything. IPC metrics provide application details — we understand the 'how' and 'what'. Tracing shows actual behavior — we see real request patterns. Each source compensates for the limitations of the others."
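
One way to make the complementary-layers point concrete is a set comparison between layers. The following is a minimal Python sketch, assuming each layer has already been reduced to a set of (caller, callee) edges; the function name, the gap labels, and the sample services are hypothetical and do not come from the post.

```python
def coverage_gaps(network_edges, ipc_edges, trace_edges):
    """Compare per-layer edge sets (hypothetical representation)."""
    return {
        # Seen on the wire but not reported by IPC instrumentation.
        "uninstrumented": network_edges - ipc_edges,
        # Known to IPC metrics but absent from the sampled traces.
        "unsampled": ipc_edges - trace_edges,
    }


# Illustrative data only; service names are made up.
network = {("api", "playback"), ("api", "ads"), ("playback", "cdn-steer")}
ipc = {("api", "playback"), ("api", "ads")}
traces = {("api", "playback")}

gaps = coverage_gaps(network, ipc, traces)
print(gaps)
```

Edges that appear only in the network layer are candidates for the "Unknown" entries engineers asked about, since the eBPF layer sees traffic that uninstrumented services never report.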

  2. Each source is a physically separate graph; merging is a parallel-query operation, not a write-time fusion. (patterns/three-layer-graph-merge-on-query) Verbatim: "Each source creates its own graph that is physically separate — the network layer in one graph database partition, the IPC layer in another partition, and the tracing layer using columnar storage optimized for analytical queries. This physical separation allows each layer to evolve independently and be queried in parallel. When users request a unified view, we execute traversal queries across all layers simultaneously and merge results, achieving sub-second response times even when combining all three layers." This is a deliberate choice over write-time fusion: the layers can have different storage shapes (graph DB partitions for network/IPC; columnar for traces), different update cadences, and different ownership boundaries.
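
The merge-on-query idea can be sketched as a fan-out followed by a read-time union. This is a minimal Python sketch under stated assumptions: the three stores are replaced by in-memory dictionaries, and `LAYERS`, `query_layer`, and `unified_view` are hypothetical names, not Netflix's API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the three physically separate stores.
LAYERS = {
    "network": {"api": ["playback", "ads"]},
    "ipc": {"api": ["playback"]},
    "trace": {"api": ["playback"], "playback": ["license"]},
}


def query_layer(layer, service):
    """Return the outgoing edges of `service` as seen by one layer."""
    return [(service, callee) for callee in LAYERS[layer].get(service, [])]


def unified_view(service):
    """Query every layer in parallel, then merge edges at read time."""
    with ThreadPoolExecutor(max_workers=len(LAYERS)) as pool:
        futures = {name: pool.submit(query_layer, name, service) for name in LAYERS}
    merged = {}
    for name, future in futures.items():
        for edge in future.result():
            # Keep provenance: which layers observed this edge.
            merged.setdefault(edge, set()).add(name)
    return merged


view = unified_view("api")
print(view)
```

Keeping the set of contributing layers on each merged edge preserves the information a layer toggle would need, while nothing is fused at write time.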

  3. Three-stage distributed aggregation collapses network intermediaries. (patterns/three-stage-flow-aggregation-pipeline) Raw eBPF flow records show individual network hops through intermediaries — "App A → Load Balancer → App B, or App A → NAT Gateway → App B", two separate edges — but the topology that's useful to engineers is the direct application-to-application relationship, "App A → App B". Three stages:

     • Stage 1: initial Kafka aggregation. Consume flow records from Kafka.
     • Stage 2: network-intermediary resolution. Identify load balancers, NAT gateways, API gateways, and proxies; combine their incoming and outgoing flows to reconstruct the direct application-to-application path. (concepts/network-intermediary-resolution, patterns/network-intermediary-flow-resolution)
     • Stage 3: final aggregation with health-status integration before graph persistence.

Operational property named: "This graduated approach also prevents hot spots by distributing load across multiple points even when specific applications or network intermediaries see 100x more traffic than others." The graduated structure is also a load-balancing structure.
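
Stage 2 can be illustrated with a small join of inbound and outbound flows around each intermediary. This is a minimal Python sketch, not the post's algorithm: it assumes flows are (src, dst, count) tuples, that the set of intermediaries is already known, and that only a single intermediary sits between two applications. All names are hypothetical.

```python
from collections import defaultdict


def resolve_intermediaries(flows, intermediaries):
    """Collapse src -> intermediary -> dst hops into direct src -> dst edges.

    Simplifications (assumptions of this sketch): chains of intermediaries
    are not followed, and each caller's count is attributed in full to every
    backend behind the intermediary.
    """
    inbound = defaultdict(list)   # intermediary -> [(caller, count)]
    outbound = defaultdict(list)  # intermediary -> [(backend, count)]
    direct = defaultdict(int)     # (caller, backend) -> count
    for src, dst, count in flows:
        if dst in intermediaries:
            inbound[dst].append((src, count))
        elif src in intermediaries:
            outbound[src].append((dst, count))
        else:
            direct[(src, dst)] += count
    for hop, callers in inbound.items():
        for src, count in callers:
            for dst, _ in outbound.get(hop, []):
                direct[(src, dst)] += count
    return dict(direct)


flows = [("app-a", "lb-1", 10), ("lb-1", "app-b", 10), ("app-c", "app-b", 3)]
edges = resolve_intermediaries(flows, {"lb-1"})
print(edges)
```

The two hops through `lb-1` become one stored edge from `app-a` to `app-b`, which is the application-to-application relationship the post says engineers actually want.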

  4. Apache Pekko Streams runs the pipeline; auto-scaling groups provide the distribution substrate. (systems/apache-pekko) Verbatim: "We use Apache Pekko Streams (a fork of Akka) to process these flows in a distributed, fault-tolerant pipeline. The system automatically partitions work across our Auto Scaling Groups to handle the volume and provides natural backpressure handling." First wiki naming of Pekko as a Netflix production stream-processing substrate. The backpressure property is load-bearing: Pekko Streams' "natural backpressure" is what makes it a sensible choice for ingesting "millions of flow records per second."
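
Backpressure itself is easy to show in miniature. The sketch below is a Python analogy using a bounded queue, not Pekko Streams code: the producer blocks whenever the buffer is full, so a slow consumer limits how far ahead the producer can run. `run_pipeline` and its parameters are hypothetical names.

```python
import queue
import threading


def run_pipeline(n_records, buffer_size):
    """Push n_records through a bounded buffer; return (consumed, max_depth)."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking end of stream
    consumed = []
    max_depth = 0

    def producer():
        nonlocal max_depth
        for i in range(n_records):
            buf.put(i)  # blocks while the buffer is full (backpressure)
            max_depth = max(max_depth, buf.qsize())
        buf.put(done)

    def consumer():
        while True:
            item = buf.get()
            if item is done:
                break
            consumed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, max_depth


consumed, max_depth = run_pipeline(50, 4)
print(len(consumed), max_depth)
```

The buffer depth never exceeds `buffer_size`, so memory stays bounded regardless of how bursty the producer is. Reactive-streams libraries such as Pekko Streams achieve a similar effect by signalling demand upstream instead of blocking threads.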

  5. Storage model: per-layer specialised substrate, not a uniform graph DB. Network and IPC layers ride Netflix's graph database (described as "an abstraction layer built on top of our distributed key-value storage infrastructure… designed for high-throughput graph operations at our scale, with fast multi-hop traversal capabilities"). The tracing layer rides columnar storage optimized for analytical queries instead. The choice is shaped by query patterns: graph-DB for path traversal, columnar for trace-shape aggregation analytics.

  6. gRPC API exposes multi-hop traversal + tier/domain filters + pagination — sub-second response. (systems/grpc) Verbatim: "We expose the topology through a gRPC service that supports multi-hop traversal, filtering by availability tier and business domain, pagination for large result sets, and sub-second query response times." This makes Service Topology a substrate for automation, not just a UI: Netflix's Platform Modernization Engineering team uses the API "to verify that critical Live services have proper availability tier classifications throughout their dependency chains." Resilience frameworks, blast-radius calculators, incident-response automation are explicitly named as intended consumers.
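
The query features named in the quote (multi-hop traversal, tier filtering, pagination) can be sketched as a bounded breadth-first search. This is a minimal Python sketch over an in-memory graph, not the gRPC service: `GRAPH`, `TIER`, `traverse`, and the offset-style page token are hypothetical, and the sketch assumes a lower tier number means a more critical service.

```python
from collections import deque

# Hypothetical dependency graph (caller -> callees) and availability tiers.
GRAPH = {
    "api": ["playback", "ads"],
    "playback": ["license", "cdn-steer"],
    "ads": ["ad-decision"],
}
TIER = {"api": 0, "playback": 0, "ads": 1, "license": 0, "cdn-steer": 1, "ad-decision": 2}


def traverse(root, max_hops, max_tier=None, page_size=2, page_token=0):
    """Bounded BFS over dependencies with a tier filter and offset paging."""
    seen = {root}
    results = []
    frontier = deque([(root, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for dep in GRAPH.get(node, []):
            if dep in seen:
                continue
            seen.add(dep)
            frontier.append((dep, depth + 1))
            # The filter hides results but does not prune the walk.
            if max_tier is None or TIER[dep] <= max_tier:
                results.append((dep, depth + 1))
    end = page_token + page_size
    next_token = end if end < len(results) else None
    return results[page_token:end], next_token


first_page, token = traverse("api", max_hops=2)
print(first_page, token)
```

A caller keeps passing the returned token back until it is `None`. A production service would more likely use an opaque cursor than a raw offset; the offset is only the simplest thing that shows the shape.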

  7. Time travel via window-accumulating aggregators, not snapshot-per-slice. (concepts/temporal-topology-query, patterns/time-window-aggregator-for-temporal-graph) Engineers can "query what the topology looked like at specific points in the past." The implementation: "instead of storing every time slice separately, we use layer-specific aggregators that accumulate topology data across windows, allowing us to reconstruct historical views efficiently without exploding storage costs." The key efficiency property, historical reconstruction without per-slice storage, is what makes time travel affordable on a topology that updates "as services deploy, as traffic patterns shift."
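
The post does not describe the aggregator's internals, so the following is only one possible reading of "accumulate topology data across windows": store, per edge, the runs of consecutive windows in which it was observed, and answer historical queries from those runs. In this Python sketch the window size, the class name, and the in-order-arrival assumption are all hypothetical.

```python
WINDOW_SECONDS = 300  # hypothetical window size; the post does not state one


class EdgeWindowAggregator:
    """Stores runs of active windows per edge instead of one snapshot per slice.

    Assumes observations for a given edge arrive in timestamp order.
    """

    def __init__(self):
        self.intervals = {}  # edge -> list of [first_window, last_window]

    def observe(self, edge, timestamp):
        window = int(timestamp // WINDOW_SECONDS)
        spans = self.intervals.setdefault(edge, [])
        if spans and window <= spans[-1][1] + 1:
            # Same or adjacent window: extend the current run.
            spans[-1][1] = max(spans[-1][1], window)
        else:
            # Gap in observations: start a new run.
            spans.append([window, window])

    def topology_at(self, timestamp):
        """Reconstruct the set of edges active in the window containing timestamp."""
        window = int(timestamp // WINDOW_SECONDS)
        return {
            edge
            for edge, spans in self.intervals.items()
            if any(first <= window <= last for first, last in spans)
        }


agg = EdgeWindowAggregator()
for t in (0, 300, 600, 1500):
    agg.observe(("api", "playback"), t)
agg.observe(("api", "ads"), 600)
print(agg.topology_at(600))
```

Storage here grows with the number of times an edge appears or disappears, not with the number of time slices, which is the property the post attributes to its aggregators.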

  8. Programmatic blast-radius computation is now a graph traversal instead of mental stitching. (concepts/topology-aware-blast-radius) The post lists "What will be impacted if I take this down for maintenance?" and "Is my problem caused by an upstream issue, or am I the root cause?" as canonical engineer questions surfaced over four years of support-request mining. Service Topology turns both into a bounded-depth graph traversal: upstream traversal answers blast-radius; downstream traversal points at the propagation path. "Identify which teams to notify and what to monitor."
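
A bounded-depth blast-radius query can be sketched as a breadth-first walk over reversed call edges, from a service to everything that calls it directly or transitively. This is a minimal Python sketch; the call list, the ownership map, and the function name are hypothetical illustrations, not data or APIs from the post.

```python
from collections import defaultdict, deque

# Hypothetical call edges (caller, callee) and owning teams.
CALLS = [("api", "playback"), ("playback", "license"), ("ads", "license"), ("api", "ads")]
OWNER = {"api": "edge-team", "playback": "playback-team", "ads": "ads-team", "license": "drm-team"}


def blast_radius(target, calls, max_hops):
    """Return {service: hops} for every caller within max_hops of target."""
    callers = defaultdict(set)
    for caller, callee in calls:
        callers[callee].add(caller)
    impacted = {}
    frontier = deque([(target, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for caller in callers[node]:
            if caller != target and caller not in impacted:
                impacted[caller] = depth + 1
                frontier.append((caller, depth + 1))
    return impacted


impacted = blast_radius("license", CALLS, max_hops=2)
teams = sorted({OWNER[service] for service in impacted})
print(impacted, teams)
```

Mapping the impacted services to their owners gives the "which teams to notify" list directly; walking the call edges in the forward direction instead would list the dependencies a failing service might be inheriting its problem from.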

  9. The forward-looking thesis: the topology graph IS the knowledge graph foundation for automated root-cause analysis. "Imagine an intelligent agent that continuously crawls the topology graph, correlates failures across dependencies, understands historical patterns, and surfaces likely root causes automatically. Service topology provides the knowledge graph foundation that makes this kind of intelligent automation possible." Sibling to the 2026-05-04 Model Lifecycle Graph in graph-substrate-as-AI-substrate framing: Netflix is converging on graph-shaped substrates as the layer over which automated reasoning will run.

  10. The strategic-motivation framing: Live and Ads make MTTR matter more. "New initiatives like our Live programming and Ads-supported plans require even more sophisticated monitoring and faster troubleshooting. Live events can't wait for lengthy incident investigations. The scale and real-time nature of these systems demanded better tooling." This positions Service Topology as part of a wider Netflix infrastructure shift toward real-time observability, the same arc the Live Operations post canonicalised on the human-process side.

Architectural overview

                    ┌─────────────────────────────────┐
                    │   Three independent capture     │
                    │   substrates (per host / app)   │
                    └─────────────────────────────────┘
        ┌───────────────────┬───────┴────────┐
        ▼                   ▼                ▼
   eBPF flows         IPC metrics      End-to-end traces
   (FlowExporter      (gRPC/GraphQL/   (sampled, request-
    sidecar)          REST instrum.)   level)
        │                   │                │
        ▼                   ▼                ▼
   Multi-region Kafka (flow logs path) / metric pipelines / trace pipelines
        │                   │                │
        ▼                   ▼                ▼
   ┌─────────────────────────────────────────────────────────────┐
   │  Apache Pekko Streams (eBPF flow path) — three-stage pipe   │
   │  Stage 1: initial Kafka aggregation                         │
   │  Stage 2: network-intermediary resolution                   │
   │           (LBs / NAT-GWs / API-GWs / proxies collapsed)     │
   │  Stage 3: final aggregation + health-status enrichment      │
   └─────────────────────────────────────────────────────────────┘
        ┌───────────────────┬───────┴────────┐
        ▼                   ▼                ▼
   Network graph        IPC graph        Trace graph
   (Netflix graph DB    (Netflix graph    (columnar storage,
    partition)           DB partition)     analytical queries)
        │                   │                │
        └────────────┬──────┴────────┬───────┘
                     │               │
                     ▼               ▼
                 ┌─────────────────────────┐
                 │  gRPC topology service  │
                 │  - multi-hop traversal  │
                 │  - tier / domain filter │
                 │  - pagination           │
                 │  - parallel-merge for   │
                 │    unified-view queries │
                 └─────────────────────────┘
            ┌─────────────────┴────────────────┐
            ▼                                  ▼
        UI (engineers)                Automated systems
        - dependency view             - resilience frameworks
        - blast-radius preview        - blast-radius calculators
        - layer toggles               - incident-response
        - status overlay              - tier verification

Operational numbers

The post is announcement-shape, not a deployment retrospective. Concrete figures disclosed:

  • Sub-second multi-hop traversal, including across all three layers merged.
  • Multi-region Kafka ingestion across all AWS regions Netflix operates in. "Millions of flow records as they arrive" (qualitative).
  • 100× traffic skew is the order of magnitude of hot-spot asymmetry the three-stage pipeline must absorb.
  • Four years of support-ticket mining drove the design brief.

Details and numbers explicitly not in this post but expected in the follow-up: Kafka consumer lag handling, GC pause management, reactive-streams stall debugging, hot-node mitigation, fleet-wide flow rate, graph size, query QPS, layer-specific cardinality, ASG sizing.

Caveats

  • Architecture-disclosure shape, not retrospective. "The technical details of building this at Netflix scale — handling Kafka lag, managing memory and garbage collection, optimizing distributed processing, debugging reactive streams — deserve their own discussion. We learned a lot, and we'll share those lessons in our next post." This source is best read as the conceptual map the future post will fill in.
  • No sampling rates / coverage envelopes named. Tracing layer is "sampled" but no sampling rate is given. IPC layer's instrumentation coverage across the fleet is named as a known blind spot but not quantified.
  • Multi-region semantics are described qualitatively. The post says "flow logs from Kafka across multiple AWS regions where Netflix operates" — but does not detail per-region graph partitioning, cross-region merge mechanics, or whether the unified view is per-region or global. This will likely be in the engineering-deep-dive follow-up.
  • The "graph database" is referenced but not architecturally decomposed in this post. A separate Netflix post — High-Throughput Graph Abstraction at Netflix Part I — is linked as the substrate description; this post takes that graph DB as given.
  • The columnar substrate for tracing is not named. The post says "columnar storage optimized for analytical queries" but does not say whether it is Iceberg, Druid, ClickHouse, an internal substrate, or some combination.
  • Health-status integration is described at the architectural layer only. "Stage 3 performs final aggregation with health status integration before graph persistence." The status-source system, freshness contract, and overlay-on-graph mechanics are not detailed.
  • Time-travel mechanism is described at one paragraph's depth. "Layer-specific aggregators that accumulate topology data across windows" names the technique class but not the window granularity, retention horizon, or query-side reconstruction cost model.
  • Authorship: Parth Jain (lead author + post author), Rakesh Sukumar, Yingwu Zhao, Renzo Sanchez-Silva, Nathan Fisher. Acknowledged: Observability team, graph database platform team, Platform Modernization Engineering, Live, and Ads teams as feedback / use-case providers.
