Netflix Service Topology

Service Topology is Netflix's real-time service-dependency graph of "thousands of microservices" spanning streaming, Live programming, and Ads-supported plans. It is a living map of runtime service relationships — "continuously updated based on actual traffic" — accessible from a UI for human troubleshooting and from a gRPC API for automated systems (resilience frameworks, blast-radius calculators, incident-response automation, tier-classification verifiers).

The novel architectural choice: three independent capture substrates produce three physically separate graphs, each queryable on its own or merged at query time when a unified view is requested (patterns/three-layer-graph-merge-on-query). Each layer compensates for the others' blind spots:

| Layer       | Substrate                                   | Strength                                        | Limitation                                |
|-------------|---------------------------------------------|-------------------------------------------------|-------------------------------------------|
| Network     | systems/netflix-flowexporter eBPF flow logs | Universal coverage: every service shows up      | No application context (endpoint / path)  |
| Application | IPC metrics (gRPC / GraphQL / REST)         | Endpoint, protocol, error rate, latency         | Only instrumented services                |
| Request     | Aggregated end-to-end traces                | Actual runtime call paths, request-level overlay | Sampled; may miss rare paths              |

(Source: sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map)

What it answers

The post enumerates the three questions every engineer asks as the canonical design brief, mined from four years of internal support requests:

  1. Which services depend on each other? "actual runtime connections based on real traffic", not configuration-file theoretical dependencies.
  2. What's the blast radius? Programmable, topology-aware blast radius via upstream graph traversal; "identify which teams to notify and what to monitor".
  3. Where's the source? "is my problem caused by an upstream issue, or am I the root cause that's cascading to others?"
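
The blast-radius question can be sketched as a breadth-first walk over reversed dependency edges. This is an illustrative sketch, not the Service Topology implementation: the `(caller, callee)` edge shape, the `blast_radius` function, and the service names are all assumptions made for the example.

```python
from collections import defaultdict, deque

def blast_radius(edges, service, max_hops=None):
    """Return {dependent: hop_count} for every service that transitively calls `service`.

    `edges` is an iterable of (caller, callee) pairs (an assumed shape).
    """
    # Reverse index: for each callee, who calls it?
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)

    hops = {service: 0}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        if max_hops is not None and hops[current] >= max_hops:
            continue  # do not expand beyond the requested depth
        for caller in callers[current]:
            if caller not in hops:
                hops[caller] = hops[current] + 1
                queue.append(caller)

    del hops[service]  # the service itself is not part of its own blast radius
    return hops

# Hypothetical topology for illustration only.
EDGES = [
    ("api-gateway", "playback"),
    ("playback", "license"),
    ("playback", "catalog"),
    ("ads", "catalog"),
    ("license", "keystore"),
]
impacted = blast_radius(EDGES, "catalog")
```

Here `impacted` maps `playback` and `ads` to one hop and `api-gateway` to two hops, which is the kind of list a team would use to decide whom to notify.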

The post opens with a 3 a.m. operating context ("engineer gets paged… one of our critical services is showing elevated error rates") that makes the cost of the prior state concrete: "having to mentally stitch together information from multiple tools is slow, error-prone, and stressful."

Architecture

Three independent capture substrates → three physically separate graphs

The post's load-bearing structural choice. Verbatim:

"Each source creates its own graph that is physically separate — the network layer in one graph database partition, the IPC layer in another partition, and the tracing layer using columnar storage optimized for analytical queries. This physical separation allows each layer to evolve independently and be queried in parallel."

Choosing physical separation over write-time fusion buys:

  • Independent evolution. New IPC instrumentation rolling out doesn't touch the eBPF graph; tracing-sampling changes don't touch the application graph.
  • Substrate fit per query shape. Graph DB for path traversal on the network and IPC layers; columnar storage for analytical trace queries.
  • Parallel query at merge time. "When users request a unified view, we execute traversal queries across all layers simultaneously and merge results, achieving sub-second response times even when combining all three layers."

This is the canonical patterns/three-layer-graph-merge-on-query instance on the wiki.
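
A minimal sketch of merge-on-query, assuming each layer can be asked independently for the edges leaving a service. The in-memory dictionaries, `query_layer`, and `unified_view` are hypothetical stand-ins; the real layers sit behind a graph database and a columnar store whose interfaces the post does not describe.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-layer stores: {(source, destination): attributes}.
NETWORK = {("a", "b"): {"bytes": 1200}, ("a", "c"): {"bytes": 90}}
IPC = {("a", "b"): {"protocol": "grpc", "error_rate": 0.02}}
TRACE = {("a", "b"): {"sampled_calls": 40}, ("b", "d"): {"sampled_calls": 3}}
LAYERS = {"network": NETWORK, "ipc": IPC, "trace": TRACE}

def query_layer(name, source):
    """Stand-in for a per-layer traversal: edges leaving `source` in one layer."""
    store = LAYERS[name]
    return name, {edge: attrs for edge, attrs in store.items() if edge[0] == source}

def unified_view(source, layers=("network", "ipc", "trace")):
    """Query the requested layers in parallel, then merge per edge at read time."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(layers)) as pool:
        for name, edges in pool.map(lambda n: query_layer(n, source), layers):
            for edge, attrs in edges.items():
                # Keep each layer's attributes under its own key so provenance survives the merge.
                merged.setdefault(edge, {})[name] = attrs
    return merged

view = unified_view("a")
```

In this sketch the edge `("a", "c")` appears only under the `network` key, which mirrors the table's point that the network layer sees services the other layers miss. Passing a single layer name yields a single-layer view without touching the other stores.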

Storage: per-layer specialised substrate

  • Network and IPC graphs ride on Netflix's graph database: "an abstraction layer built on top of our distributed key-value storage infrastructure… designed for high-throughput graph operations at our scale, with fast multi-hop traversal capabilities." Decomposed in the linked High-Throughput Graph Abstraction at Netflix Part I.
  • Tracing graph rides on columnar storage optimized for analytical queries (substrate not named in this post). Reflects the trace-shape access pattern: aggregate over many traces in a time window rather than hop-by-hop graph traversal.

Three-stage flow-log aggregation pipeline

The eBPF-flow path is the most architecturally detailed leg of the post. Three stages, all running on Apache Pekko Streams (Akka fork) over multi-region Kafka:

  1. Stage 1 — initial Kafka aggregation. Consume flow records as they arrive across all AWS regions Netflix operates in.
  2. Stage 2 — network-intermediary resolution. "Raw flow logs show two separate hops (App A → Load Balancer → App B), but the resolved graph stores the direct application-to-application relationship (App A → App B)." Stage 2 "identifies network intermediaries (load balancers, NAT gateways, API gateways, proxies) and combines their incoming and outgoing flows to reconstruct direct application-to-application paths." (concepts/network-intermediary-resolution, patterns/network-intermediary-flow-resolution)
  3. Stage 3 — final aggregation with health-status integration before graph persistence.

The graduated-stage structure has an emergent load-balancing property: "This graduated approach also prevents hot spots by distributing load across multiple points even when specific applications or network intermediaries see 100x more traffic than others." — i.e. a hot intermediary in Stage 2 doesn't pull all the work onto one node because Stages 1 and 3 are separately partitioned. (patterns/three-stage-flow-aggregation-pipeline)
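
The Stage 2 resolution step can be illustrated with a small sketch. It assumes flows are plain `(source, destination)` pairs and that the set of intermediaries is already known; it pairs every inbound source with every outbound destination of an intermediary, which over-approximates when one load balancer fronts several backends. How the real pipeline correlates the two hops is not described in the post, and `resolve_intermediaries` is a hypothetical name.

```python
from collections import defaultdict

def resolve_intermediaries(flows, intermediaries):
    """Collapse App -> intermediary -> App hops into direct App -> App edges.

    Assumptions of this sketch: single-hop intermediaries only; flows between
    two intermediaries are ignored.
    """
    inbound = defaultdict(set)   # intermediary -> apps that send to it
    outbound = defaultdict(set)  # intermediary -> apps it forwards to
    resolved = set()

    for src, dst in flows:
        src_is_hop = src in intermediaries
        dst_is_hop = dst in intermediaries
        if not src_is_hop and dst_is_hop:
            inbound[dst].add(src)
        elif src_is_hop and not dst_is_hop:
            outbound[src].add(dst)
        elif not src_is_hop and not dst_is_hop:
            resolved.add((src, dst))  # already a direct application edge

    # Combine incoming and outgoing flows of each intermediary.
    for hop in intermediaries:
        for src in inbound[hop]:
            for dst in outbound[hop]:
                resolved.add((src, dst))
    return resolved

raw_flows = [("app-a", "lb-1"), ("lb-1", "app-b"), ("app-c", "app-d")]
direct_edges = resolve_intermediaries(raw_flows, {"lb-1"})
```

With these inputs `direct_edges` contains `("app-a", "app-b")` and `("app-c", "app-d")`, and the load balancer no longer appears as a node.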

Distribution substrate: Apache Pekko Streams + Auto Scaling Groups

"We use Apache Pekko Streams (a fork of Akka) to process these flows in a distributed, fault-tolerant pipeline. The system automatically partitions work across our Auto Scaling Groups to handle the volume and provides natural backpressure handling."

The backpressure property is load-bearing: a flow-record stream at this volume cannot be naively pulled with unbounded buffering. Pekko Streams implements the Reactive Streams specification, so each stage signals demand back to the stage feeding it, and Stage N never pushes more than Stage N+1 has asked for.
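
The effect of backpressure can be shown without Pekko. The sketch below is plain Python, not Pekko Streams code: a bounded `queue.Queue` stands in for demand signalling, so the producer blocks whenever the consumer falls behind. `run_pipeline` and its parameters are hypothetical names chosen for this illustration.

```python
import queue
import threading

def run_pipeline(records, buffer_size=4):
    """Move records from a producer to a consumer through a bounded buffer.

    Returns (consumed_records, peak_buffer_depth).
    """
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking end of stream

    def producer():
        for record in records:
            buffer.put(record)  # blocks while the buffer is full: this is the backpressure
        buffer.put(done)

    thread = threading.Thread(target=producer)
    thread.start()

    consumed = []
    peak = 0
    while True:
        peak = max(peak, buffer.qsize())
        item = buffer.get()
        if item is done:
            break
        consumed.append(item)
    thread.join()
    return consumed, peak

consumed, peak = run_pipeline(range(1000), buffer_size=4)
```

However fast the producer runs, `peak` never exceeds `buffer_size`, so memory stays bounded. An unbounded queue would instead grow with the gap between producer and consumer speed.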

gRPC API

Single API surface; supports:

  • Multi-hop traversal (the load-bearing primitive for blast radius and dependency walks).
  • Filtering by availability tier (Tier 0, Tier 1, …) and business domain.
  • Pagination for large result sets.
  • Sub-second response times, including for unified-view queries spanning all three layers.

Programmatic consumers named in the post:

  • The Platform Modernization Engineering team uses the gRPC API "to verify that critical Live services have proper availability tier classifications throughout their dependency chains" — a policy-as-graph-traversal use case.
  • Resilience frameworks, blast-radius calculators, incident-response automation — named as intended consumers, not yet itemised.
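
The tier-verification use case can be sketched as a dependency walk with a policy check at each node. The policy encoded here (no dependency may have a numerically higher, i.e. weaker, tier than the root, and none may be unclassified) is an assumption about what "proper availability tier classifications" means. The dictionaries stand in for data that would come from the gRPC API, whose actual methods are not shown in the post.

```python
def tier_violations(deps, tiers, root):
    """Return [(service, tier)] for transitive dependencies of `root` that break the policy.

    deps:  {service: [services it calls]}   (assumed shape)
    tiers: {service: tier number, lower = more critical}   (assumed shape)
    """
    required = tiers[root]
    seen = {root}
    stack = [root]
    violations = []
    while stack:
        current = stack.pop()
        for dep in deps.get(current, ()):
            if dep in seen:
                continue
            seen.add(dep)
            stack.append(dep)
            tier = tiers.get(dep)  # None means the service has no tier classification
            if tier is None or tier > required:
                violations.append((dep, tier))
    return sorted(violations, key=lambda item: item[0])

# Hypothetical Live dependency chain.
DEPS = {"live-origin": ["encoder", "manifest"], "manifest": ["config-store"]}
TIERS = {"live-origin": 0, "encoder": 0, "manifest": 0, "config-store": 2}
found = tier_violations(DEPS, TIERS, "live-origin")
```

In this example `found` reports `config-store` at Tier 2 sitting beneath a Tier 0 service, which is the kind of mismatch such a verifier would flag for review.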

Time-travel via window-accumulating aggregators

"Instead of storing every time slice separately, we use layer-specific aggregators that accumulate topology data across windows, allowing us to reconstruct historical views efficiently without exploding storage costs."

The motivating use case: "understand what changed in dependencies around the time an issue started, or see how your service's dependency footprint has evolved over time." This is what makes temporal topology query a first-class capability rather than a forensic-log scan. (patterns/time-window-aggregator-for-temporal-graph)
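
One way to picture a window-accumulating aggregator is to store, per edge, the runs of consecutive windows in which it was observed rather than a full snapshot per window. The post does not describe the aggregator internals, so `EdgeWindowAggregator`, its interval representation, and the 60-second window are all assumptions of this sketch. It also assumes observations for a given edge arrive in time order.

```python
from collections import defaultdict

class EdgeWindowAggregator:
    """Accumulate edge sightings into [first_window, last_window] runs."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.intervals = defaultdict(list)  # edge -> list of [first_window, last_window]

    def observe(self, edge, timestamp):
        window = int(timestamp // self.window_seconds)
        runs = self.intervals[edge]
        if runs and runs[-1][1] >= window - 1:
            # Same or adjacent window: extend the current run instead of storing a new slice.
            runs[-1][1] = max(runs[-1][1], window)
        else:
            runs.append([window, window])

    def topology_at(self, timestamp):
        """Reconstruct the set of edges that were active in the window containing `timestamp`."""
        window = int(timestamp // self.window_seconds)
        return {
            edge
            for edge, runs in self.intervals.items()
            if any(first <= window <= last for first, last in runs)
        }

agg = EdgeWindowAggregator(window_seconds=60)
agg.observe(("a", "b"), 0)
agg.observe(("a", "b"), 70)
agg.observe(("a", "c"), 70)
agg.observe(("a", "b"), 600)
```

An edge seen continuously for a day costs one interval in this layout rather than one record per window, and comparing `topology_at` for two timestamps gives the dependency diff around an incident.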

Health-status overlay

Stage 3 of the aggregation pipeline integrates health status before graph persistence, so the graph itself carries the per-node / per-edge health state. Engineers can "see not just the topology, but which services in the call path are experiencing issues. This is integrated with health status tracking, so you can quickly identify if a problem you're seeing is actually originating somewhere else."
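
With health state attached to the graph, the "where's the source?" question becomes a walk along unhealthy dependencies. The sketch below uses a simple heuristic that is an assumption of this example, not the post's algorithm: a likely origin is an unhealthy service none of whose own dependencies are unhealthy. The `likely_origins` name and the data shapes are hypothetical.

```python
def likely_origins(deps, unhealthy, start):
    """Follow unhealthy dependencies from `start` and return the deepest unhealthy services.

    deps:      {service: [services it calls]}   (assumed shape)
    unhealthy: set of services currently flagged by the health overlay
    """
    seen = {start}
    stack = [start]
    origins = set()
    while stack:
        current = stack.pop()
        bad_deps = [dep for dep in deps.get(current, ()) if dep in unhealthy]
        if current in unhealthy and not bad_deps:
            origins.add(current)  # unhealthy, and nothing it calls is unhealthy
        for dep in bad_deps:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return origins

DEPS = {"playback": ["license", "catalog"], "license": ["keystore"]}
propagated = likely_origins(DEPS, {"playback", "license", "keystore"}, "playback")
local = likely_origins(DEPS, {"playback"}, "playback")
```

In the first call the failure traces down to `keystore`; in the second the paged service is its own origin, which distinguishes a local failure from one propagating through the call graph.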

What engineers can do

Verbatim use-case enumeration from the post:

  • Visualize Dependencies — upstream and downstream, with tier / domain filters; toggle unified vs single-layer views.
  • Jump to Detailed Signals — from any service, hop to logs, traces, and detailed metrics in their respective tools (the topology gives the right context starting point).
  • Understand Blast Radius: "before taking a service down for maintenance or making significant changes, see exactly what will be impacted."
  • Overlay Health Status.
  • Query Programmatically via the gRPC API.
  • Investigate Faster during incidents — "quickly identify if a failure is local or if it's propagating from somewhere else in the call graph."
  • Plan Changes Confidently.
  • Time Travel Through Topology: "query what the topology looked like at specific points in the past."

Why a "living map"

The post repeatedly contrasts Service Topology with architecture diagrams that go stale the moment they're published:

"It's a living map. It's not a static diagram drawn in a design document that goes out of date the moment it's published. It's continuously updated based on actual traffic."

Concretely:

  • New service → API call → edge appears with near-real-time freshness.
  • Service stops calling a dependency → edge fades from the graph.
  • Service deploys with changed behaviour → topology reflects it.
  • Incidents impact health → status overlay updates in real time.

The trust property is what makes the map operationally useful: "the map reflects reality, not someone's idea of what the architecture should be."

Roadmap (forward references in the post)

  • Change Event Overlay. "We're working to surface deployment events, configuration changes, and other mutations alongside the topology graph. Correlation becomes easier when you can see both the dependencies and what changed when." — pairs deployment events with the topology so the "what changed in my call path recently" question is answerable visually.
  • Richer Context. Endpoint-level details, protocol info, network path context — fleshing out the IPC and trace layers.
  • Automated root cause analysis. "An intelligent agent that continuously crawls the topology graph, correlates failures across dependencies, understands historical patterns, and surfaces likely root causes automatically. Service topology provides the knowledge graph foundation that makes this kind of intelligent automation possible." — positions the topology graph as the knowledge-graph substrate for an LLM/agent layer above it. Sibling positioning to other Netflix graph-substrate posts: UDA and the Model Lifecycle Graph.

Disambiguation: which "service topology"?

This system is named Service Topology at Netflix, but the phrase is overloaded across the wiki:

  • Netflix Service Topology (this page): a real-time service-dependency graph derived from runtime traffic. The relevant concept is concepts/service-dependency-graph.
  • Cloudflare's "service topology" (concepts/service-topology): a configuration abstraction answering "at which POPs is this service's IP address allowed to be advertised?". A different concept entirely; both pages cross-reference each other.

Operational numbers (disclosed in the source)

  • Sub-second multi-hop traversal, including across all three layers merged.
  • 100× is the order of magnitude of traffic asymmetry the three-stage pipeline must absorb ("specific applications or network intermediaries see 100x more traffic than others").
  • Four years of support-ticket pattern mining drove the design brief.

Details explicitly not in this post but expected in the follow-up: Kafka consumer lag handling envelope, GC pause mitigation strategies, reactive-streams stall debugging, hot-node mitigation, fleet-wide flow rate, graph size, query QPS, ASG sizing, sampling rates, tracing coverage envelope, multi-region graph-merge mechanics, columnar substrate identity.

Storage substrate (Graph Abstraction Part-I confirms)

The 2026-05-29 [Service Topology post] originally named "Netflix's graph database, an abstraction layer built on top of our distributed key-value storage infrastructure" as the storage substrate for the network and IPC layers, deferring substrate decomposition to a separate post. That post — Part-I of High-Throughput Graph Abstraction at Netflix, published the same day — is the dedicated decomposition. Confirmed details that compose with the Service Topology architecture:

  • Service Topology runs the network-flow graph and IPC graph as separate Graph Abstraction namespaces (each with its own KV-namespace storage layout: per-node-type + forward link + reverse link + edge property).
  • The trace layer rides a separate columnar substrate outside Graph Abstraction. The columnar substrate identity remains undisclosed.
  • Graph Abstraction's per-graph headline: ~10 M ops/sec across ~650 TB; single-digit-ms p99 edge persistence; sub-50ms p90 on 2-hop traversals. Service Topology's "sub-second multi-hop traversal" sits comfortably inside this envelope.
  • Service Topology is one of three named consumers of Graph Abstraction (alongside RDG and Netflix Gaming Social Graph).
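
The namespace layout named above (per-node-type, forward link, reverse link, edge property) can be pictured with a dictionary standing in for the key-value store. The key tuples, the `GraphNamespace` class, and its methods are illustrative assumptions; the actual key encoding belongs to Graph Abstraction and is not given on this page.

```python
class GraphNamespace:
    """Toy graph namespace over a dict that stands in for a KV store."""

    def __init__(self, name):
        self.name = name
        self.kv = {}

    def put_node(self, node_type, node_id, props):
        # Per-node-type record.
        self.kv[("node", node_type, node_id)] = dict(props)

    def put_edge(self, src, dst, props):
        # Forward link answers "what does src call?".
        self.kv.setdefault(("fwd", src), set()).add(dst)
        # Reverse link answers "who calls dst?" without scanning every edge.
        self.kv.setdefault(("rev", dst), set()).add(src)
        # Edge properties live under their own key.
        self.kv[("edge", src, dst)] = dict(props)

    def callees(self, src):
        return sorted(self.kv.get(("fwd", src), ()))

    def callers(self, dst):
        return sorted(self.kv.get(("rev", dst), ()))

# Separate namespaces for the two layers, as described above.
network = GraphNamespace("network-flow")
ipc = GraphNamespace("ipc")
network.put_edge("app-a", "app-b", {"bytes": 1200})
ipc.put_edge("app-a", "app-b", {"protocol": "grpc"})
ipc.put_edge("app-c", "app-b", {"protocol": "rest"})
```

Storing both link directions doubles the writes per edge but makes caller lookups, the basis of blast-radius queries, a single key read in this sketch.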

The Netflix Graph Abstraction page is the canonical home for the substrate decomposition.

Seen in

  • sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map — canonical wiki ingest. Describes the three-source-three-graph architecture, the three-stage Pekko Streams flow-log aggregation pipeline, the network-intermediary resolution at Stage 2, the health-status integration at Stage 3, the per-layer storage choice, the gRPC API, time-travel via window-accumulating aggregators, the four-year support-ticket design brief, and the forward-looking automated-RCA thesis.