CONCEPT Cited by 1 source
Service dependency graph¶
A service dependency graph is a runtime-derived map of which services in a distributed system actually call which other services, represented as a queryable graph (nodes = services, edges = call relationships). The defining property is that it is derived from observed traffic, not from configuration files or design documents — "actual runtime connections based on real traffic", in the framing of Netflix's Service Topology.
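As a minimal sketch of "derived from observed traffic", the snippet below folds a stream of observed calls into adjacency sets. The `CallRecord` shape, the `build_graph` helper, and the service names are hypothetical and only illustrate the nodes-and-edges representation; they are not Netflix's data model.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class CallRecord(NamedTuple):
    """One observed call (hypothetical record shape, for illustration only)."""
    caller: str
    callee: str


def build_graph(records: Iterable[CallRecord]) -> dict[str, set[str]]:
    """Return adjacency sets: service -> services it was observed calling."""
    graph: dict[str, set[str]] = defaultdict(set)
    for rec in records:
        graph[rec.caller].add(rec.callee)
        _ = graph[rec.callee]  # touch the key so leaf services also appear as nodes
    return dict(graph)


observed = [
    CallRecord("api-gateway", "playback"),
    CallRecord("playback", "licensing"),
    CallRecord("api-gateway", "playback"),  # repeated traffic maps to the same edge
]
graph = build_graph(observed)
```

Because edges come only from records that were actually seen, a call path that exists in a design document but never carries traffic does not appear in `graph`.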
Why "real-time" is in the contract¶
The post that canonicalises this concept on the wiki opens with the failure mode of not having a real-time graph:
"Dependency maps that are hours old are useless in dynamic environments where services deploy multiple times per day. We needed near real-time updates." (sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map)
The same source frames the difference as the living-map vs static-diagram dichotomy:
"It's a living map. It's not a static diagram drawn in a design document that goes out of date the moment it's published. It's continuously updated based on actual traffic."
The trust property is what makes the map operationally useful: "the map reflects reality, not someone's idea of what the architecture should be."
Why one graph isn't enough¶
The wiki's first canonical instance — systems/netflix-service-topology — explicitly rejects the single-source-of-truth model:
"No single source tells the complete story."
Three independent capture substrates produce three graphs in the Netflix design:
| Layer | Strength | Limitation |
|---|---|---|
| eBPF network flows | Universal coverage | No application context |
| IPC metrics (gRPC/GraphQL/REST) | Endpoint, protocol, latency, errors | Only instrumented services |
| End-to-end traces (sampled) | Actual runtime call paths | May miss rare paths |
This is canonicalised separately as concepts/multi-source-topology-fusion — the structural pattern of using multiple complementary capture substrates and merging at query time rather than write time (patterns/three-layer-graph-merge-on-query).
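A rough sketch of merging at query time rather than write time: each layer keeps its own edge map, and a query unions them while recording which layers saw each edge. The layer names, the `MergedEdge` type, and the attribute keys are assumptions made for illustration; the source does not describe the actual merge logic at this level of detail.

```python
from dataclasses import dataclass, field


@dataclass
class MergedEdge:
    caller: str
    callee: str
    sources: set[str] = field(default_factory=set)       # layers that observed the edge
    attrs: dict[str, object] = field(default_factory=dict)  # context contributed by layers


def merged_callees(service, layers):
    """Union the callees of `service` across layers, computed per query."""
    merged = {}
    for layer_name, edges in layers.items():
        for (caller, callee), attrs in edges.items():
            if caller != service:
                continue
            edge = merged.setdefault(callee, MergedEdge(caller, callee))
            edge.sources.add(layer_name)
            edge.attrs.update(attrs)
    return merged


# Hypothetical per-layer edge maps: {(caller, callee): attributes}
layers = {
    "ebpf": {("playback", "licensing"): {}, ("playback", "legacy-db"): {}},
    "ipc": {("playback", "licensing"): {"protocol": "gRPC"}},
    "traces": {("playback", "licensing"): {}},
}
merged = merged_callees("playback", layers)
```

In this example the `legacy-db` edge is visible only through the network-flow layer, while the `licensing` edge gains protocol detail from the IPC layer: each layer covers a blind spot of the others.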
What it answers (the canonical engineer questions)¶
Netflix mined four years of internal support tickets and surfaced a consistent pattern of dependency questions as the design brief:
- "What are my upstream and downstream dependencies?"
- "Is this failure in my service, or is something I depend on broken?"
- "Which services will be impacted if I take this down for maintenance?"
- "Why is this service showing as 'Unknown' in my metrics?"
- "What changed in my call path recently that could explain this behavior?"
A service dependency graph is the substrate that turns each of these questions into a bounded-depth graph traversal — replacing the "mental stitching together [of] information from multiple tools" the post calls out as the prior failure mode.
Three primary use cases for the graph¶
1. Dependency walks (upstream / downstream)¶
The minimum-viable use case. Given a service, enumerate its callers and callees to depth N. Filtering by availability tier and business domain matters: engineers debugging a Tier 0 issue don't want to walk into Tier 3 noise.
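The bounded-depth walk can be sketched as a breadth-first search over adjacency sets. The `walk` and `reverse` helpers, the tier encoding (integers, lower is more critical), and the sample services are all assumptions for illustration.

```python
from collections import deque


def reverse(graph):
    """Flip callee edges into caller edges."""
    rev = {}
    for src, dsts in graph.items():
        rev.setdefault(src, set())
        for dst in dsts:
            rev.setdefault(dst, set()).add(src)
    return rev


def walk(graph, start, max_depth, tiers=None, allowed_tiers=None):
    """Return {service: hop count} for services within max_depth hops of start."""
    tiers = tiers or {}
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # depth bound reached, do not expand further
        for nxt in graph.get(node, ()):
            if nxt in seen:
                continue
            if allowed_tiers is not None and tiers.get(nxt) not in allowed_tiers:
                continue  # filtered nodes are neither reported nor walked through
            seen[nxt] = depth + 1
            queue.append(nxt)
    del seen[start]
    return seen


callees = {"a": {"b", "c"}, "b": {"d"}, "c": set(), "d": {"e"}}
tiers = {"a": 0, "b": 0, "c": 3, "d": 1, "e": 0}

all_deps = walk(callees, "a", max_depth=2)
critical_deps = walk(callees, "a", max_depth=2, tiers=tiers, allowed_tiers={0, 1})
callers_of_d = walk(reverse(callees), "d", max_depth=5)
```

One design choice in this sketch is that a filtered-out node also prunes everything reachable only through it. A variant that hides the node but still walks through it would be equally reasonable, depending on what the filter is meant to express.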
2. Programmable blast radius¶
concepts/topology-aware-blast-radius: "Before taking a service down for maintenance or making significant changes, see exactly what will be impacted." This becomes a graph traversal from the target service out through its transitive callers (the services that depend on it), decorated with tier and ownership so that the result can "identify which teams to notify and what to monitor."
This use case requires the graph as a machine-readable substrate (API surface, not just UI) so the answer can be computed programmatically by resilience frameworks and incident-response automation.
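A programmatic blast-radius query might look like the sketch below. It assumes that "impacted" means every transitive caller of the target, and that ownership and tier are available as simple lookup tables; `blast_radius`, the team names, and the tier numbers are hypothetical.

```python
def blast_radius(callers, target, owners, tiers):
    """Group the transitive callers of `target` by owning team, most critical first."""
    impacted = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for caller in callers.get(node, ()):
            if caller != target and caller not in impacted:
                impacted.add(caller)
                stack.append(caller)

    by_team = {}
    # Sort by (tier, name) so each team's list starts with its most critical service.
    for svc in sorted(impacted, key=lambda s: (tiers.get(s, 99), s)):
        by_team.setdefault(owners.get(svc, "unknown"), []).append(svc)
    return by_team


# Hypothetical caller edges: service -> services that call it
callers = {
    "licensing": {"playback"},
    "playback": {"api-gateway", "downloads"},
}
owners = {"playback": "team-play", "api-gateway": "team-edge", "downloads": "team-play"}
tiers = {"playback": 0, "api-gateway": 0, "downloads": 2}

notify = blast_radius(callers, "licensing", owners, tiers)
```

Returning plain data rather than a rendered view is what lets a resilience framework or incident-response automation consume the answer directly.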
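3. Root-cause localisation¶
"Is my problem caused by an upstream issue, or am I the root cause that's cascading to others?" This becomes a traversal from the symptomatic service through its upstream subgraph (the services it depends on, directly or transitively) with a health-status overlay: follow the unhealthy dependencies and find the one furthest along the call chain, which is the most likely origin.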
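One way to sketch this localisation, assuming health is a simple set of unhealthy service names: follow only unhealthy dependencies from the symptomatic service, and report the unhealthy services that have no unhealthy dependency of their own. The function name and sample topology are hypothetical, and a real system would likely weigh more signals than a boolean health flag.

```python
def root_cause_candidates(callees, start, unhealthy):
    """Return unhealthy services reachable from `start` with no unhealthy dependency."""
    candidates = set()
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        bad_deps = [d for d in callees.get(node, ()) if d in unhealthy]
        if not bad_deps and node in unhealthy:
            candidates.add(node)  # nothing unhealthy beneath it: likely origin
        for dep in bad_deps:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return candidates


# Hypothetical callee edges: service -> services it calls
callees = {
    "api-gateway": {"playback", "search"},
    "playback": {"licensing"},
    "licensing": {"keystore"},
    "search": set(),
}
unhealthy = {"api-gateway", "playback", "licensing"}

origin = root_cause_candidates(callees, "api-gateway", unhealthy)
```

If the result is the starting service itself, the service has no unhealthy dependencies and is its own best root-cause candidate; an empty result means nothing unhealthy was found along its dependency paths.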
What overlay information makes the graph operational¶
A bare dependency graph is necessary but not sufficient. The Netflix post enumerates the overlays that turn the topology into a troubleshooting-grade artifact:
- Health status: at-a-glance per-node health; the canonical use case is "quickly identify if a problem you're seeing is actually originating somewhere else."
- Availability tier: Tier 0 / Tier 1 / etc., used as both a filter and a decoration.
- Business domain: for filtering and ownership routing.
- Endpoint / protocol detail (from the IPC layer).
- Individual traces overlaid on the topology: "engineers can both see the aggregated pattern and drill into individual traces", which bridges the aggregated view back to single-request debugging.
Time-travel as a first-class capability¶
A real-time graph naturally raises the question: what did this graph look like before the incident? Netflix names this temporal topology query: the ability to "query what the topology looked like at specific points in the past." It is implemented via time-window-accumulating aggregators, so historical reconstruction doesn't require per-slice storage (patterns/time-window-aggregator-for-temporal-graph).
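The source does not detail how the aggregators work, so the following is only one plausible sketch of the idea: instead of storing a full snapshot per time slice, each edge accumulates the ranges of time windows in which it was observed, and a point-in-time query checks those ranges. The `TemporalEdges` class, the 60-second window, and the requirement that observations arrive in time order per edge are all assumptions.

```python
class TemporalEdges:
    """Per-edge activity ranges, measured in fixed-size time windows."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.spans = {}  # (caller, callee) -> list of [first_window, last_window]

    def observe(self, caller, callee, ts):
        """Record an observation; assumes non-decreasing timestamps per edge."""
        bucket = int(ts // self.window)
        spans = self.spans.setdefault((caller, callee), [])
        if spans and bucket <= spans[-1][1] + 1:
            spans[-1][1] = max(spans[-1][1], bucket)  # extend the current range
        else:
            spans.append([bucket, bucket])  # gap in traffic: start a new range

    def edges_at(self, ts):
        """Return the edges that were active in the window containing `ts`."""
        bucket = int(ts // self.window)
        return {
            edge
            for edge, spans in self.spans.items()
            if any(lo <= bucket <= hi for lo, hi in spans)
        }


history = TemporalEdges(window_seconds=60)
history.observe("a", "b", 0)
history.observe("a", "b", 70)    # adjacent window, extends the first range
history.observe("a", "c", 130)
history.observe("a", "b", 400)   # after a gap, opens a second range

before = history.edges_at(30)
during = history.edges_at(130)
quiet = history.edges_at(200)
```

Storage in this sketch grows with the number of times an edge appears or disappears rather than with the number of time slices retained, which is one way to read the "no per-slice storage" property.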
Distinguishing from related concepts¶
- vs concepts/service-topology (Cloudflare framing) — the Cloudflare service topology concept is a configuration abstraction about "at which POPs is this service's IP allowed to be advertised?". Same phrase, very different concept. Netflix's Service Topology is a service dependency graph — see systems/netflix-service-topology for the disambiguation note.
- vs static architecture diagrams — the load-bearing difference is "the map reflects reality, not someone's idea of what the architecture should be." Static diagrams encode intent; dependency graphs encode behaviour.
- vs single-source dependency maps (e.g. tracing-only or metrics-only) — the multi-source insight is that any single capture substrate has a structural blind spot.
- vs deployment / change graphs — change-event overlay is Netflix's roadmap item to pair the dependency graph with deployment data; the two are complementary, not the same graph.
As a knowledge-graph foundation¶
Netflix's forward-looking framing positions the dependency graph as the knowledge-graph substrate for automated reasoning:
"Imagine an intelligent agent that continuously crawls the topology graph, correlates failures across dependencies, understands historical patterns, and surfaces likely root causes automatically. Service topology provides the knowledge graph foundation that makes this kind of intelligent automation possible."
This connects to the broader Netflix arc of graph-shaped substrates as programmable foundations — see Netflix UDA (knowledge graph as semantic substrate) and Model Lifecycle Graph (graph as ML-metadata substrate).
Seen in¶
- sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map: canonical wiki source. Netflix Service Topology is the wiki's first instance of a real-time service dependency graph; the post enumerates the four-year support-ticket design brief, the three-source, three-graph architecture, and the knowledge-graph framing.
Related¶
- systems/netflix-service-topology — canonical instance
- concepts/multi-source-topology-fusion — the why-three-graphs framing
- concepts/network-intermediary-resolution — what Stage 2 of the aggregation pipeline does to make edges application-to-application
- concepts/temporal-topology-query — time travel
- concepts/topology-aware-blast-radius — programmable blast radius via graph traversal
- concepts/observability — broader containing discipline
- concepts/blast-radius
- concepts/service-topology — the other "service topology" (Cloudflare BGP / anycast); disambiguation
- concepts/knowledge-graph — the substrate-class framing
- patterns/three-layer-graph-merge-on-query
- companies/netflix