CONCEPT Cited by 1 source
Topology-aware blast radius
Topology-aware blast radius is the operationalisation of blast radius as a graph traversal over a service dependency graph rather than a memory-and-spreadsheet exercise. Instead of asking "what could break if I take this service down?" and answering it from human knowledge of the call graph, the system computes the answer programmatically by walking upstream from the target service in the topology graph.
The wiki's first canonical instance is Netflix's Service Topology, where blast-radius computation is named as one of the three canonical engineer questions the system answers:
"What's the blast radius? When something breaks or needs to go down for maintenance, what else will be affected? Which teams need to be notified?" (sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map)
Why "topology-aware" matters
The traditional shape of blast radius in Amazon-style cell-based architecture is spatial / structural — how many cells, regions, accounts, or shards a fault can reach. Topology-aware blast radius is a dependency-direction version of the same concept: how many upstream services depend on the service you're about to change?
The two are complementary:
- Cell-based blast radius answers "if cell N fails, how many customers are affected?" — by partitioning customers into cells and ensuring no fault in cell N can reach cell M.
- Topology-aware blast radius answers "if service S goes down, which other services break?" — by walking the dependency graph upstream from S.
A robust system needs both. Topology-aware blast radius is what service owners reason about when planning maintenance windows, deploys, or rollbacks; cell-based blast radius is what platform owners reason about when designing isolation boundaries.
What the graph traversal looks like
"Before taking a service down for maintenance or making significant changes, see exactly what will be impacted. Identify which teams to notify and what to monitor." (Source: same)
The traversal:
- Start node = the service being changed / taken down.
- Direction = upstream (callers of S, then callers of those, etc.).
- Depth = bounded by the question being asked. A maintenance window often only needs depth-1 (direct callers); a major architectural change might need transitive depth-N.
- Filter overlays — by availability tier (focus on Tier 0/1 first), by business domain, by ownership.
- Output = the set of impacted services + their owners (for notification routing) + their tiers (for severity weighting).
In Netflix Service Topology this runs over the gRPC API — pagination, multi-hop traversal, and sub-second response times are explicit API guarantees, which is what makes the traversal viable as a pre-deploy check.
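The traversal parameters listed above can be sketched as a bounded breadth-first walk over caller edges. This is a minimal in-memory illustration, not Netflix's implementation or its gRPC API: the `Service` record, the `callers` adjacency map, the function names, and the example service names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    owner: str
    tier: int  # 0 = most critical (assumed convention)


def blast_radius(callers, services, start, max_depth=None, max_tier=None):
    """Breadth-first walk over the callers of `start`.

    callers:   dict mapping a service name to the services that call it
    max_depth: 1 = direct callers only; None = full transitive closure
    max_tier:  report only services at this tier or more critical
    Returns {service name: hop distance from start}.
    """
    seen = {start}  # guards against cycles in the call graph
    impacted = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for caller in callers.get(node, ()):
            if caller in seen:
                continue
            seen.add(caller)
            queue.append((caller, depth + 1))
            # The tier filter limits what is reported, not what is traversed,
            # so a Tier 0 service behind a Tier 2 caller is still found.
            if max_tier is None or services[caller].tier <= max_tier:
                impacted[caller] = depth + 1
    return impacted


def owners_to_notify(impacted, services):
    """Deduplicated, sorted owners of the impacted services."""
    return sorted({services[name].owner for name in impacted})


# Hypothetical example graph: api calls catalog and search, which both call db.
services = {
    "db": Service("db", "storage-team", 0),
    "catalog": Service("catalog", "catalog-team", 1),
    "search": Service("search", "search-team", 2),
    "api": Service("api", "edge-team", 0),
}
callers = {"db": ["catalog", "search"], "catalog": ["api"], "search": ["api"]}

direct = blast_radius(callers, services, "db", max_depth=1)
full = blast_radius(callers, services, "db")
critical = blast_radius(callers, services, "db", max_tier=0)
```

Here `direct` holds only `catalog` and `search`, `full` adds `api` at distance 2, and `critical` keeps only `api`. Passing `full` to `owners_to_notify` yields the notification list.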
Programmatic, not just visual
The post distinguishes the engineer-facing UI from the API that automated systems use for blast-radius computation:
"Use our gRPC API to integrate topology information into automated systems. For example, our Platform Modernization Engineering team uses this to verify that critical Live services have proper availability tier classifications throughout their dependency chains."
This makes blast radius a machine-checkable property rather than a human-judgment call. Concrete consumers named in the post:
- Resilience frameworks — circuit breakers, retry policies, bulkheads. Knowing the dependency graph lets the framework reason about failure containment automatically.
- Blast-radius calculators — explicit pre-deploy / pre-change tooling.
- Incident-response automation — once a fault is detected, run the same caller-ward walk from the faulty service to enumerate cascading impact in real time.
- Tier-classification verifiers (Platform Modernization Engineering) — "verify that critical Live services have proper availability tier classifications throughout their dependency chains." A tier-policy assertion: if a Tier 0 service depends on a Tier 3 service, that's a misclassification or a policy violation. Computable from the graph.
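The tier-policy assertion in the last item can be sketched as a walk over dependency edges. The source does not state the exact rule the Platform Modernization Engineering team applies, so this sketch assumes one plausible policy: every transitive dependency of a critical service must be at least as critical as that service (lower tier number = more critical). The function name, data shapes, and service names are hypothetical.

```python
def tier_violations(dependencies, tiers, root):
    """Find dependencies of `root` that are less critical than `root`.

    dependencies: dict mapping a service name to the services it calls
    tiers:        dict mapping a service name to its tier (0 = most critical)
    Returns a list of (service, tier, path from root) tuples.
    """
    required = tiers[root]
    violations = []
    seen = {root}  # guards against cycles
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        for dep in dependencies.get(node, ()):
            if dep in seen:
                continue
            seen.add(dep)
            dep_path = path + [dep]
            if tiers[dep] > required:
                # Keep the path so the owner can see how the chain is reached.
                violations.append((dep, tiers[dep], dep_path))
            stack.append((dep, dep_path))
    return violations


# Hypothetical example: a Tier 0 service reaches a Tier 3 one via `session`.
dependencies = {
    "live-playback": ["session", "cdn-router"],
    "session": ["feature-flags"],
}
tiers = {"live-playback": 0, "session": 0, "cdn-router": 0, "feature-flags": 3}

found = tier_violations(dependencies, tiers, "live-playback")
```

In this example `found` contains a single entry for `feature-flags`, with the path `live-playback`, `session`, `feature-flags`. Whether such a finding means the dependency is misclassified or the policy is violated is left to the owning teams.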
Root-cause localisation: the inverse traversal
The same graph supports the opposite-direction traversal:
"Where's the source? Is my problem caused by an upstream issue, or am I the root cause that's cascading to others?"
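With a topology graph that also carries a health-status overlay (Stage 3 of Netflix's three-stage aggregation pipeline integrates health status), root-cause localisation becomes: walk from the alerting service toward the services it depends on (the direction the quote calls "upstream", which is the reverse of the caller-ward walk used for blast radius) and return the furthest unhealthy dependency on that path.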
Blast radius and root-cause localisation are dual operations on the same graph — same substrate, opposite direction.
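A minimal sketch of the inverse walk, moving from the alerting service toward the services it depends on. It assumes the health overlay is available as a plain set of unhealthy service names and follows only unhealthy dependencies, which is a simplification: a fault that passes through a service still reporting healthy would be missed. The function name and data shapes are hypothetical.

```python
def root_cause_candidates(dependencies, unhealthy, alerting):
    """Return unhealthy services reachable from `alerting` through a chain
    of unhealthy dependencies that have no unhealthy dependency themselves.

    dependencies: dict mapping a service name to the services it calls
    unhealthy:    set of service names currently marked unhealthy
    """
    if alerting not in unhealthy:
        return []
    candidates = []
    seen = {alerting}  # guards against cycles
    stack = [alerting]
    while stack:
        node = stack.pop()
        sick_deps = [d for d in dependencies.get(node, ()) if d in unhealthy]
        if not sick_deps:
            # Nothing this service depends on is unhealthy, so the fault
            # plausibly originates here.
            candidates.append(node)
        for dep in sick_deps:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return sorted(candidates)


# Hypothetical example: api calls catalog and search, which both call db.
dependencies = {"api": ["catalog", "search"], "catalog": ["db"], "search": ["db"]}

cascading = root_cause_candidates(dependencies, {"api", "catalog", "db"}, "api")
self_inflicted = root_cause_candidates(dependencies, {"api"}, "api")
```

In the first call the walk ends at `db`, so the alerting service is a victim of a dependency fault. In the second call no dependency is unhealthy, so `api` itself is the only candidate. If every unhealthy service on the chain sits in a dependency cycle, this sketch returns no candidate.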
Why this is harder without a real-time graph
The post describes the state before Service Topology:
"For an engineer at 3am, having to mentally stitch together information from multiple tools is slow, error-prone, and stressful."
Without a real-time graph:
- Stale information. Architecture diagrams in a wiki page are out of date; "yesterday's topology map is archaeology, not observability."
- Per-tool fragmentation. Metrics tools, tracing tools, log tools — all see fragments of the dependency graph; reasoning about blast radius means consolidating manually.
- Knowledge concentration. The most-tenured engineers can do it from memory; new engineers can't.
A topology-aware blast-radius computation removes this dependence on individual knowledge: anyone with API access gets the same answer.
Seen in
- sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map — canonical wiki source. "Identify which teams to notify and what to monitor" + Platform Modernization Engineering tier-verification use case.
Related
- systems/netflix-service-topology — canonical instance
- concepts/blast-radius — the broader concept
- concepts/service-dependency-graph — the substrate
- concepts/observability
- companies/netflix