CONCEPT Cited by 1 source

Temporal topology query

Temporal topology query is the ability to ask, of a service dependency graph, "what did this graph look like at time T?" — to time-travel the topology back to a specific past moment for incident investigation, change correlation, or evolution analysis.

The wiki's first canonical instance is Netflix's Service Topology, where time travel is named as a first-class capability:

"Query what the topology looked like at specific points in the past. Understand what changed in dependencies around the time an issue started, or see how your service's dependency footprint has evolved over time." (sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map)

Why time travel is hard on a real-time graph

A real-time topology graph mutates continuously as services deploy and traffic patterns shift. The naive way to support historical queries is to snapshot the entire graph at every time slice, but storage cost then scales with the product (graph size) × (slice frequency) × (retention horizon).
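To make the product concrete, here is a back-of-the-envelope sketch. The function name and every number in it are hypothetical and chosen only for illustration; none of them come from the Netflix post.

```python
def snapshot_storage_bytes(edges, bytes_per_edge, slices_per_day, retention_days):
    """Bytes needed if every time slice stores a full copy of the graph."""
    return edges * bytes_per_edge * slices_per_day * retention_days

# Hypothetical inputs (not Netflix's): 100k edges, 64 bytes per edge,
# one slice per minute, 90 days of retention.
total = snapshot_storage_bytes(
    edges=100_000,
    bytes_per_edge=64,
    slices_per_day=24 * 60,
    retention_days=90,
)
print(f"{total / 1e9:.1f} GB")  # prints "829.4 GB"
```

Doubling any one factor (finer slices, longer retention, a larger graph) doubles the total, which is why full snapshots become expensive quickly.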

For a Netflix-scale topology ("thousands of microservices") updated "as services deploy multiple times per day", snapshot-per-slice is unaffordable. The post calls this out explicitly:

"This time-travel capability is powered by time-window aggregation — instead of storing every time slice separately, we use layer-specific aggregators that accumulate topology data across windows, allowing us to reconstruct historical views efficiently without exploding storage costs." (patterns/time-window-aggregator-for-temporal-graph)

The structural choice: store accumulated aggregates per window, then reconstruct point-in-time views from those aggregates at query time. The storage saving plausibly rests on most edges being stable across consecutive windows, so that only changes need high-frequency recording; the post does not spell out the mechanism.
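A minimal sketch of the idea, assuming fixed-width windows and a simple per-edge observation count. The class name, the window width, and the "edge is present if it was seen in the window containing T" rule are all assumptions made for illustration; the post does not describe Netflix's actual aggregator or reconstruction logic.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # hypothetical window width


class WindowAggregator:
    """Accumulates edge observations per fixed window instead of per slice."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        # window start -> {(caller, callee): observation count}
        self.windows = defaultdict(lambda: defaultdict(int))

    def _window_start(self, ts):
        # Align an integer timestamp to the start of its window.
        return ts - ts % self.window_seconds

    def observe(self, ts, caller, callee):
        # Many raw observations collapse into one counter per edge per window.
        self.windows[self._window_start(ts)][(caller, callee)] += 1

    def edges_at(self, ts):
        # Reconstruct the graph shape from the window that contains ts.
        window = self.windows.get(self._window_start(ts), {})
        return set(window)


agg = WindowAggregator()
agg.observe(1000, "A", "B")
agg.observe(1010, "A", "B")   # same window, same edge: counter goes to 2
agg.observe(1400, "A", "C")   # falls into the next window

print(agg.edges_at(1100))  # edges seen in the window covering t=1100
print(agg.edges_at(1250))  # edges seen in the window covering t=1250
```

Storage here grows with the number of distinct edges per window rather than with the number of raw observations, at the cost of resolution finer than one window.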

Use cases the post enumerates

  • Change correlation. "Understand what changed in dependencies around the time an issue started." Pair the topology snapshot at T-incident with T-incident-minus-1h to see which edges changed.
  • Evolution analysis. "See how your service's dependency footprint has evolved over time." Capacity planning, refactor impact assessment, technical-debt visibility.
  • Postmortem grounding. Knowing the graph state during the incident — not the state today — is necessary for accurate postmortems.
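The change-correlation case reduces to a set difference between two point-in-time views. A minimal sketch, assuming each historical view is available as a set of (caller, callee) pairs; `diff_topology` and the service names are hypothetical.

```python
def diff_topology(before, after):
    """Compare two point-in-time edge sets of (caller, callee) pairs."""
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
    }


# Hypothetical views at T-incident-minus-1h and at T-incident.
view_before = {("checkout", "payments"), ("checkout", "inventory")}
view_at_incident = {("checkout", "payments"), ("checkout", "pricing")}

changes = diff_topology(view_before, view_at_incident)
print(changes)
```

Edges present in both views drop out, leaving only the dependencies that appeared or disappeared around the incident.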

The forward-looking automated-RCA case is the most ambitious: "an intelligent agent that continuously crawls the topology graph, correlates failures across dependencies, understands historical patterns, and surfaces likely root causes automatically." An agent that reasons over historical patterns needs cheap historical access, which is what temporal topology query provides.

What "time-travel" doesn't mean here

The capability is graph-shape time travel, not request-content time travel:

  • ✅ "Did App A call App B at time T?"
  • ✅ "What did App B's set of dependencies look like at time T?"
  • ❌ "Show me the actual flow records from the incident window" — that's a separate question answered by the underlying flow log / trace storage, not by the topology graph.

The topology-graph layer is derived; the raw substrate (eBPF flow logs, IPC metric series, traces) retains the fine-grained data with its own retention policy.

Layer-specific aggregator design

Netflix's framing is that each of the three layers has its own aggregator ("layer-specific aggregators that accumulate topology data across windows"):

  • Network layer (eBPF flows) — likely accumulates edge presence + flow volume per window.
  • IPC layer — likely accumulates per-edge endpoint set, error rate, latency distribution per window.
  • Tracing layer — already columnar/analytical; native time-window aggregation fits the substrate.
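The per-layer accumulators guessed at above could look roughly like the following. This is a sketch under assumptions: the class names, fields, and the choice to keep raw latency samples are hypothetical, and the post does not describe what each layer's aggregator actually stores.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkEdgeStats:
    """Per-edge, per-window accumulator for the network layer (assumed shape)."""
    flows: int = 0
    total_bytes: int = 0

    def add(self, nbytes):
        self.flows += 1
        self.total_bytes += nbytes


@dataclass
class IpcEdgeStats:
    """Per-edge, per-window accumulator for the IPC layer (assumed shape)."""
    endpoints: set = field(default_factory=set)
    calls: int = 0
    errors: int = 0
    latencies_ms: list = field(default_factory=list)

    def add(self, endpoint, latency_ms, ok):
        self.endpoints.add(endpoint)
        self.calls += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    @property
    def error_rate(self):
        return self.errors / self.calls if self.calls else 0.0


net = NetworkEdgeStats()
net.add(1500)
net.add(500)

ipc = IpcEdgeStats()
ipc.add("/play", 12.0, ok=True)
ipc.add("/play", 40.0, ok=False)

print(net.flows, net.total_bytes)
print(sorted(ipc.endpoints), ipc.error_rate)
```

A production aggregator would more likely keep a latency histogram or sketch than a raw list of samples; the list is used here only to keep the example short.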

The post does not specify window granularity, retention horizon, or the reconstruction algorithm; those details are deferred to the engineering deep-dive follow-up post.

Why this is more than just a logging concern

A flow-log archive can answer "what happened at time T?", but each query is a scan over raw records. Temporal topology query means the graph shape itself is queryable as of time T: graph traversal operators (multi-hop walks, neighborhood, blast-radius computation) work on historical views the same way they work on the live graph.
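As an illustration of a traversal operator running unchanged on a historical view, the sketch below computes a blast radius over an edge set. It assumes the as-of-T view has already been reconstructed as a set of (caller, callee) pairs; `blast_radius` and the service names are hypothetical and not part of any Netflix API.

```python
from collections import defaultdict, deque


def blast_radius(edges, failed_service):
    """Services that depend on failed_service, directly or transitively."""
    callers_of = defaultdict(set)
    for caller, callee in edges:
        callers_of[callee].add(caller)

    affected = set()
    queue = deque([failed_service])
    while queue:
        service = queue.popleft()
        for caller in callers_of[service]:
            # Skip services already visited, and the failed service itself
            # in case the graph contains a cycle back to it.
            if caller not in affected and caller != failed_service:
                affected.add(caller)
                queue.append(caller)
    return affected


# Hypothetical graph shape as of time T.
edges_at_t = {
    ("web", "api"),
    ("api", "db"),
    ("batch", "db"),
    ("web", "cdn"),
}

print(sorted(blast_radius(edges_at_t, "db")))  # ['api', 'batch', 'web']
```

The function takes only an edge set, so it gives the same kind of answer whether that set comes from the live graph or from a view reconstructed for a past moment.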

The substrate-level shift: the topology graph is the queryable artifact; the raw flow logs are the substrate that builds it. Time-travel preserves the graph-API surface across the time dimension.
