Skip to content

PATTERN Cited by 1 source

Time-window aggregator for temporal graph

Time-window aggregator for temporal graph is the pattern of storing time-windowed accumulated aggregates of a continuously- mutating graph instead of per-time-slice snapshots, so that historical point-in-time views can be reconstructed at query time without paying linear-in-time storage cost.

The wiki's canonical instance is Netflix's Service Topology, which uses this pattern to make time-travel topology query affordable on a graph of "thousands of microservices" updated "as services deploy multiple times per day."

The cost problem this solves

The naive approach to historical graph queries — "snapshot the whole graph every minute and keep every snapshot" — has cost:

Storage = (graph size) × (slice frequency) × (retention horizon)

For a Netflix-scale topology graph at minute granularity over a year of retention:

~10⁵ edges × 525,600 minutes ≈ 5×10¹⁰ edge-snapshots

Even with compression, this is prohibitively large, and most of it is redundant — the vast majority of edges are stable across consecutive windows.

The Netflix post calls out this exact cost framing:

"This time-travel capability is powered by time-window aggregation — instead of storing every time slice separately, we use layer-specific aggregators that accumulate topology data across windows, allowing us to reconstruct historical views efficiently without exploding storage costs." (sources/2026-05-29-netflix-from-silos-to-service-topology-why-netflix-built-a-real-time-service-map)

Pattern shape

Real-time graph mutations
┌──────────────────────────────────────────────┐
│  Per-window aggregator                       │
│  - For each time window W:                   │
│    - Accumulate edge presence across W       │
│    - Accumulate per-edge aggregates          │
│      (count, error rate, latency, …)         │
│    - Record edge-state changes               │
└──────────────────────────────────────────────┘
Window-keyed store (one record per window per (edge, layer))
Historical view reconstruction:
  Given target time T, fetch the windows covering T,
  apply accumulated state forward from a baseline,
  return the reconstructed graph at T.

The key insight: accumulated state per window + baseline + deltas = much cheaper than full snapshots, while still supporting point-in-time reconstruction.

When to apply

  • The graph being archived is highly correlated across time — most edges persist for many windows; only a small fraction change per window.
  • Approximate point-in-time accuracy is acceptable — the reconstructed view at time T is accurate to within window granularity W, not microsecond-precise.
  • The query workload is historical inspection, not high-frequency time-series rollback (e.g. "what did the graph look like 6 hours before the incident?" rather than "give me every state every 10 ms").

When NOT to apply

  • The graph mutates so rapidly that most edges change per window — there's no compression benefit over per-slice snapshots.
  • Sub-window precision is required (e.g. forensic analysis at millisecond granularity).
  • Per-edge auditing (every state change must be recoverable exactly, not approximately).

Mechanics

Window granularity choice

The window granularity W is the resolution of historical queries and a fundamental trade-off:

  • Shorter W → finer historical resolution, more storage.
  • Longer W → coarser historical resolution, less storage.

Typical choices: 1 minute / 5 minutes / 1 hour, with multi-tier windows (recent at 1-min granularity, older at 1-hour) common.

Netflix doesn't disclose the window granularity in this post.

Per-window accumulation

For each window W, the aggregator accumulates:

  • Edge presence — did this edge appear in W?
  • Per-edge aggregates — flow count, byte volume, error rate, latency distribution, etc.
  • Edge-state changes (delta from prior window) — new edges, disappeared edges, edges with significant property change.

Layer-specific aggregators

Netflix's framing names this explicitly: "layer-specific aggregators that accumulate topology data across windows." Each layer (network, IPC, tracing) has its own aggregator tuned for its data shape:

  • Network layer (eBPF) — likely accumulates edge presence + flow volume per window.
  • IPC layer — likely accumulates per-edge endpoint set, error rate, latency distribution per window.
  • Tracing layer — already columnar/analytical; native time-window aggregation fits the substrate naturally.

The per-layer choice mirrors the per-layer storage choice: each layer's aggregator targets its own access pattern.

Reconstruction at query time

For a query "give me the topology at time T":

  1. Find the window W containing T.
  2. Fetch W's accumulated state.
  3. (Optional) Combine with a baseline + N preceding window deltas if reconstruction needs to be more precise than W-granularity.
  4. Return the reconstructed graph.

The reconstruction cost is O(graph size at T) + O(window deltas) — much cheaper than scanning per-slice snapshots over the retention horizon.

Generalisation beyond service topology

The pattern applies to any highly-correlated time-evolving graph where historical inspection is needed:

  • Social graphs — friendship edges change slowly relative to message events.
  • Permission graphs — IAM permissions evolve much more slowly than the requests that flow through them.
  • Knowledge graphs — facts in a knowledge graph evolve slowly relative to queries.
  • Asset-inventory graphs — cloud assets persist for hours/days in a typical fleet.

The shared property: the graph state at time T is more efficiently encoded as a baseline + N deltas than as a snapshot.

Sibling patterns

Seen in

Last updated · 542 distilled / 1,571 read