Telemetry-based resource discovery

Telemetry-based resource discovery is the technique of inferring runtime relationships between infrastructure resources from observability data (traces, service-mesh traffic, metric labels) rather than from (or in addition to) static inventory APIs.

Canonical substrate: Kubernetes. The cluster's static API gives you nodes of the graph (Pods / Deployments / Services / ConfigMaps / Ingress / NetworkPolicies) with metadata (labels, annotations, resource requests/limits). But the static API does not tell you which pods actually talk to which pods under load, which service handles a given request path end-to-end, or which deployment version is on the hot side of an incident. Those are edges, and edges are what matter for incident investigation.

Two discovery paths

Modern AI-for-ops agents (AWS DevOps Agent, Datadog Bits AI SRE) combine both:

Static path — resource-inventory scan.

  • Query the Kubernetes API for namespace-scoped resources.
  • Extract metadata: labels, annotations, resource specs, health checks, env vars.
  • Walk ownership references (Deployment → ReplicaSet → Pod, Service → Endpoints).
  • Produce the graph nodes + static relationships.
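The static path can be sketched as a pass over the inventory that emits nodes plus ownership edges. A minimal sketch with mocked objects — in practice you would page through the API with a client such as the official `kubernetes` Python package; the object shapes below only mirror the API's `metadata`/`ownerReferences` fields, and all names are hypothetical:

```python
def build_static_graph(objects):
    """Return (nodes, edges): one node per object, one edge per ownerReference."""
    nodes = {}
    edges = []
    for obj in objects:
        meta = obj["metadata"]
        key = (obj["kind"], meta["name"])
        nodes[key] = {"labels": meta.get("labels", {})}
        for ref in meta.get("ownerReferences", []):
            # Edge direction: owner -> owned (Deployment -> ReplicaSet -> Pod).
            edges.append(((ref["kind"], ref["name"]), key))
    return nodes, edges

# Hypothetical namespace contents illustrating the ownership chain.
inventory = [
    {"kind": "Deployment", "metadata": {"name": "checkout", "labels": {"app": "checkout"}}},
    {"kind": "ReplicaSet", "metadata": {"name": "checkout-7d9",
        "ownerReferences": [{"kind": "Deployment", "name": "checkout"}]}},
    {"kind": "Pod", "metadata": {"name": "checkout-7d9-x2k",
        "ownerReferences": [{"kind": "ReplicaSet", "name": "checkout-7d9"}]}},
]

nodes, edges = build_static_graph(inventory)
```

These are exactly the "nodes + static relationships" the path produces; no traffic data is involved yet.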

Telemetry path — OpenTelemetry analysis.

  • Service-mesh traffic — per-request source/destination pairs → pod-to-pod communication edges (weighted by request volume).
  • Distributed traces — cross-service spans stitched into request-flow chains → end-to-end request paths through microservices.
  • Metric attribution — metric labels bind a performance signal (latency / CPU / memory / error rate) to a specific pod, container, or node.
  • Produce the weighted edges + the runtime-relationship graph.
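The core of the first step is just aggregation: collapse per-request source/destination observations into edges weighted by request volume. A sketch with hypothetical records — the `src`/`dst` field names are illustrative, not an actual OTel or mesh schema:

```python
from collections import Counter

def weighted_edges(records):
    """Count (src, dst) pairs -> communication edges weighted by request volume."""
    return Counter((r["src"], r["dst"]) for r in records)

# Hypothetical per-request mesh observations.
traffic = [
    {"src": "frontend", "dst": "checkout"},
    {"src": "frontend", "dst": "checkout"},
    {"src": "checkout", "dst": "payments"},
]

edges = weighted_edges(traffic)
```

Trace stitching and metric attribution enrich these edges further, but the weighted pair-count is the skeleton of the runtime-relationship graph.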

Unification step. The two graphs fuse: nodes (from static) carry edges (from telemetry), annotated with recent events and performance data. This is what the agent reasons over.
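The unification step can be sketched as attaching telemetry-derived weighted edges and current metrics to the static node inventory. All names and numbers below are hypothetical:

```python
def fuse(static_nodes, telemetry_edges, metrics):
    """Build the fused graph: static nodes carrying telemetry edges and metrics."""
    graph = {name: {"meta": meta, "out": {}, "metrics": metrics.get(name, {})}
             for name, meta in static_nodes.items()}
    for (src, dst), weight in telemetry_edges.items():
        # Telemetry may name a node the inventory missed; keep it anyway.
        graph.setdefault(src, {"meta": {}, "out": {}, "metrics": {}})
        graph[src]["out"][dst] = weight
    return graph

static = {"frontend": {"labels": {"tier": "web"}},
          "checkout": {"labels": {"tier": "app"}}}
edges = {("frontend", "checkout"): 120}          # requests in the window
metrics = {"checkout": {"p99_ms": 130.0}}        # metric-attributed signal

fused = fuse(static, edges, metrics)
```

The resulting structure is what an agent queries: configuration context on the node, exercised edges out of it, and a performance signal bound to it.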

Why neither path alone is enough

  • Static-only misses: active communication patterns (two pods configured to talk may not actually talk in production), real request paths (a span graph may diverge from a static service-mesh policy graph), hot-spot attribution (which pod is currently slow).
  • Telemetry-only misses: non-trafficked resources (a pod that hasn't served traffic yet has no spans), configuration context (labels / annotations / owner references), dormant-but-relevant infrastructure (an ingress rule or network policy that blocks traffic — visible by absence).

The composition delivers investigation-grade context: "these are the nodes, these are the currently-exercised edges, and this one has a 3× latency anomaly against the last-hour baseline."
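The "3× latency anomaly against the last-hour baseline" check is a one-liner once the fused graph carries per-node percentiles. A sketch with hypothetical services and numbers:

```python
def latency_anomalies(live_p99, baseline_p99, factor=3.0):
    """Flag services whose live p99 latency exceeds factor x their baseline p99."""
    return [svc for svc, ms in live_p99.items()
            if svc in baseline_p99 and ms > factor * baseline_p99[svc]]

# Last-hour baseline vs. live window (milliseconds), hypothetical values.
baseline = {"checkout": 40.0, "payments": 25.0}
live = {"checkout": 130.0, "payments": 26.0}

anomalous = latency_anomalies(live, baseline)  # -> ["checkout"]
```

Services absent from the baseline are skipped here rather than flagged; that choice is exactly the "known-unknowns" trade-off the caveats below discuss.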

Contrast with service-map systems

Generic service maps (Datadog APM service map, AWS X-Ray service map) render telemetry-derived call graphs but don't fuse them with cluster inventory or drive an investigation agent. Discovery in the AI-for-ops sense adds intentional reasoning on top: the agent queries the fused graph as part of root-cause analysis rather than merely rendering it for a human.

Applications

  • Incident investigation. Fused graph = search space for blast-radius analysis ("which dependents of X would this affect?"), anomaly attribution, and timeline reconstruction.
  • Baseline learning. Agents record typical request patterns, latency percentiles, and dependency edges in a "normal" window, then compare live behavior against baseline to detect deviations.
  • Preventive analysis. Offline pass over past investigations can identify recurring structural weaknesses (missing health checks, under-scoped HPA config) because the fused graph makes the structural relationships inspectable.
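Blast-radius analysis over the fused graph reduces to a reachability walk along "who depends on X" edges. A minimal sketch, assuming a precomputed reverse-dependency map with hypothetical service names:

```python
from collections import deque

def blast_radius(dependents, root):
    """BFS over reverse-dependency edges: everything affected if `root` degrades."""
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen - {root}

# Hypothetical reverse edges: payments is called by checkout, checkout by frontend.
dependents = {"payments": ["checkout"], "checkout": ["frontend"]}

affected = blast_radius(dependents, "payments")  # -> {"checkout", "frontend"}
```

Weighting the walk by the telemetry edge weights (request volume) would rank the affected services by likely impact instead of treating them uniformly.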

Caveats

  • Baseline freshness matters. The underlying resource graph mutates as deployments happen; baselines must be recomputed often enough that an anomaly against a stale baseline isn't just a topology change.
  • OTel coverage gaps become blind spots. A service without OTel instrumentation exists as a graph node with no weighted edges — reasoning over the graph will undercount its role. Mandate instrumentation or accept known-unknowns.
  • Service-mesh analysis != traces. Layer-4 mesh observations and layer-7 trace spans tell different stories; cross-checking them is itself an error-detection mechanism (trace says A→B, mesh says no A→B traffic → instrumentation bug or sampled-away span).
  • Grey-failure compatibility. This methodology is better than static discovery at catching grey failures (which binary up/down health checks miss entirely), but only if the baseline resolution (percentiles, not means) and the anomaly threshold are tuned for subtle deviations.
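The mesh-vs-traces cross-check in the third caveat amounts to a set difference over the two edge sets. A sketch with hypothetical edges:

```python
def cross_check(trace_edges, mesh_edges):
    """Compare L7 trace-derived edges against L4 mesh-derived edges.

    trace_only: trace reports A->B but the mesh saw no such traffic ->
                possible instrumentation bug or sampled-away span.
    mesh_only:  mesh saw traffic the traces never captured ->
                likely an uninstrumented service.
    """
    return trace_edges - mesh_edges, mesh_edges - trace_edges

traces = {("frontend", "checkout"), ("checkout", "payments")}
mesh = {("frontend", "checkout")}

trace_only, mesh_only = cross_check(traces, mesh)
```

Either non-empty set is a finding in its own right, independent of any incident: the discrepancy locates a telemetry coverage gap before it becomes a blind spot during an investigation.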
