
Telemetry as discovery substrate

Definition

Telemetry as discovery substrate is the architectural bet that live observability data (metrics + logs + traces) is a sufficient input to discover the shape of a running system — services, dependencies, deployment topology, log formats, metric naming — without requiring any additional declarative input from users (no service catalogue, no annotations, no manual registration).

Canonical framing from Grafana Assistant's 2026-05-01 launch post (Source: sources/2026-05-01-grafana-how-grafana-assistant-learns-your-infrastructure-before-you-even-ask):

"Your existing telemetry data is the input. The assistant reads what's already in your Prometheus, Loki, and Tempo data sources and builds its understanding from there. If you have metrics, you get this infrastructure memory capability."

Why it's an architecturally distinct bet

The alternative approach — declarative service catalogue — treats the catalogue as the authoritative model of the infrastructure and the running system as the deployment of that model. Examples: Backstage's software catalog, Cortex XSOAR asset inventories, manually-curated service-registry systems.

Declarative catalogues have a structural failure mode: drift. The catalogue says payments-api depends on notifications, but six months ago the team migrated off notifications and payments-api now calls comms-v2. Nobody updated the catalogue. The source of truth diverges from reality.

Telemetry as discovery substrate inverts the trust hierarchy: the running system (as observed through emitted metrics, traces, and logs) is the source of truth. The memory/catalogue is derived, not maintained. Drift is bounded by refresh cadence instead of by human diligence.

What telemetry captures

| Signal | What it reveals |
| --- | --- |
| Metrics (Prometheus) | Services exist (as label sets on counters/histograms), deployment identities, resource shape (CPU, memory), traffic / latency / error / saturation golden signals |
| Traces (Tempo) | Service-to-service call graphs, dependency topology, request paths |
| Logs (Loki) | Log format per service (JSON / logfmt / unstructured), log labels, semantic field names |
| K8s metadata (scraped) | Namespace/cluster membership, replica counts, container details, scaling configuration |

Combined, the four signals provide enough surface area to reconstruct a five-category service knowledge schema per service group without any additional input.
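
A minimal sketch of what that discovery walk can look like, assuming a reachable Prometheus endpoint, a `service` label convention, and Tempo's metrics-generator service-graph metrics (`traces_service_graph_request_total`) written into the same Prometheus. The `PROM_URL` value and helper names are illustrative, not Grafana Assistant's implementation.

```python
# Illustrative sketch only -- not Grafana Assistant's implementation.
# Assumes: a reachable Prometheus at PROM_URL, a `service` label convention,
# and Tempo's metrics-generator service-graph metrics stored in Prometheus.
import requests

PROM_URL = "http://prometheus:9090"  # assumption: adjust to your environment


def discover_services() -> set[str]:
    """Services 'exist' as values of a label convention, e.g. service=checkout-api."""
    resp = requests.get(f"{PROM_URL}/api/v1/label/service/values", timeout=10)
    resp.raise_for_status()
    return set(resp.json()["data"])


def discover_dependencies() -> dict[str, set[str]]:
    """Caller -> callee edges inferred from trace-derived service-graph metrics."""
    query = "sum by (client, server) (traces_service_graph_request_total)"
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10
    )
    resp.raise_for_status()
    edges: dict[str, set[str]] = {}
    for sample in resp.json()["data"]["result"]:
        client = sample["metric"].get("client")
        server = sample["metric"].get("server")
        if client and server:
            edges.setdefault(client, set()).add(server)
    return edges
```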

Preconditions

This bet only works if every service worth asking about is already emitting metrics / logs / traces. That precondition is stronger than it sounds:

  • Instrumented services — any service worth observing is already observed. True for mature observability cultures, false for the long tail of legacy internal tools.
  • Consistent naming — services are identifiable through label conventions (e.g. service=checkout-api) or K8s metadata. Ad-hoc label hygiene breaks service-group inference.
  • Trace coverage for dependency inference — services without distributed tracing yield memories with empty dependency sections.

Un-instrumented services are invisible. The zero-config commitment therefore doubles as a gating condition: the bet works only to the extent that observability coverage is already universal.
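
One way to surface the trace-coverage gap, reusing the hypothetical helpers from the sketch above: diff the set of services visible in metrics against the set that appears anywhere in the trace-derived call graph.

```python
# Illustrative precondition check, reusing the hypothetical helpers above:
# a service that emits metrics but never appears in the trace-derived call
# graph will end up with an empty dependency section in the derived memory.
def services_without_trace_coverage() -> set[str]:
    services = discover_services()
    edges = discover_dependencies()
    traced = set(edges)
    for callees in edges.values():
        traced |= callees
    return services - traced
```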

Contrast with alternative substrates

| Substrate | Source of truth | Drift failure mode |
| --- | --- | --- |
| Declarative catalogue | Human-maintained register | Docs drift from reality |
| Code scan (systems/meta-ai-precompute-engine) | Source tree + config | Deployed version ≠ mainline |
| Telemetry (this page) | Live emitted signals | Un-instrumented services invisible |

Each substrate is correct for a different knowledge type: code scan is correct for "how does the code work"; telemetry is correct for "what is actually running, what's calling what, at what rate." For infrastructure memory specifically (concepts/agent-infrastructure-memory), telemetry is the substrate that matches the question shape.

Why this eliminates a class of catalogue-rot failures

Traditional service catalogues fail when:

  1. A new team deploys a service and forgets to register it.
  2. A service is decommissioned but not removed from the catalogue.
  3. A dependency is added/changed/removed between refreshes.
  4. A metric is renamed.

A telemetry-driven memory handles (1) and (3) automatically (new emissions are discovered on the next refresh) and (2) by absence (no emissions → not in memory). (4) is mitigated by weekly refresh + manual-trigger escape hatch for deliberate rollouts.
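
A sketch of why cases (1)-(3) fall out of the refresh for free, again using the hypothetical helpers from the earlier sketch: the memory is rebuilt from whatever is currently emitting, so nothing has to be registered or deregistered by hand. The weekly schedule and the manual trigger are just two ways of invoking the same rebuild.

```python
# Illustrative refresh: the memory is rederived from live telemetry, so
# newly emitting services appear (case 1), silent services drop out (case 2),
# and changed dependencies are picked up (case 3) without manual edits.
# Uses the hypothetical discover_* helpers from the earlier sketch.
def refresh_memory() -> dict[str, dict]:
    services = discover_services()
    deps = discover_dependencies()
    return {
        svc: {"dependencies": sorted(deps.get(svc, set()))}
        for svc in services  # services that stopped emitting simply fall out
    }
```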

Seen in

  • sources/2026-05-01-grafana-how-grafana-assistant-learns-your-infrastructure-before-you-even-ask — canonical first-wiki articulation of telemetry as discovery substrate, at the Grafana Assistant scale. The zero-configuration UX commitment ("no setup steps, no configuration files, no scheduled jobs to manage" / "If you have metrics, you get this infrastructure memory capability") is only possible because the substrate bet holds — customer telemetry is already there, the agent just has to walk it.