
PATTERN

Telemetry-to-RAG pipeline

Intent

Build a streaming pipeline that continuously ingests operational telemetry (logs, events, metrics, traces) into a vector store so that an LLM-driven investigation agent can retrieve semantically similar past-incident signal and inject it into prompts at query time. This turns historical operational data into the retrieval corpus of a troubleshooting loop: "Retrieval-Augmented Generation over telemetry".

Context

Classic RAG indexes product documentation or knowledge articles and answers natural-language questions over them. For operational questions ("why is my pod stuck in pending?", "why did checkout error-rate spike at 14:32?"), the useful retrieval corpus is not documentation — it's the team's own telemetry. Past kubelet logs, prior events, resolved incidents, application logs, metric anomalies all contain relevant signal.

Unlike documentation, telemetry:

  • streams in continuously — the pipeline must be always-on, not batch,
  • has vastly higher volume — cost engineering matters per layer,
  • can contain sensitive data — logs regularly leak PII, tokens, secrets; sanitization is not optional,
  • is latency-sensitive for ingest — recent events must be retrievable during incidents that are still in progress.

Canonical wiki reference: AWS's conversational-observability blueprint (sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications).

Mechanism

Typical AWS shape (other cloud shapes substitute equivalents):

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Fluent Bit   │───▶│ Kinesis Data │───▶│ Lambda       │
│ DaemonSet    │    │ Streams      │    │ normalize +  │
│ in cluster   │    │ (buffer)     │    │ embed (batch)│
└──────────────┘    └──────────────┘    └──────┬───────┘
                                        ┌──────────────┐
                                        │ Bedrock      │
                                        │ Titan Embed  │
                                        │ text v2      │
                                        └──────┬───────┘
                                ┌────────────────────────┐
                                │ Vector store:          │
                                │ OpenSearch Serverless  │
                                │ (hot, RAM-backed)      │
                                │ OR S3 Vectors          │
                                │ (cold, cost-optimized) │
                                └────────────────────────┘

Step-by-step:

  1. Collect telemetry in-cluster with a lightweight forwarder. Fluent Bit DaemonSet taps app logs, kubelet logs, and Kubernetes events. Low per-pod overhead; aggregates locally; forwards upstream.
  2. Buffer via a streaming substrate — Kinesis Data Streams decouples ingest spikes from embedding-compute capacity and provides durability during downstream outages.
  3. Normalize in a stateless compute tier — Lambda consumes Kinesis records, parses log lines / event records into a canonical shape, and sanitizes sensitive fields before embedding (stripping secrets, masking tokens, dropping PII fields).
  4. Embed in batches — the same Lambda calls Bedrock's embedding endpoint (Titan Embeddings v2 in the canonical reference) on a batch of normalized events. Explicit guidance from AWS: "for better performance and cost-efficiency, your Lambda functions should use batching when ingesting data from Kinesis, generating embeddings, and storing them in OpenSearch." Batching is the cost lever that makes this pipeline economical at scale.
  5. Store vectors with metadata — the Lambda writes {vector, metadata} to OpenSearch Serverless (k-NN plugin) for hot, low-latency retrieval, or to S3 Vectors for cold, cost-optimized retrieval. Metadata includes timestamp, namespace, pod, and severity, which serve as filter predicates at query time.
  6. Retrieve at query time — agent query → embed (same model) → k-NN search (with optional filters) → top-k telemetry snippets → augmented prompt → LLM.
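Step 3 above (normalize + sanitize) can be sketched as pure functions. This is a minimal sketch, not the reference implementation: the record field names (`time`, `kubernetes.namespace_name`, `log`, …) assume Fluent Bit's Kubernetes-enriched JSON shape, and the secret patterns are illustrative, not exhaustive.

```python
import re

# Patterns for common leaked credentials (illustrative, not exhaustive).
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                # AWS access key IDs
    re.compile(r"(?i)bearer\s+[a-z0-9._\-]+"),      # bearer tokens
    re.compile(r"eyJ[\w\-]+\.[\w\-]+\.[\w\-]+"),    # JWTs
]

def sanitize(message: str) -> str:
    """Mask secret-looking substrings before the text is ever embedded."""
    for pattern in SECRET_PATTERNS:
        message = pattern.sub("[REDACTED]", message)
    return message

def normalize(record: dict) -> dict:
    """Map a raw Kinesis log record into the canonical shape stored
    alongside each vector (timestamp, namespace, pod, severity, text)."""
    k8s = record.get("kubernetes", {})
    return {
        "timestamp": record["time"],
        "namespace": k8s.get("namespace_name", "unknown"),
        "pod": k8s.get("pod_name", "unknown"),
        "severity": record.get("level", "info"),
        "text": sanitize(record["log"]),
    }

raw = {
    "time": "2025-01-01T14:32:00Z",
    "level": "error",
    "kubernetes": {"namespace_name": "checkout", "pod_name": "checkout-7f9"},
    "log": "payment call failed, key=AKIAABCDEFGHIJKLMNOP",
}
print(normalize(raw)["text"])  # → payment call failed, key=[REDACTED]
```

Because the sanitization boundary sits here, anything these patterns miss is embedded permanently — which is why the design-decision section below treats this step as non-negotiable.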

Design decisions and trade-offs

  • Embedding model choice. Dimensionality / cost / quality trilemma. The canonical reference uses Titan Embeddings v2 (amazon.titan-embed-text-v2:0); the Strands variant uses 1024-dim embeddings optimized for S3 Vectors. Domain-adapted embeddings would likely beat general-purpose but are not attempted in the reference architecture.
  • Hot vs cold vector store. concepts/hybrid-vector-tiering applies: OpenSearch Serverless is RAM-heavy and fast; S3 Vectors is cheap and cold. A production deployment may want patterns/cold-to-hot-vector-tiering — recent weeks in OpenSearch, older history in S3 Vectors.
  • Sanitization is not optional. Embeddings inherit the governance posture of their source data. A log line with an accidentally-logged API key, embedded, is a secret-in-vector-store problem that later retrieval leaks into prompts. The concepts/sensitive-data-exposure boundary sits at the Lambda normalize step — once embedded, the cost of recall is high.
  • Batch size vs freshness. Larger Lambda batches save embedding cost but delay when a given event becomes retrievable. Active incidents want small batches; steady-state ingest wants large batches. Tune per workload.
  • Schema drift. Telemetry shape changes — new log formats, new event kinds. Normalization logic drifts or breaks silently, degrading retrieval quality without obvious signal. Requires monitoring of its own.
  • Retention / cost ceiling. Vectors are not free per-record. A year of pod logs embedded is a large bill; deliberate retention / re-embedding cadence is needed.
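The retention / cost ceiling is easy to make concrete with back-of-envelope arithmetic. A sketch only: the ingest rate and metadata size are assumed numbers, the 1024-dim float32 vector follows the Titan Embeddings v2 configuration mentioned above, and real OpenSearch Serverless cost is dominated by compute units rather than raw storage.

```python
# Back-of-envelope: raw vector storage for a year of embedded pod logs.
events_per_day = 10_000_000          # assumed post-filtering ingest rate
dims = 1024                          # Titan Embeddings v2 at 1024 dimensions
bytes_per_vector = dims * 4          # float32
metadata_bytes = 512                 # timestamp, namespace, pod, severity, snippet

daily_gb = events_per_day * (bytes_per_vector + metadata_bytes) / 1e9
yearly_gb = daily_gb * 365
print(f"{daily_gb:.1f} GB/day, {yearly_gb:,.0f} GB/year")
# → 46.1 GB/day, 16,819 GB/year
```

Even at these assumed rates, a year of history is tens of terabytes of derived artifacts — which is what motivates the cold S3 Vectors tier and a deliberate re-embedding cadence.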

Retrieval patterns that ride on top

  • Plain k-NN — embed the user query, return the top-k closest vectors by cosine/Euclidean distance.
  • Metadata-filtered k-NN — restrict search to a namespace, severity, or time window via metadata predicates. Often critical — an "error" query for my-service shouldn't retrieve errors from every other service.
  • Hybrid retrieval — combine k-NN with a lexical search over log contents for queries that include specific identifiers (trace IDs, pod names, error codes); see concepts/hybrid-retrieval-bm25-vectors.
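The metadata-filtered variant above maps to a single request body against the OpenSearch k-NN plugin. A sketch under assumptions: the index field names (`embedding`, `namespace`, `timestamp`) are illustrative, the filter-inside-knn syntax assumes OpenSearch 2.x efficient filtering, and `query_vector` stands in for the embedded user query.

```python
query_vector = [0.0] * 1024  # stand-in for the embedded user query

# Top-5 nearest telemetry snippets, restricted to one service
# and the last 24 hours via metadata predicates.
knn_query = {
    "size": 5,
    "query": {
        "knn": {
            "embedding": {
                "vector": query_vector,
                "k": 5,
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"namespace": "checkout"}},
                            {"range": {"timestamp": {"gte": "now-24h"}}},
                        ]
                    }
                },
            }
        }
    },
}
print(len(knn_query["query"]["knn"]["embedding"]["vector"]))  # → 1024
```

Without the filter block this degenerates to plain k-NN; with an added lexical clause over the log text it becomes the hybrid variant.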

When to use

  • MTTR reduction via natural-language investigation is a stated operational goal.
  • Telemetry is high-volume and multi-source enough that dashboards alone aren't cutting it.
  • Engineers repeatedly re-ask the same question across incidents — RAG memoizes the retrieval step.
  • Cross-team operational knowledge sharing is a pain point — embedded past-incident notes become reusable signal.

When not to use

  • Low-volume systems where humans can grep logs directly — the pipeline is overkill.
  • Hard-SLA real-time control loops — retrieval + LLM latency is seconds, not milliseconds; this is for investigation, not control.
  • Teams without the budget to maintain embedding cost — the pipeline's running cost scales with telemetry, and telemetry only grows.
  • Regulated data where embedding would violate data-locality constraints — vectors are derived artifacts, but still subject to sovereignty / residency rules in many jurisdictions.

Relationship to other patterns

Seen in
