
AWS 2025-12-11


Architecting conversational observability for cloud applications

AWS Architecture Blog reference-architecture post (2025-12-11) for a generative-AI-powered Kubernetes troubleshooting assistant. Companion piece to the later 2026-03-18 AWS DevOps Agent product launch — this one is a build-it-yourself blueprint with two alternate deployment topologies and a reference implementation in aws-samples/sample-eks-troubleshooting-rag-chatbot, where the 2026-03-18 post is the AWS-managed-service shape of the same idea. The two posts together pin the canonical wiki shape of AI-augmented observability on EKS: same problem, same signals, two vendor relationships.

Summary

Modern microservice applications on EKS / ECS / Lambda scatter telemetry across layers; engineers stitch logs, events, metrics, and live cluster state together manually during incidents, which drives up MTTR. The post builds a chatbot assistant that combines historical telemetry retrieved via vector search with real-time cluster state via allowlisted read-only kubectl, and iterates in an LLM ↔ cluster loop until it has enough context to propose a resolution. Two architectures are presented: (1) a RAG-based chatbot (Fluent Bit → Kinesis Data Streams → Lambda calling Titan Embeddings v2 → vectors in OpenSearch Serverless → Gradio web chatbot → troubleshooting assistant in the cluster), and (2) a Strands Agents SDK multi-agent system (Agent Orchestrator + Memory Agent + K8s Specialist; 1024-dim embeddings in S3 Vectors; EKS MCP Server exposing K8s operations as MCP tools; Slack bot as the UI; Pod Identity for AWS service access).
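The Kinesis → Lambda stage of architecture (1) can be sketched as a handler that decodes Kinesis records and embeds them in batches rather than per line. This is a hypothetical shape, not the sample repo's code; `embed_fn` stands in for the Bedrock Titan Embeddings v2 call plus the OpenSearch bulk index that the post describes:

```python
import base64
import json

def batch(items, size=25):
    """Split decoded records into batches so one model call covers many log lines."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def handler(event, embed_fn):
    """Hypothetical Lambda handler: decode Kinesis records, embed in batches.
    embed_fn(texts) -> list of vectors; in the real pipeline this would call
    Bedrock (amazon.titan-embed-text-v2:0) and index into OpenSearch Serverless."""
    logs = [
        json.loads(base64.b64decode(record["kinesis"]["data"]))["log"]
        for record in event["Records"]
    ]
    vectors = []
    for chunk in batch(logs):
        vectors.extend(embed_fn(chunk))  # one model call per batch, not per line
    return {"embedded": len(vectors)}
```

Batching at the Lambda layer is the post's explicit cost lever; the batch size of 25 here is illustrative.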

Key takeaways

  1. Two observability-layer stitching problems are collapsed into one interface. A Kubernetes operator traditionally runs kubectl describe, kubectl logs, kubectl get events, and cross-references Grafana dashboards + application logs in parallel terminals. The chatbot collapses both multi-layer telemetry correlation and real-time state queries behind one natural-language surface, bounded by what the allowlist permits. MTTR framing is the core business case: per the 2024 Observability Pulse Report cited in the post, 48% of organizations say lack of team knowledge is their biggest observability challenge and 82% say production-issue resolution takes >1h. (Source: sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications)
  2. Telemetry-to-RAG is a distinct pipeline shape, not a generic ingestion pipeline. The diagrammed pipeline is Fluent Bit → Kinesis Data Streams → Lambda → Bedrock (Titan Embeddings v2) → OpenSearch Serverless. Every layer is deliberate: Fluent Bit streams telemetry from pods with low overhead; Kinesis batches + smooths spikes; Lambda does normalization and embedding in batches for cost (explicit "Pro tip" in the post); OpenSearch Serverless removes cluster sizing from the critical path. The RAG retrieval step at query time is embed(user_query) → k-NN lookup in OpenSearch → prompt augmentation — the same shape any retrieval-augmented chatbot uses, now over operational telemetry instead of product docs.
  3. Real-time cluster state is the second context input, not a replacement for stored telemetry. The architecture deliberately combines historical telemetry (embedded in OpenSearch from the ingest pipeline) with live kubectl output from an in-cluster troubleshooting assistant. "This cycle gradually builds a richer picture of the issue by combining historical telemetry with real-time cluster state to speed up root cause analysis." Either signal alone is insufficient — the value is the fusion. Same insight as AWS DevOps Agent's K8s-API + OTel fusion but achieved with live kubectl as the real-time path.
  4. Allowlisted read-only kubectl is the agent-safety primitive. The troubleshooting assistant in the cluster "executes [kubectl commands] with a service account that has read-only permissions, following the principle of least privilege". Only kubectl operations on a static allowlist are permitted; the assistant cannot apply, patch, delete, or exec. This is the canonical allowlisted-read-only-agent-actions pattern — constraining the side-effect surface of an LLM-driven agent to a vetted set of verbs, with RBAC enforcement at the Kubernetes API server as a second line of defense (concepts/defense-in-depth). Note this is static allowlisting — the list is code, not LLM-chosen.
  5. RAG vs agentic-MCP is the current-state design fork. The post ships two deployments controlled by a single Terraform deployment_type variable: (1) classic RAG-based chatbot (default) where retrieval and kubectl execution are orchestrated by the chatbot app itself, and (2) Strands Agents SDK multi-agent system where Agent Orchestrator / Memory Agent / K8s Specialist each own a narrow scope (patterns/specialized-agent-decomposition), vectors live in S3 Vectors as 1024-dim embeddings for cost, and K8s operations are exposed via EKS MCP Server using the MCP protocol. The agentic shape replaces custom orchestration code with a standardized tool-call protocol, and replaces OpenSearch Serverless's RAM-heavy cost profile with S3's cold-tier vector storage. No quantitative comparison is offered.
  6. Iterative investigation, not single-shot prompting. The illustrated end-to-end flow is explicitly a loop: query → retrieve telemetry → LLM proposes kubectl commands → assistant runs them → output re-fed to LLM → LLM decides continue or conclude → (optional more rounds) → final resolution. This is the agentic troubleshooting loop primitive — LLM is the planner, the allowlist-constrained assistant is the hands, OpenSearch + live cluster are the eyes, and the stopping criterion is "enough context" (LLM-judged).
  7. Security discipline carried through the reference architecture. Named tactics: (a) strict kubectl allowlist + RBAC (read-only for pods/services/events/logs in specific namespaces); (b) sanitize application logs before embedding to prevent PII / secrets leaking into vectors; (c) KMS encryption for Kinesis in-transit and OpenSearch at-rest; (d) private subnets + VPC endpoints per the Well-Architected Security Pillar; (e) validate user inputs against prompt injection. The log-sanitization rule is notable: "sanitizing application logs before embedding generation to help prevent sensitive information exposure" — embeddings are derived artifacts and inherit the governance posture of their source data.
  8. Per-service compute-layer generality is asserted but not shown. The post explicitly claims the approach extends to ECS and Lambda: "a similar approach can be extended to other compute services like Amazon ECS or AWS Lambda". The telemetry shape (logs + events + metrics) is universal; the in-cluster troubleshooting assistant would be replaced by a service-specific executor (aws ecs describe-*, aws logs filter-log-events, CloudWatch Logs Insights). Only EKS is demonstrated.
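The agentic troubleshooting loop described above reduces to a small control structure. A minimal sketch with injected dependencies; `retrieve`, `llm`, and `execute` are stand-ins for the vector search, the reasoning model, and the in-cluster assistant, and `max_iters` is an assumed guard since the post does not specify an iteration cap:

```python
def troubleshoot(query, retrieve, llm, execute, max_iters=5):
    """LLM is the planner, the allowlist-constrained assistant the hands:
    retrieve(query)  -> historical telemetry snippets from vector search
    llm(context)     -> {"action": "run", "command": ...}
                        or {"action": "conclude", "answer": ...}
    execute(command) -> stdout of an allowlisted read-only kubectl command"""
    context = [f"user query: {query}"] + list(retrieve(query))
    for _ in range(max_iters):
        step = llm(context)
        if step["action"] == "conclude":
            return step["answer"]
        # Re-feed live cluster state into the LLM's context and loop again.
        context.append(f"$ {step['command']}\n{execute(step['command'])}")
    return "max iterations reached; escalate to a human operator"
```

The stopping criterion is LLM-judged ("enough context"); the fallback return is one way to bound the loop the post leaves unbounded.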
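The static kubectl allowlist described above is an application-layer gate in front of RBAC. A sketch of that gate; the post does not disclose the actual allowlist contents, so the verbs and resources here are assumptions:

```python
import shlex

# Hypothetical allowlist; the post does not publish the real one.
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}
READABLE_RESOURCES = {"pods", "services", "events", "deployments", "nodes"}

def is_allowed(command: str) -> bool:
    """Gate an LLM-proposed kubectl command before execution. RBAC on the
    read-only service account remains the second line of defense."""
    parts = shlex.split(command)
    if len(parts) < 2 or parts[0] != "kubectl":
        return False
    verb = parts[1]
    if verb not in READ_ONLY_VERBS:
        return False  # rejects apply, patch, delete, exec, edit, ...
    if verb in {"get", "describe"}:
        return len(parts) > 2 and parts[2] in READABLE_RESOURCES
    return True  # logs / top take a pod name rather than a resource kind
```

Note the allowlist is code, not LLM-chosen: the model proposes commands, but only this vetted verb/resource surface ever executes.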

Systems introduced

  • systems/strands-agents-sdk — open-source Python SDK for building agentic systems on AWS; multi-agent orchestration, tool calling, session management. Used in the post to build a three-agent system (Orchestrator / Memory / K8s Specialist).
  • systems/eks-mcp-server — AWS-Labs-published Model Context Protocol server exposing Kubernetes / EKS operations as MCP tools. The agent-native interface to a cluster; replaces hand-rolled kubectl wrappers.
  • systems/fluent-bit — CNCF telemetry processor and forwarder; lightweight in-pod or DaemonSet deployment collecting application logs, kubelet logs, and Kubernetes events. The canonical Kubernetes ingestion point feeding the RAG pipeline.
  • systems/amazon-kinesis-data-streams — managed durable streaming substrate; the buffer between Fluent Bit's firehose and Lambda's embedding work; enables batching for cost.

Systems extended

  • systems/aws-eks — the investigation target. The troubleshooting-assistant container runs as a pod with a read-only service account.
  • systems/amazon-bedrock — hosts Titan Embeddings v2 for the RAG path and the LLM (unspecified in the post) for reasoning / kubectl-command-generation / final-answer synthesis.
  • systems/amazon-opensearch-service — OpenSearch Serverless as the RAM-backed vector store for the RAG deployment; k-NN plugin serves retrieval at query time.
  • systems/s3-vectors — cold-tier vector store alternative in the Strands deployment; 1024-dimensional embeddings; cost-optimized vs OpenSearch Serverless's in-memory model.
  • systems/amazon-titan-embeddings — specific model named as amazon.titan-embed-text-v2:0.
  • systems/aws-lambda — telemetry-normalization + embedding-generation compute in the RAG pipeline; batching explicitly recommended.
  • systems/model-context-protocol — the agent ↔ tool protocol used by Strands + EKS MCP Server in deployment option 2.
  • systems/aws-kms — encryption at rest (OpenSearch) and in transit (Kinesis).

Concepts introduced

  • concepts/agentic-troubleshooting-loop — iterative LLM ↔ tool-assistant investigation cycle; LLM proposes queries, tool assistant executes, output re-enters LLM context, repeats until the LLM judges enough context for resolution.

Concepts extended

Patterns introduced

  • patterns/allowlisted-read-only-agent-actions — constrain an LLM-driven agent's side effects to a static allowlist of safe verbs (kubectl get/describe/logs/events), enforced at both application layer and platform RBAC. Generalizes across compute fabrics (ECS describe-* / Lambda get-function-* / any platform-side read-only API surface).
  • patterns/telemetry-to-rag-pipeline — streaming telemetry into a vector store for LLM augmentation: Fluent Bit → Kinesis → Lambda+Bedrock embeddings → OpenSearch / S3 Vectors; log sanitization before embedding; batch at the Lambda layer for cost; allow hot (OpenSearch Serverless) / cold (S3 Vectors) tiering choice.
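The sanitize-before-embedding step in the telemetry-to-rag-pipeline pattern can be sketched as a redaction pass over each log line before it reaches the embedding model. The patterns below are illustrative, not the post's implementation; a real deployment would tune them to its own log formats and likely add entity-based PII detection:

```python
import re

# Illustrative redaction rules, not the post's actual implementation.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"(?i)\b(password|token|secret|apikey)\b[=:]?\s*\S+"), r"\1=<REDACTED>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
]

def sanitize(log_line: str) -> str:
    """Strip PII and secrets before embedding: vectors are derived artifacts
    and inherit the governance posture of their source data."""
    for pattern, replacement in REDACTIONS:
        log_line = pattern.sub(replacement, log_line)
    return log_line
```

Sanitization has to happen on the ingest side (in the Lambda stage), since nothing downstream of the vector store can un-embed a leaked secret.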
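The query-time half of the telemetry-to-rag-pipeline pattern (embed the user query, k-NN lookup, prompt augmentation) amounts to two small builders. The index field name `embeddings` and the prompt wording are assumptions, and the actual OpenSearch client call is omitted:

```python
def knn_query(query_vector, k=5):
    """OpenSearch k-NN request body for retrieving similar telemetry;
    'embeddings' is an assumed field name."""
    return {
        "size": k,
        "query": {"knn": {"embeddings": {"vector": query_vector, "k": k}}},
    }

def augment_prompt(user_query, hits):
    """Prompt augmentation: prepend the retrieved telemetry to the question
    before it goes to the reasoning LLM."""
    context = "\n".join(hit["_source"]["text"] for hit in hits)
    return (
        "Relevant historical telemetry from this cluster:\n"
        f"{context}\n\nOperator question: {user_query}"
    )
```

The same builders would work against S3 Vectors in the Strands deployment, with only the lookup call swapped.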

Patterns extended

  • patterns/specialized-agent-decomposition — Strands deployment's three-agent split (Orchestrator / Memory / K8s Specialist) exemplifies decomposing agentic responsibility into narrow tool-surface scopes; same shape as Databricks Storex agents and Cloudflare's Agent Lee team.

Architecture diagrams referenced

The post includes four figures (inline ![] CloudFront PNGs from the original post — not captured in the raw markdown):

  • Figure 1: multitude of telemetry sources (kubelet logs, app logs, events, metrics) in a cluster.
  • Figure 2: telemetry ingestion — Fluent Bit → Kinesis → Lambda → Bedrock embeddings → OpenSearch Serverless.
  • Figure 3: chatbot retrieval + augmentation flow — user query → vector search → augmented prompt → LLM → kubectl command generation.
  • Figure 4: iterative troubleshooting loop — LLM ↔ assistant cycle with a conclude / continue decision.

Operational numbers / scale cited

| Item | Value | Source |
|---|---|---|
| Team-knowledge challenge | 48% of orgs | 2024 Observability Pulse Report |
| Production-issue resolution >1 hour | 82% of teams | 2024 Observability Pulse Report |
| S3 Vectors embedding dimensionality | 1024-dim | Strands deployment |
| Embedding model | amazon.titan-embed-text-v2:0 | RAG deployment |
| Strands agents | 3 (Orchestrator, Memory, K8s Specialist) | Post |

Not disclosed: post-deployment MTTR reduction, cost per query token, OpenSearch-Serverless capacity (OCUs), Kinesis shards, Lambda concurrency, query latency (retrieval + generation), embedding-job throughput, production cluster sizes, eval or accuracy metrics, prompt-injection guardrail implementation, specific LLM model used for reasoning, kubectl command allowlist contents, RBAC role definitions.

Caveats

  • Reference-architecture post, not a production retrospective. No customer-facing deployment is described; architecture is demonstrated via the sample GitHub repo + two re:Invent / KubeCon talks cited at the end. Marketing-leaning in tone around deployment_type flexibility.
  • No evaluation data — no accuracy numbers, no MTTR delta, no user studies, no prompt-injection-resistance testing cited, no hallucination-rate discussion, no guardrails specifics.
  • Two architectures presented in parallel rather than compared — the reader is not told when to prefer RAG vs Strands, or what the cost/latency/quality trade-offs are beyond "S3 Vectors is cost-optimized".
  • Compute-fabric generality is asserted, not shown. ECS and Lambda are mentioned as extending naturally; no examples or pipeline variations are given.
  • LLM model for reasoning is unspecified — Titan Embeddings v2 is named for embeddings but the reasoning model (Claude / Titan / Llama / etc) is left open, which is a surprising omission for a reference architecture.
  • No failure-mode discussion for the iterative loop itself — what happens if the LLM enters a query-loop it can't terminate? What's the max-iteration cutoff? How are contradictory signals (stored telemetry says A, live kubectl says B) reconciled?
  • Sits one notch below the 2026-03-18 AWS DevOps Agent post architecturally — that one explicitly names the two-path K8s discovery methodology (concepts/telemetry-based-resource-discovery), the baseline-learning step, and the confidence-scored RCA ranking. This earlier post stops at "loop until enough context".
