AI-powered event response for Amazon EKS¶
AWS Architecture Blog product post on AWS DevOps Agent, a fully managed autonomous AI agent (built on Amazon Bedrock) that investigates operational events on Amazon EKS clusters. Serves as the AWS-vendor peer to Datadog's Bits AI SRE — same category (specialized-agent-for-incident-response, web-UI-over-live-telemetry) but different shipping shape (AWS managed service vs SaaS product). Post is tutorial-heavy (sample EKS cluster deploy + Python traffic generator + UI walkthrough across four numbered scenarios) with thin quantitative substance; architectural contribution is the telemetry-based Kubernetes resource-discovery methodology and the multi-source data-correlation + baseline-learning + confidence-scored RCA investigation workflow.
Summary¶
The agent builds a live dependency graph of an EKS cluster by combining two discovery paths — a Kubernetes API scan for resource state + an OpenTelemetry telemetry scan for runtime relationships — then correlates metrics (Amazon Managed Prometheus), logs (Amazon CloudWatch Logs), traces (AWS X-Ray), and topology into a unified context for investigation. Investigations flow through four phases: data collection → ML/statistical pattern analysis against an established baseline → confidence-scored root-cause ranking → mitigation recommendations. Tutorial uses the aws-samples/sample-eks-containers-observability repo + a Python traffic-generator to establish a 15-minute baseline at 10 RPS / 5% error, then simulate a production event targeting one app at 30 RPS / 25% error for 10 minutes.
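The post only shows CLI invocations of the Python traffic generator, not its internals. A minimal sketch of the two tutorial phases, assuming hypothetical endpoint names (`/healthz` and `/error` are illustrative, not from the aws-samples repo):

```python
import random
import time
import urllib.request


def plan_requests(rps, error_rate, duration_s, seed=0):
    """Return the request paths for one phase: rps * duration_s requests,
    with roughly error_rate of them aimed at a hypothetical /error endpoint
    that the target app answers with HTTP 500s."""
    rng = random.Random(seed)
    total = rps * duration_s
    return ["/error" if rng.random() < error_rate else "/healthz"
            for _ in range(total)]


def run_phase(base_url, rps, error_rate, duration_s):
    """Replay one phase against a port-forwarded cluster endpoint
    at roughly rps requests per second."""
    for path in plan_requests(rps, error_rate, duration_s):
        try:
            urllib.request.urlopen(base_url + path, timeout=2)
        except Exception:
            pass  # errors from /error are intentional load-shape, not bugs
        time.sleep(1.0 / rps)


# The two phases from the tutorial:
#   baseline: run_phase(url, rps=10, error_rate=0.05, duration_s=900)  # 15 min
#   event:    run_phase(url, rps=30, error_rate=0.25, duration_s=600)  # 10 min
```

The two commented invocations reproduce the post's numbers: a 15-minute baseline at 10 RPS / 5% errors, then a 10-minute simulated event at 30 RPS / 25% errors against one app.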
Key takeaways¶
- Two-path Kubernetes discovery. Agent queries the Kubernetes API for static resource state (labels / annotations / resource specs / network topology) and analyzes OpenTelemetry data for runtime relationships — service-mesh traffic patterns (pod↔pod), distributed traces (request flow across microservices), metric attribution (perf metric → pod / container / node). Neither path alone is sufficient: the K8s API gives you the graph nodes, OTel gives you the weighted edges. Unified context = {state, recent events, performance data}. (concepts/telemetry-based-resource-discovery)
- Baseline → deviation → RCA workflow. Scenario-1 "normal load" establishes baselines across request patterns, error rates, resource utilization, and service dependencies. Scenario-2 "simulated production event" (target one app at 3× baseline RPS / 5× baseline error rate) triggers detection, error-mode decomposition (HTTP 500 / timeout / connection refused), resource correlation, and a ranked confidence-scored list of potential root causes. This mirrors the Bits AI SRE shape (live-telemetry → compose-tools → attribute cause), but the AWS post doesn't expose the agent-tool contract or give eval/correctness numbers — unlike Datadog's eval-platform retrospective.
- "Prevention" surface as weekly-eval feed. Separate Prevention tab where the agent aggregates patterns across past investigations and proposes code / observability / infrastructure / governance recommendations on a weekly budget (~15h compute). Demo screenshot shows an empty recommendation set ("no new recommendations generated for the past week") — confirms preview-product status; no customer-production quality signal published.
- Topology view at 1,806 resources discovered. One demo screenshot cites 1,806 resources in one Agent Space; no other scaling / latency / cost / investigation-duration numbers disclosed. Compare Datadog's eval platform, which quantifies label quality (↑30%), validation time (↓95%), and per-label throughput — AWS's post is purely qualitative.
- Named data-source integrations. Amazon Managed Prometheus (metrics), Amazon CloudWatch Logs (logs), AWS X-Ray (traces), EKS topology (K8s resource graph). Agent Space is the tenant configuration unit — one per EKS cluster — carrying IAM roles, IdP access, data-source endpoints, and scope definition (account + region + cluster + app context).
- Tutorial-heavy delivery. Majority of the post is `kubectl port-forward` + `pip install requests` + traffic-gen CLI invocations + Agent-Space-creation-form screenshots, not architecture. Read for the discovery methodology and investigation-workflow shape; don't expect tool-composition internals or eval detail.
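The post describes the two-path discovery qualitatively ("K8s API gives you the nodes, OTel gives you the edges") but shows no data shapes. A toy sketch of the merge, with entirely hypothetical resource names and span pairs:

```python
from collections import defaultdict

# Hypothetical outputs of the two discovery paths the post describes:
# the Kubernetes API scan yields resource nodes with static state, and
# the OpenTelemetry scan yields observed caller->callee pairs from traces.
k8s_resources = [
    {"name": "frontend", "kind": "Deployment", "labels": {"app": "frontend"}},
    {"name": "cart", "kind": "Deployment", "labels": {"app": "cart"}},
    {"name": "cart-db", "kind": "StatefulSet", "labels": {"app": "cart-db"}},
]
trace_spans = [  # (caller, callee) pairs inferred from distributed traces
    ("frontend", "cart"),
    ("frontend", "cart"),
    ("cart", "cart-db"),
]


def build_dependency_graph(resources, spans):
    """Nodes come from the K8s API scan (static state); weighted edges
    come from telemetry (how often one service actually calls another)."""
    nodes = {r["name"]: r for r in resources}
    edges = defaultdict(int)
    for caller, callee in spans:
        if caller in nodes and callee in nodes:
            edges[(caller, callee)] += 1
    return nodes, dict(edges)


nodes, edges = build_dependency_graph(k8s_resources, trace_spans)
# edges == {("frontend", "cart"): 2, ("cart", "cart-db"): 1}
```

The edge weights are the point of the telemetry path: the static K8s API scan could never tell you that `frontend` calls `cart` twice as often as `cart` calls its database.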
Systems extracted¶
- systems/aws-devops-agent — the agent itself; new system page
- systems/aws-eks — target substrate (existing)
- systems/amazon-bedrock — underlying model runtime; stub
- systems/amazon-managed-prometheus — metrics source; stub
- systems/aws-x-ray — trace source; stub
- systems/kubernetes — discovery target (existing)
Concepts extracted¶
- concepts/telemetry-based-resource-discovery — new: using OpenTelemetry traces/metrics/service-mesh data to infer runtime relationships between Kubernetes resources that the static K8s API cannot reveal
- concepts/observability — metrics / logs / traces triad, the triple-input substrate for AI-driven investigation (existing)
Operational numbers¶
- 1,806 discovered resources in one demo Agent Space topology view
- 15h weekly compute budget on the Prevention (preventive-recommendations) tab
- Baseline traffic: 10 RPS / 5% error / 15-min duration across 4 apps
- Simulated event: 30 RPS / 25% error / 10-min duration on one target app
- No disclosed end-to-end investigation latency, tool-call count, accuracy against a labelled set, cost, or cold-start
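The post names "statistical analysis" against the baseline without quantifying it. As a crude stand-in, the tutorial's own numbers make the deviation trivially detectable — a binomial z-score of the event-phase error rate against the 5% baseline:

```python
import math


def error_rate_zscore(baseline_rate, observed_errors, observed_total):
    """z-score of an observed error rate against a binomial baseline.
    A simple stand-in for the post's unquantified 'statistical analysis';
    the agent's actual method is not disclosed."""
    p = baseline_rate
    observed = observed_errors / observed_total
    stderr = math.sqrt(p * (1 - p) / observed_total)
    return (observed - p) / stderr


# Simulated event: 30 RPS for 10 min = 18,000 requests, 25% of them errors,
# against the 5% baseline error rate.
z = error_rate_zscore(0.05, observed_errors=4500, observed_total=18000)
# z comes out above 100 standard errors — an unambiguous deviation
```

At this sample size even a far subtler shift would clear any conventional threshold, which is why the tutorial's 5× error-rate jump is a demo-friendly rather than realistic detection test.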
Caveats¶
- Preview product, tutorial-first post. Post is a product walkthrough for a preview-stage service — screenshots show "no recommendations this week" on the Prevention tab. Don't over-weight architectural claims.
- No eval / quality numbers. Unlike the Datadog Bits AI SRE eval-platform post, AWS doesn't publish how they measure agent correctness, trajectory, or regression across model updates. The investigation-workflow steps ("ML algorithms", "statistical analysis", "confidence scoring") are named but not quantified.
- Tool-composition internals not exposed. The agent's actual tool inventory (CloudWatch API calls, Prometheus PromQL queries, X-Ray trace queries, K8s API calls) is not documented beyond one screenshot showing a `describe_alarm_history` call. Contrast with the explicit tool contract in Datadog's MCP-server retrospective.
- Agent Space quotas / multi-tenancy semantics absent. One Agent Space per cluster is the demo shape; how an enterprise with thousands of EKS clusters composes Agent Spaces isn't covered.
- Figure captions / screenshots do most of the architectural work. The body is skeletal between figure groups — treat the article as a product-launch blog, not an engineering retrospective.
Raw source¶
raw/aws/2026-03-18-ai-powered-event-response-for-amazon-eks-e525e1fd.md