AWS DevOps Agent¶
AWS DevOps Agent is AWS's fully managed autonomous AI agent for EKS (and broader AWS) incident investigation and preventive recommendations. Built on Amazon Bedrock; accessed through a purpose-built web UI behind an Agent Space (the tenant configuration unit: IAM + IdP + data-source endpoints + scope). Preview-stage product at ingest time (Source: sources/2026-03-18-aws-ai-powered-event-response-for-amazon-eks).
Structural peer to Datadog's Bits AI SRE: same category (hosted agent for live-telemetry incident investigation) but different shipping shape — AWS managed service scoped to AWS cloud resources vs SaaS product scoped to Datadog's observability backend.
Data sources (per Agent Space)¶
- Metrics — Amazon Managed Prometheus workspace endpoint.
- Logs — Amazon CloudWatch Log groups.
- Traces — AWS X-Ray service configuration.
- Topology — EKS cluster resource graph (via Kubernetes API) + service-map information.
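The four data-source bindings above can be pictured as a per-Agent-Space configuration object. A minimal sketch, assuming a dict shape of my own invention (the real Agent Space configuration API is not documented in the source; every field name and endpoint here is illustrative):

```python
# Hypothetical sketch of per-Agent-Space data-source wiring. Field names,
# endpoints, and ARNs are invented placeholders, not the real AWS API shape.
agent_space = {
    "name": "payments-prod",  # one Agent Space = one tenant configuration unit
    "metrics": {              # Amazon Managed Prometheus workspace endpoint
        "workspace_endpoint": "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-example/",
    },
    "logs": {                 # Amazon CloudWatch Log groups
        "log_groups": ["/aws/eks/payments-prod/application"],
    },
    "traces": {               # AWS X-Ray service configuration
        "xray_enabled": True,
    },
    "topology": {             # EKS cluster resource graph via the Kubernetes API
        "cluster_arn": "arn:aws:eks:us-east-1:111122223333:cluster/payments-prod",
    },
}

def configured_sources(space: dict) -> list[str]:
    """Return which of the four data-source categories are wired up."""
    return [k for k in ("metrics", "logs", "traces", "topology") if space.get(k)]

print(configured_sources(agent_space))  # ['metrics', 'logs', 'traces', 'topology']
```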
Kubernetes-resource discovery methodology¶
Two complementary paths, then unify (concepts/telemetry-based-resource-discovery):
Static path (Kubernetes API scan).
- Initial scan: all resources in relevant namespaces.
- Metadata enrichment: labels / annotations / resource specs (CPU/memory requests & limits, health checks, env vars) / network topology (pod IPs, service cluster IPs, ingress rules, network policies).

Telemetry path (OpenTelemetry analysis).
- Service mesh analysis — network traffic patterns between pods → service-to-service communication edges.
- Trace correlation — distributed traces → request-flow map across microservices.
- Metric attribution — perf metrics → specific pods / containers / nodes.

Dependency graph build — combines static nodes with telemetry edges into a live graph.

Context aggregation — {resource state, recent events, performance data} flattened into a unified view the agent reasons over.
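The two-path unification can be sketched in a few lines: nodes come from the static Kubernetes API scan, edges from telemetry (here, caller/callee pairs as a trace would yield). All data is hard-coded stand-in; the real agent pulls from the Kubernetes API and OpenTelemetry pipelines, and this graph shape is an assumption, not the product's internal representation:

```python
# Sketch: combine static nodes (K8s scan) with telemetry-derived edges
# (trace caller/callee pairs) into one dependency graph.
from collections import defaultdict

static_nodes = {  # from the Kubernetes API scan, with metadata enrichment
    "checkout": {"kind": "Deployment", "cpu_request": "250m"},
    "payments": {"kind": "Deployment", "cpu_request": "500m"},
    "postgres": {"kind": "StatefulSet", "cpu_request": "1"},
}

# Telemetry path: (caller, callee) pairs extracted from distributed traces.
trace_spans = [
    ("checkout", "payments"),
    ("payments", "postgres"),
    ("checkout", "payments"),
]

def build_dependency_graph(nodes: dict, spans: list) -> dict:
    """Unify static nodes with telemetry edges; edge weight = call count."""
    edges = defaultdict(int)
    for caller, callee in spans:
        if caller in nodes and callee in nodes:  # drop edges to unknown resources
            edges[(caller, callee)] += 1
    return {"nodes": nodes, "edges": dict(edges)}

graph = build_dependency_graph(static_nodes, trace_spans)
print(graph["edges"])  # {('checkout', 'payments'): 2, ('payments', 'postgres'): 1}
```

The design point this illustrates: the static path alone knows what exists but not who talks to whom; the telemetry path supplies the edges.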
Investigation workflow¶
- Investigation trigger. User picks a scenario template (e.g. "High CPU usage", "Error rate spike", "Performance degradation") and defines scope (AWS account / region / cluster / app context) + time range.
- Data collection. Correlates Managed Prometheus metrics + CloudWatch logs + X-Ray traces + K8s topology for the window.
- Analysis. ML pattern/anomaly detection, cross-source correlation, statistical significance testing, comparison against established baseline (see next section).
- Root cause. Ranked list with confidence scores, contributing factors, correlated evidence, timeline reconstruction.
- Mitigation. Immediate + long-term recommendations, runbook-style guidance.
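The five-step workflow above can be sketched as a single pipeline. Everything here is invented for illustration (the managed service exposes no such API): the dataclass fields mirror the trigger inputs, and the stubbed evidence and ranked causes stand in for the real collection/analysis stages:

```python
# Hedged sketch of the trigger → collect → analyze → rank → recommend loop.
# All function names, fields, and data are illustrative stand-ins.
from dataclasses import dataclass, field

@dataclass
class Investigation:
    scenario: str        # scenario template, e.g. "Error rate spike"
    scope: dict          # AWS account / region / cluster / app context
    window: tuple        # (start, end) time range
    evidence: list = field(default_factory=list)
    root_causes: list = field(default_factory=list)

def investigate(inv: Investigation) -> Investigation:
    # Data collection: correlate metrics, logs, traces, topology (stubbed).
    inv.evidence = [
        {"source": "prometheus", "signal": "error_rate=0.25"},
        {"source": "cloudwatch", "signal": "OOMKilled events"},
    ]
    # Analysis + root cause: rank hypotheses by confidence score.
    inv.root_causes = sorted(
        [{"cause": "memory limit too low", "confidence": 0.8},
         {"cause": "upstream retry storm", "confidence": 0.4}],
        key=lambda c: c["confidence"], reverse=True)
    return inv

inv = investigate(Investigation("Error rate spike", {"cluster": "prod"}, ("t0", "t1")))
print(inv.root_causes[0]["cause"])  # memory limit too low
```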
Baseline learning¶
During normal operation the agent records typical request patterns / response times, normal error-rate distribution, resource-utilization per pod/node, and service-dependency edges. The baseline is what detection compares against — 5% error rate for 15 minutes at steady 10 RPS is "normal noise"; 25% error rate at 30 RPS on one service is a detectable deviation. The tutorial leaves the baseline-storage, baseline-update, and baseline-per-tenant design questions unspecified.
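The baseline comparison can be sketched with the tutorial's own numbers: 5% errors at ~10 RPS falls inside the learned baseline, 25% at 30 RPS does not. The threshold logic (a simple multiplicative factor) is my assumption; as noted above, the tutorial does not specify how baselines are stored, updated, or scoped per tenant:

```python
# Sketch of baseline-vs-observation deviation detection. The 3x factor and
# the baseline storage shape are invented; only the example numbers come
# from the tutorial.
baseline = {
    "payments": {"error_rate": 0.05, "rps": 10.0},  # learned "normal"
}

def is_deviation(service: str, error_rate: float, rps: float,
                 factor: float = 3.0) -> bool:
    """Flag when the error rate exceeds the baseline by `factor`x under
    at-least-baseline load (low-traffic blips are ignored)."""
    b = baseline[service]
    return error_rate > b["error_rate"] * factor and rps >= b["rps"]

print(is_deviation("payments", 0.05, 10.0))  # False — normal noise
print(is_deviation("payments", 0.25, 30.0))  # True — detectable deviation
```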
Preventive recommendations (separate surface)¶
On a weekly compute budget (~15 h, as disclosed in the tutorial), the agent re-analyzes past investigations and proposes recommendations across four categories:
- Code optimization (app-level)
- Observability (instrumentation gaps)
- Infrastructure (resource limits, HPA config, scaling policies)
- Governance (policy / compliance)
Demo screenshot shows all four categories at zero → consistent with preview-product status; no customer-production quality signal published.
Topology view¶
Interactive graph of discovered resources filtered by type (Containers / ECS clusters / StackSets / standalone). Demo screenshot: 1,806 total discovered resources in one Agent Space.
Design trade-off (vs Bits AI SRE)¶
| | AWS DevOps Agent | Bits AI SRE |
|---|---|---|
| Vendor relationship | Customer's AWS account owns the agent | Datadog-hosted SaaS |
| Scope | AWS resources (EKS + adjacent AWS services) | Datadog-ingested telemetry (multi-cloud) |
| Data sources | Managed Prometheus / CloudWatch Logs / X-Ray / K8s API | Datadog metrics / logs / traces / infra / network / monitors |
| Delivery | Agent Space in AWS Console | Datadog web UI |
| Exposed tools | Not disclosed | Not disclosed publicly, but Datadog MCP Server roadmap is convergent |
| Eval platform | Not disclosed | Dedicated offline replay platform with labels + trajectory scoring + pass@k + weekly full runs |
Caveats¶
- Preview product, tutorial-first post. Expect product-shape to change; don't over-index on any specific UI affordance.
- No eval numbers. Correctness, trajectory quality, regression behavior across model updates, incident-duration deltas vs a control — none disclosed.
- Tool inventory not documented. The agent's tool contract (PromQL queries, CloudWatch API calls, K8s API calls, X-Ray trace queries) is not exposed beyond one screenshot showing describe_alarm_history.
- One Agent Space per cluster demo shape. Multi-cluster / multi-account composition semantics not covered.
Seen in¶
- sources/2026-03-18-aws-ai-powered-event-response-for-amazon-eks — the original product-launch post. Canonical source for the discovery methodology + investigation-workflow shape.
- sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications — the earlier self-build blueprint (December 2025), superseded three months later by the 2026-03-18 managed-service launch. Same problem (Kubernetes troubleshooting with LLMs + telemetry), two vendor relationships: this page is the managed-service shape; the 2025-12-11 post is the customer-built shape with two deployment options (a RAG-based chatbot on OpenSearch Serverless, or Strands Agents + EKS MCP Server + S3 Vectors). The managed service adds the two-path discovery methodology, learned baselines, and confidence-scored RCA ranking; the underlying agentic troubleshooting loop is the same primitive. The pair cleanly illustrates the self-build → AWS-managed-service evolution for agent-augmented observability.
Related¶
- systems/bits-ai-sre — the Datadog peer; see design trade-off above.
- systems/amazon-bedrock — the underlying foundation-model runtime.
- systems/aws-eks, systems/kubernetes — the primary investigation target.
- systems/amazon-managed-prometheus, systems/aws-x-ray — data-source systems.
- concepts/telemetry-based-resource-discovery — the core methodology contribution of the post.
- concepts/observability — the input-substrate concept.