AWS DevOps Agent¶
AWS DevOps Agent is AWS's fully managed autonomous AI agent for EKS (and broader AWS) incident investigation and preventive recommendations. Built on Amazon Bedrock; accessed through a purpose-built web UI behind an Agent Space (the tenant configuration unit: IAM + IdP + data-source endpoints + scope). Preview-stage product at ingest time (Source: sources/2026-03-18-aws-ai-powered-event-response-for-amazon-eks).
Structural peer to Datadog's Bits AI SRE: same category (hosted agent for live-telemetry incident investigation) but different shipping shape — AWS managed service scoped to AWS cloud resources vs SaaS product scoped to Datadog's observability backend.
Data sources (per Agent Space)¶
- Metrics — Amazon Managed Prometheus workspace endpoint.
- Logs — Amazon CloudWatch Log groups.
- Traces — AWS X-Ray service configuration.
- Topology — EKS cluster resource graph (via Kubernetes API) + service-map information.
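The four data-source bindings above can be pictured as a per-Agent-Space configuration object. A minimal sketch, assuming a dict shape of my own invention (the real Agent Space configuration API is not documented in the source; every field name and endpoint here is illustrative):

```python
# Hypothetical sketch of per-Agent-Space data-source wiring. Field names,
# endpoints, and ARNs are invented placeholders, not the real AWS API shape.
agent_space = {
    "name": "payments-prod",  # one Agent Space = one tenant configuration unit
    "metrics": {              # Amazon Managed Prometheus workspace endpoint
        "workspace_endpoint": "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-example/",
    },
    "logs": {                 # Amazon CloudWatch Log groups
        "log_groups": ["/aws/eks/payments-prod/application"],
    },
    "traces": {               # AWS X-Ray service configuration
        "xray_enabled": True,
    },
    "topology": {             # EKS cluster resource graph via the Kubernetes API
        "cluster_arn": "arn:aws:eks:us-east-1:111122223333:cluster/payments-prod",
    },
}

def configured_sources(space: dict) -> list[str]:
    """Return which of the four data-source categories are wired up."""
    return [k for k in ("metrics", "logs", "traces", "topology") if space.get(k)]

print(configured_sources(agent_space))  # ['metrics', 'logs', 'traces', 'topology']
```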
Kubernetes-resource discovery methodology¶
Two complementary paths, then unify (concepts/telemetry-based-resource-discovery):
Static path (Kubernetes API scan).
- Initial scan: all resources in relevant namespaces.
- Metadata enrichment: labels / annotations / resource specs (CPU/memory requests & limits, health checks, env vars) / network topology (pod IPs, service cluster IPs, ingress rules, network policies).

Telemetry path (OpenTelemetry analysis).
- Service mesh analysis — network traffic patterns between pods → service-to-service communication edges.
- Trace correlation — distributed traces → request-flow map across microservices.
- Metric attribution — perf metrics → specific pods / containers / nodes.

Dependency graph build — combines static nodes with telemetry edges into a live graph.

Context aggregation — {resource state, recent events, performance data} flattened into a unified view the agent reasons over.
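The two-path unification can be sketched in a few lines: nodes come from the static Kubernetes API scan, edges from telemetry (here, caller/callee pairs as a trace would yield). All data is hard-coded stand-in; the real agent pulls from the Kubernetes API and OpenTelemetry pipelines, and this graph shape is an assumption, not the product's internal representation:

```python
# Sketch: combine static nodes (K8s scan) with telemetry-derived edges
# (trace caller/callee pairs) into one dependency graph.
from collections import defaultdict

static_nodes = {  # from the Kubernetes API scan, with metadata enrichment
    "checkout": {"kind": "Deployment", "cpu_request": "250m"},
    "payments": {"kind": "Deployment", "cpu_request": "500m"},
    "postgres": {"kind": "StatefulSet", "cpu_request": "1"},
}

# Telemetry path: (caller, callee) pairs extracted from distributed traces.
trace_spans = [
    ("checkout", "payments"),
    ("payments", "postgres"),
    ("checkout", "payments"),
]

def build_dependency_graph(nodes: dict, spans: list) -> dict:
    """Unify static nodes with telemetry edges; edge weight = call count."""
    edges = defaultdict(int)
    for caller, callee in spans:
        if caller in nodes and callee in nodes:  # drop edges to unknown resources
            edges[(caller, callee)] += 1
    return {"nodes": nodes, "edges": dict(edges)}

graph = build_dependency_graph(static_nodes, trace_spans)
print(graph["edges"])  # {('checkout', 'payments'): 2, ('payments', 'postgres'): 1}
```

The design point this illustrates: the static path alone knows what exists but not who talks to whom; the telemetry path supplies the edges.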
Investigation workflow¶
- Investigation trigger. User picks a scenario template (e.g. "High CPU usage", "Error rate spike", "Performance degradation") and defines scope (AWS account / region / cluster / app context) + time range.
- Data collection. Correlates Managed Prometheus metrics + CloudWatch logs + X-Ray traces + K8s topology for the window.
- Analysis. ML pattern/anomaly detection, cross-source correlation, statistical significance testing, comparison against established baseline (see next section).
- Root cause. Ranked list with confidence scores, contributing factors, correlated evidence, timeline reconstruction.
- Mitigation. Immediate + long-term recommendations, runbook-style guidance.
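The five-step workflow above can be sketched as a single pipeline. Everything here is invented for illustration (the managed service exposes no such API): the dataclass fields mirror the trigger inputs, and the stubbed evidence and ranked causes stand in for the real collection/analysis stages:

```python
# Hedged sketch of the trigger → collect → analyze → rank → recommend loop.
# All function names, fields, and data are illustrative stand-ins.
from dataclasses import dataclass, field

@dataclass
class Investigation:
    scenario: str        # scenario template, e.g. "Error rate spike"
    scope: dict          # AWS account / region / cluster / app context
    window: tuple        # (start, end) time range
    evidence: list = field(default_factory=list)
    root_causes: list = field(default_factory=list)

def investigate(inv: Investigation) -> Investigation:
    # Data collection: correlate metrics, logs, traces, topology (stubbed).
    inv.evidence = [
        {"source": "prometheus", "signal": "error_rate=0.25"},
        {"source": "cloudwatch", "signal": "OOMKilled events"},
    ]
    # Analysis + root cause: rank hypotheses by confidence score.
    inv.root_causes = sorted(
        [{"cause": "memory limit too low", "confidence": 0.8},
         {"cause": "upstream retry storm", "confidence": 0.4}],
        key=lambda c: c["confidence"], reverse=True)
    return inv

inv = investigate(Investigation("Error rate spike", {"cluster": "prod"}, ("t0", "t1")))
print(inv.root_causes[0]["cause"])  # memory limit too low
```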
Baseline learning¶
During normal operation the agent records typical request patterns / response times, normal error-rate distribution, resource-utilization per pod/node, and service-dependency edges. The baseline is what detection compares against — 5% error rate for 15 minutes at steady 10 RPS is "normal noise"; 25% error rate at 30 RPS on one service is a detectable deviation. The tutorial leaves the baseline-storage, baseline-update, and baseline-per-tenant design questions unspecified.
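The baseline comparison can be sketched with the tutorial's own numbers: 5% errors at ~10 RPS falls inside the learned baseline, 25% at 30 RPS does not. The threshold logic (a simple multiplicative factor) is my assumption; as noted above, the tutorial does not specify how baselines are stored, updated, or scoped per tenant:

```python
# Sketch of baseline-vs-observation deviation detection. The 3x factor and
# the baseline storage shape are invented; only the example numbers come
# from the tutorial.
baseline = {
    "payments": {"error_rate": 0.05, "rps": 10.0},  # learned "normal"
}

def is_deviation(service: str, error_rate: float, rps: float,
                 factor: float = 3.0) -> bool:
    """Flag when the error rate exceeds the baseline by `factor`x under
    at-least-baseline load (low-traffic blips are ignored)."""
    b = baseline[service]
    return error_rate > b["error_rate"] * factor and rps >= b["rps"]

print(is_deviation("payments", 0.05, 10.0))  # False — normal noise
print(is_deviation("payments", 0.25, 30.0))  # True — detectable deviation
```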
Preventive recommendations (separate surface)¶
On a weekly compute budget (~15 h, as disclosed in the tutorial), the agent re-analyzes past investigations and proposes recommendations across four categories:
- Code optimization (app-level)
- Observability (instrumentation gaps)
- Infrastructure (resource limits, HPA config, scaling policies)
- Governance (policy / compliance)
Demo screenshot shows all four categories at zero → consistent with preview-product status; no customer-production quality signal published.
Topology view¶
Interactive graph of discovered resources filtered by type (Containers / ECS clusters / StackSets / standalone). Demo screenshot: 1,806 total discovered resources in one Agent Space.
Design trade-off (vs Bits AI SRE)¶
| | AWS DevOps Agent | Bits AI SRE |
|---|---|---|
| Vendor relationship | Customer's AWS account owns the agent | Datadog-hosted SaaS |
| Scope | AWS resources (EKS + adjacent AWS services) | Datadog-ingested telemetry (multi-cloud) |
| Data sources | Managed Prometheus / CloudWatch Logs / X-Ray / K8s API | Datadog metrics / logs / traces / infra / network / monitors |
| Delivery | Agent Space in AWS Console | Datadog web UI |
| Exposed tools | Not disclosed | Not disclosed publicly, but Datadog MCP Server roadmap is convergent |
| Eval platform | Not disclosed | Dedicated offline replay platform with labels + trajectory scoring + pass@k + weekly full runs |
Caveats¶
- Preview product, tutorial-first post. Expect product-shape to change; don't over-index on any specific UI affordance.
- No eval numbers. Correctness, trajectory quality, regression behavior across model updates, incident-duration deltas vs a control — none disclosed.
- Tool inventory not documented. The agent's tool contract (PromQL queries, CloudWatch API calls, K8s API calls, X-Ray trace queries) is not exposed beyond one screenshot showing describe_alarm_history.
- One Agent Space per cluster demo shape. Multi-cluster / multi-account composition semantics not covered.
Seen in¶
- sources/2026-03-18-aws-ai-powered-event-response-for-amazon-eks — the original product-launch post. Canonical source for the discovery methodology + investigation-workflow shape.
- sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications — the earlier self-build blueprint (December 2025), superseded three months later by the 2026-03-18 managed-service launch. Same problem (Kubernetes troubleshooting with LLMs + telemetry), two vendor relationships: this page is the managed-service shape; the 2025-12-11 post is the customer-built shape with two deployment options (a RAG-based chatbot on OpenSearch Serverless, or Strands Agents + EKS MCP Server + S3 Vectors). The managed service adds the two-path discovery methodology, learned baselines, and confidence-scored RCA ranking; the underlying agentic troubleshooting loop is the same primitive. The pair cleanly illustrates the self-build → AWS-managed-service evolution for agent-augmented observability.
Related¶
- systems/bits-ai-sre — the Datadog peer; see design trade-off above.
- systems/amazon-bedrock — the underlying foundation-model runtime.
- systems/aws-eks, systems/kubernetes — the primary investigation target.
- systems/amazon-managed-prometheus, systems/aws-x-ray — data-source systems.
- concepts/telemetry-based-resource-discovery — the core methodology contribution of the post.
- concepts/observability — the input-substrate concept.