

AWS DevOps Agent

AWS DevOps Agent is AWS's fully managed autonomous AI agent for EKS (and broader AWS) incident investigation and preventive recommendations. Built on Amazon Bedrock; accessed through a purpose-built web UI behind an Agent Space (tenant configuration unit — IAM + IdP + data-source endpoints + scope). Preview-stage product at ingest time (Source: sources/2026-03-18-aws-ai-powered-event-response-for-amazon-eks).

Structural peer to Datadog's Bits AI SRE: same category (hosted agent for live-telemetry incident investigation) but different shipping shape — AWS managed service scoped to AWS cloud resources vs SaaS product scoped to Datadog's observability backend.

Data sources (per Agent Space)

  • Metrics — Amazon Managed Prometheus workspace endpoint.
  • Logs — Amazon CloudWatch log groups.
  • Traces — AWS X-Ray service configuration.
  • Topology — EKS cluster resource graph (via the Kubernetes API) + service-map information.

Kubernetes-resource discovery methodology

Two complementary paths, then a unification step (concepts/telemetry-based-resource-discovery):

Static path (Kubernetes API scan).
  • Initial scan — all resources in relevant namespaces.
  • Metadata enrichment — labels / annotations / resource specs (CPU/memory requests & limits, health checks, env vars) / network topology (pod IPs, service cluster IPs, ingress rules, network policies).
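
The metadata-enrichment step can be sketched as a pure function over a pod object in the shape the Kubernetes API returns (here a plain dict; field names follow the real Pod schema, but `enrich_pod` itself is a hypothetical helper, not the agent's code):

```python
# Extract the fields the static path reportedly indexes from one Pod object.
# Input dict mirrors the Kubernetes Pod JSON schema; enrich_pod() is illustrative.

def enrich_pod(pod: dict) -> dict:
    meta = pod.get("metadata", {})
    containers = pod.get("spec", {}).get("containers", [])
    return {
        "name": meta.get("name"),
        "namespace": meta.get("namespace"),
        "labels": meta.get("labels", {}),
        "annotations": meta.get("annotations", {}),
        # CPU/memory requests & limits, per container
        "resources": {c["name"]: c.get("resources", {}) for c in containers},
        # health checks
        "probes": {
            c["name"]: {k: c[k] for k in ("livenessProbe", "readinessProbe") if k in c}
            for c in containers
        },
        # env var names only (values may hold secrets)
        "env": {c["name"]: [e["name"] for e in c.get("env", [])] for c in containers},
        # network topology
        "pod_ip": pod.get("status", {}).get("podIP"),
    }

pod = {
    "metadata": {"name": "web-7d9", "namespace": "shop", "labels": {"app": "web"}},
    "spec": {"containers": [{
        "name": "web",
        "resources": {"requests": {"cpu": "250m", "memory": "128Mi"}},
        "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
        "env": [{"name": "DB_HOST", "value": "db.shop"}],
    }]},
    "status": {"podIP": "10.0.1.17"},
}
print(enrich_pod(pod)["resources"]["web"]["requests"]["cpu"])  # 250m
```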

Telemetry path (OpenTelemetry analysis).
  • Service mesh analysis — network traffic patterns between pods → service-to-service communication edges.
  • Trace correlation — distributed traces → request-flow map across microservices.
  • Metric attribution — performance metrics → specific pods / containers / nodes.
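
Trace correlation amounts to turning parent/child span links into service-to-service edges. A minimal sketch, assuming spans carry OpenTelemetry-style trace/span/parent IDs and a service name (the dict shape and `derive_edges` are assumptions, not a disclosed API):

```python
# Derive caller -> callee service edges from a batch of spans by resolving each
# span's parent within the same trace and keeping cross-service hops only.

def derive_edges(spans: list[dict]) -> set[tuple[str, str]]:
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get((s["trace_id"], s.get("parent_id")))
        if parent and parent["service"] != s["service"]:
            edges.add((parent["service"], s["service"]))  # caller -> callee
    return edges

spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "frontend"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "cart"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "db"},
]
print(sorted(derive_edges(spans)))  # [('cart', 'db'), ('frontend', 'cart')]
```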

  • Dependency graph build — combines static nodes with telemetry edges into a live graph.
  • Context aggregation — {resource state, recent events, performance data} flattened into a unified view the agent reasons over.
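
The unification step can be sketched as a merge: statically discovered resources become nodes, telemetry-derived edges attach call relationships, and endpoints seen only in telemetry still get a node. The dict structure here is an assumption for illustration, not the agent's schema:

```python
# Merge static inventory (name -> metadata) with telemetry edges (src, dst)
# into one dependency graph; nodes seen only in telemetry get empty metadata.

def build_graph(static_nodes: dict, telemetry_edges: set) -> dict:
    graph = {name: {"meta": meta, "calls": []} for name, meta in static_nodes.items()}
    for src, dst in telemetry_edges:
        graph.setdefault(src, {"meta": {}, "calls": []})
        graph.setdefault(dst, {"meta": {}, "calls": []})
        graph[src]["calls"].append(dst)
    return graph

nodes = {"frontend": {"replicas": 3}, "cart": {"replicas": 2}}
edges = {("frontend", "cart"), ("cart", "db")}  # "db" appears only in telemetry
g = build_graph(nodes, edges)
print(sorted(g))  # ['cart', 'db', 'frontend']
```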

Investigation workflow

  1. Investigation trigger. User picks a scenario template (e.g. "High CPU usage", "Error rate spike", "Performance degradation") and defines scope (AWS account / region / cluster / app context) + time range.
  2. Data collection. Correlates Managed Prometheus metrics + CloudWatch logs + X-Ray traces + K8s topology for the window.
  3. Analysis. ML pattern/anomaly detection, cross-source correlation, statistical significance testing, comparison against established baseline (see next section).
  4. Root cause. Ranked list with confidence scores, contributing factors, correlated evidence, timeline reconstruction.
  5. Mitigation. Immediate + long-term recommendations, runbook-style guidance.
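
Step 4's output shape (a ranked list with confidence scores backed by correlated evidence) can be sketched as follows; the scoring rule and field names are invented for illustration, since the tutorial shows the ranked list but not the algorithm behind it:

```python
# Rank root-cause candidates by accumulated evidence weight, normalized into a
# confidence score. Evidence items and weights are toy values, not agent output.

def rank_candidates(evidence: list[dict]) -> list[dict]:
    scores: dict[str, float] = {}
    for e in evidence:
        scores[e["suspect"]] = scores.get(e["suspect"], 0.0) + e["weight"]
    total = sum(scores.values()) or 1.0
    return [
        {"root_cause": s, "confidence": round(w / total, 2)}
        for s, w in sorted(scores.items(), key=lambda kv: -kv[1])
    ]

evidence = [
    {"suspect": "cart OOMKilled", "weight": 0.6},       # e.g. from CloudWatch logs
    {"suspect": "cart OOMKilled", "weight": 0.3},       # e.g. from K8s events
    {"suspect": "node CPU throttling", "weight": 0.1},  # e.g. from Prometheus
]
print(rank_candidates(evidence)[0])  # {'root_cause': 'cart OOMKilled', 'confidence': 0.9}
```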

Baseline learning

During normal operation the agent records typical request patterns / response times, normal error-rate distribution, resource-utilization per pod/node, and service-dependency edges. The baseline is what detection compares against — 5% error rate for 15 minutes at steady 10 RPS is "normal noise"; 25% error rate at 30 RPS on one service is a detectable deviation. The tutorial leaves the baseline-storage, baseline-update, and baseline-per-tenant design questions unspecified.
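
The numbers from the text can be made concrete with a one-proportion z-test of the window's error rate against the learned baseline rate (the agent's actual statistics are undisclosed; this is purely illustrative):

```python
# z-score of an observed window error rate against a baseline rate, using the
# binomial standard error. 15 min at 10 RPS = 9,000 requests; at 30 RPS = 27,000.

import math

def error_rate_z(baseline_rate: float, window_rate: float, window_requests: int) -> float:
    se = math.sqrt(baseline_rate * (1 - baseline_rate) / window_requests)
    return (window_rate - baseline_rate) / se

# 5% errors at 10 RPS: matches the baseline exactly -> z = 0, "normal noise"
print(round(error_rate_z(0.05, 0.05, 9000), 1))   # 0.0
# 25% errors at 30 RPS vs the 5% baseline: an enormous, detectable deviation
print(round(error_rate_z(0.05, 0.25, 27000), 1))
```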

Preventive recommendations (separate surface)

On a weekly budget (~15 h of compute, per the tutorial), the agent re-analyzes past investigations and proposes recommendations across four categories:

  • Code optimization (app-level)
  • Observability (instrumentation gaps)
  • Infrastructure (resource limits, HPA config, scaling policies)
  • Governance (policy / compliance)

Demo screenshot shows all four categories at zero → consistent with preview status; no customer-production quality signal published.

Topology view

Interactive graph of discovered resources filtered by type (Containers / ECS clusters / StackSets / standalone). Demo screenshot: 1,806 total discovered resources in one Agent Space.

Design trade-off (vs Bits AI SRE)

AWS DevOps Agent vs Bits AI SRE:

  • Vendor relationship — customer's AWS account owns the agent vs Datadog-hosted SaaS.
  • Scope — AWS resources (EKS + adjacent AWS services) vs Datadog-ingested telemetry (multi-cloud).
  • Data sources — Managed Prometheus / CloudWatch Logs / X-Ray / K8s API vs Datadog metrics / logs / traces / infra / network / monitors.
  • Delivery — Agent Space in the AWS Console vs Datadog web UI.
  • Exposed tools — not disclosed vs not disclosed publicly (though Datadog's MCP Server roadmap is convergent).
  • Eval platform — not disclosed vs a dedicated offline replay platform (labels + trajectory scoring + pass@k + weekly full runs).

Caveats

  • Preview product, tutorial-first post. Expect product-shape to change; don't over-index on any specific UI affordance.
  • No eval numbers. Correctness, trajectory quality, regression behavior across model updates, incident-duration deltas vs a control — none disclosed.
  • Tool inventory not documented. The agent's tool contract (PromQL queries, CloudWatch API calls, K8s API calls, X-Ray trace queries) is not exposed beyond one screenshot showing describe_alarm_history.
  • One-Agent-Space-per-cluster demo shape. Multi-cluster / multi-account composition semantics not covered.
