
AWS 2026-06-22


Architecting AI-powered resilience framework on AWS

Summary

This AWS Architecture Blog post presents a five-layer AI-powered resilience framework that automates the discovery of infrastructure dependencies, generates targeted chaos experiments, and integrates resilience testing into CI/CD pipelines. The framework uses AWS Resilience Hub as its orchestration center, with custom AI agents on Amazon Bedrock AgentCore performing dependency discovery and experiment design. The key innovation is removing the expertise barrier from chaos engineering: automated discovery completes in 2–4 hours where manual infrastructure mapping takes weeks, and AI-generated experiments eliminate the need for specialized chaos-engineering knowledge.

Key takeaways

  1. Five-layer architecture: Discovery → Test Generation → Experimentation → Gap Analysis → Continuous Validation — each layer builds on the previous, with feedback loops creating a continuous improvement cycle (Source: Solution overview section).

  2. Automated dependency discovery completes in 2–4 hours for environments with thousands of resources, replacing weeks of manual infrastructure mapping. Subsequent runs process only changes tracked by AWS Config (Source: Discovery layer section).

  3. Progressive scope expansion limits blast radius during chaos experiments: starts at 1% of resources, expands incrementally (1% → 5% → 10% → 25%) based on validation results and risk tolerance (Source: Experimentation layer section).

  4. Two-tiered CI/CD resilience gates: (a) lightweight policy-as-code checks (seconds, every commit) catch configuration issues like missing health checks; (b) full resilience assessments (2–3 min per experiment) run on significant architectural changes (Source: Continuous validation layer section).

  5. Stop conditions at one-tenth of the SLA threshold: if the SLA allows a 1% error rate, stop conditions trigger at 0.1%, so an experiment halts well before the SLA itself is at risk (Source: Experimentation layer section).

  6. Business impact scoring prioritizes experiments by severity × likelihood × business impact, focusing on customer-facing systems and components with high-availability architectural patterns (Source: Test generation layer section).

  7. Tiered resilience policies at enterprise scale: Mission-critical (RTO < 15 min, RPO < 5 min, 99.99%, weekly experiments), Business-critical (RTO < 1 hr, RPO < 15 min, 99.9%, monthly), Non-critical (longer windows, quarterly) (Source: Enterprise deployment section).

  8. Hub-and-spoke multi-account model for enterprise: centralized resilience testing infrastructure in a management account with distributed application ownership in spoke accounts, using AWS Organizations for cross-account experiment coordination (Source: Enterprise deployment section).

  9. Feedback loops close the learning gap: experiment results feed back into discovery (undocumented dependencies update the architecture map) and test generation (deprioritize passing scenarios, focus on emerging risks); SSM automation documents capture validated recovery procedures (Source: Continuous improvement section).

  10. Shift-left to IaC/code scanning: The post identifies the "next frontier" as scanning Infrastructure as Code and application code for resilience anti-patterns (missing circuit breakers, single-AZ dependencies) at the pull-request stage before any resource is deployed (Source: Conclusion section).
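
Takeaways 3 and 5 describe a control loop: widen the experiment scope step by step and halt as soon as the error rate reaches one-tenth of the SLA limit. The sketch below illustrates that logic only. It does not call AWS FIS or CloudWatch; `measure_error_rate` is a hypothetical hook standing in for whatever runs a stage and reads its metrics, and the sample error rates are made up for the demonstration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Scope steps from the summary: 1% -> 5% -> 10% -> 25% of resources.
SCOPE_STEPS = [1, 5, 10, 25]


def stop_threshold(sla_error_rate: float, margin: float = 10.0) -> float:
    """Error rate at which an experiment halts: the SLA limit divided by the margin."""
    return sla_error_rate / margin


@dataclass
class StageResult:
    scope_percent: int
    observed_error_rate: float
    passed: bool


def run_progressive(
    measure_error_rate: Callable[[int], float],  # hypothetical: scope % -> observed error rate
    sla_error_rate: float,
) -> List[StageResult]:
    """Run stages in order and stop at the first one that reaches the stop threshold."""
    threshold = stop_threshold(sla_error_rate)
    results: List[StageResult] = []
    for scope in SCOPE_STEPS:
        observed = measure_error_rate(scope)
        passed = observed < threshold
        results.append(StageResult(scope, observed, passed))
        if not passed:
            break  # never widen the blast radius after a breach
    return results


# Illustrative, made-up measurements keyed by scope percentage.
fake_rates = {1: 0.0002, 5: 0.0004, 10: 0.0015, 25: 0.0}
results = run_progressive(lambda scope: fake_rates[scope], sla_error_rate=0.01)
for r in results:
    print(r.scope_percent, r.observed_error_rate, r.passed)
```

With a 1% SLA the threshold is 0.1%, so the made-up run above halts at the 10% stage and never reaches 25%. In a real deployment the stop condition would more likely be enforced by the experiment service itself rather than by a polling loop like this one.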

Architecture

The five layers:

| Layer | Function | AWS Services |
|---|---|---|
| 1. Discovery | Map infrastructure + code-level dependencies | Resilience Hub, Bedrock AgentCore, Config |
| 2. Test Generation | AI-generated experiment templates with safety guardrails | Bedrock AgentCore, FIS, Step Functions |
| 3. Experimentation | Execute chaos tests with progressive scope + stop conditions | FIS, CloudWatch |
| 4. Gap Analysis | Correlate results with resilience policies, prioritize remediation | Resilience Hub |
| 5. Continuous Validation | CI/CD integration, drift detection, dashboards | CodePipeline, Config, QuickSight |
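
The Test Generation layer ranks candidate experiments by severity × likelihood × business impact (takeaway 6). A minimal sketch of that ranking follows. The 1-to-5 scales, the `Scenario` type, and the scenario names are assumptions made for illustration; the summary does not say how the post quantifies each factor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scenario:
    name: str
    severity: int         # assumed 1-5 scale
    likelihood: int       # assumed 1-5 scale
    business_impact: int  # assumed 1-5 scale

    @property
    def score(self) -> int:
        # Product of the three factors named in the summary.
        return self.severity * self.likelihood * self.business_impact


def prioritize(scenarios: List[Scenario]) -> List[Scenario]:
    """Return scenarios ordered from highest to lowest score."""
    return sorted(scenarios, key=lambda s: s.score, reverse=True)


# Hypothetical scenarios, not taken from the post.
candidates = [
    Scenario("batch-report-az-loss", severity=3, likelihood=2, business_impact=2),
    Scenario("checkout-db-failover", severity=5, likelihood=3, business_impact=5),
    Scenario("api-throttling", severity=4, likelihood=4, business_impact=4),
]
ranked = prioritize(candidates)
for s in ranked:
    print(s.name, s.score)
```

Because the factors multiply, a scenario that rates low on any one of them drops quickly in the ranking, which matches the stated focus on customer-facing systems.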

Operational numbers

  • Initial discovery: 2–4 hours (thousands of resources, single account)
  • Policy-as-code check: seconds per pipeline run
  • Full resilience assessment: 2–3 minutes per experiment
  • Enterprise scale: 100+ Tier 1 applications (weekly experiments), 500+ Tier 2 (monthly), 1,000+ Tier 3 (quarterly)
  • Pilot implementation: 4–6 hours with 2–3 engineers
  • Enterprise rollout timeline: 8–12 weeks
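
To get a feel for what the enterprise-scale figures imply, the arithmetic below estimates a lower bound on yearly experiment volume. It assumes exactly one experiment per application per cycle and uses the minimum application counts (the "+" in each figure is ignored); both are simplifying assumptions, not numbers from the post.

```python
# Lower-bound application counts and cadences from the operational numbers.
TIERS = {
    "tier1": {"apps": 100, "cycles_per_year": 52},   # weekly
    "tier2": {"apps": 500, "cycles_per_year": 12},   # monthly
    "tier3": {"apps": 1000, "cycles_per_year": 4},   # quarterly
}
MINUTES_PER_EXPERIMENT = (2, 3)  # 2-3 minutes per experiment, per the summary


def annual_runs(tiers: dict, experiments_per_cycle: int = 1) -> int:
    """Total experiment runs per year across all tiers (assumed one per app per cycle)."""
    return sum(
        t["apps"] * t["cycles_per_year"] * experiments_per_cycle
        for t in tiers.values()
    )


total_runs = annual_runs(TIERS)
low_hours = total_runs * MINUTES_PER_EXPERIMENT[0] / 60
high_hours = total_runs * MINUTES_PER_EXPERIMENT[1] / 60
print(f"{total_runs} runs/year, roughly {low_hours:.0f} to {high_hours:.0f} experiment-hours")
```

Under these assumptions the three tiers contribute 5,200, 6,000, and 4,000 runs respectively, so the monthly tier, not the weekly one, accounts for the largest share of runs.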

Caveats

  • This is a prescriptive architecture post, not a production retrospective — no named customer or concrete incident data demonstrating MTTR improvements firsthand.
  • Cost awareness: the framework creates billable resources across ~8 AWS services; no cost estimates provided.
  • The claimed "50% MTTR reduction" and "58% cost savings" cite the 2024 IBM Security Services Benchmark Report, not AWS-specific data.
  • Bedrock AgentCore Starter Toolkit ships broad dev/test permissions that must be scoped down before production use.
