Architecting AI-powered resilience framework on AWS
Summary
This AWS Architecture Blog post presents a five-layer AI-powered resilience framework that automates the discovery of infrastructure dependencies, generates targeted chaos experiments, and integrates resilience testing into CI/CD pipelines. The framework uses AWS Resilience Hub as its orchestration center, with custom AI agents on Amazon Bedrock AgentCore performing dependency discovery and experiment design. The key innovation is removing the expertise barrier from chaos engineering: automated discovery completes in 2–4 hours where manual infrastructure mapping takes weeks, and AI-generated experiments eliminate the need for specialized chaos-engineering knowledge.
Key takeaways
- Five-layer architecture: Discovery → Test Generation → Experimentation → Gap Analysis → Continuous Validation. Each layer builds on the previous one, and feedback loops create a continuous improvement cycle (Source: Solution overview section).
- Automated dependency discovery completes in 2–4 hours for environments with thousands of resources, replacing weeks of manual infrastructure mapping. Subsequent runs process only the changes tracked by AWS Config (Source: Discovery layer section).
- Progressive scope expansion limits blast radius during chaos experiments: each experiment starts at 1% of resources and expands incrementally (1% → 5% → 10% → 25%) based on validation results and risk tolerance (Source: Experimentation layer section).
- Two-tiered CI/CD resilience gates: (a) lightweight policy-as-code checks (seconds, on every commit) catch configuration issues such as missing health checks; (b) full resilience assessments (2–3 minutes per experiment) run on significant architectural changes (Source: Continuous validation layer section).
- Stop conditions at a 10× margin below the SLA: if the SLA allows a 1% error rate, stop conditions trigger at 0.1%, leaving an ample safety margin for experiments (Source: Experimentation layer section).
- Business impact scoring prioritizes experiments by severity × likelihood × business impact, focusing on customer-facing systems and components with high-availability architectural patterns (Source: Test generation layer section).
- Tiered resilience policies at enterprise scale: Mission-critical (RTO < 15 min, RPO < 5 min, 99.99% availability, weekly experiments), Business-critical (RTO < 1 hr, RPO < 15 min, 99.9% availability, monthly experiments), Non-critical (longer windows, quarterly experiments) (Source: Enterprise deployment section).
- Hub-and-spoke multi-account model for enterprises: centralized resilience-testing infrastructure in a management account, distributed application ownership in spoke accounts, and AWS Organizations for cross-account experiment coordination (Source: Enterprise deployment section).
- Feedback loops close the learning gap: experiment results feed back into discovery (undocumented dependencies update the architecture map) and into test generation (passing scenarios are deprioritized in favor of emerging risks); SSM automation documents capture validated recovery procedures (Source: Continuous improvement section).
- Shift-left to IaC/code scanning: the post identifies the "next frontier" as scanning Infrastructure as Code and application code for resilience anti-patterns (missing circuit breakers, single-AZ dependencies) at the pull-request stage, before any resource is deployed (Source: Conclusion section).
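The stop-condition margin and the progressive scope stages from the takeaways above reduce to simple arithmetic. The following is a minimal sketch of that arithmetic only, not the post's implementation: in the framework itself the halting is handled by FIS and CloudWatch. The function names, the `run_stage` callback, and the "at least one resource" rule are hypothetical and introduced here for illustration.

```python
from typing import Callable, List

# Scope stages named in the post, as whole percentages of resources.
SCOPE_STAGES_PCT = (1, 5, 10, 25)

# The post places stop conditions at one tenth of what the SLA allows.
SAFETY_MARGIN = 10


def stop_threshold_pct(sla_error_pct: float) -> float:
    """Error rate (in percent) at which an experiment should halt."""
    return sla_error_pct / SAFETY_MARGIN


def targets_for_stage(total_resources: int, stage_pct: int) -> int:
    """Resources in scope for a stage; at least one (an assumption)."""
    return max(1, total_resources * stage_pct // 100)


def run_progressive(
    total_resources: int,
    sla_error_pct: float,
    run_stage: Callable[[int], float],
) -> List[int]:
    """Run stages in order and return the stage percentages that passed.

    run_stage is a hypothetical callback: it receives the number of
    targeted resources and returns the observed error rate in percent.
    Expansion stops at the first stage that reaches the threshold.
    """
    threshold = stop_threshold_pct(sla_error_pct)
    completed: List[int] = []
    for stage_pct in SCOPE_STAGES_PCT:
        observed = run_stage(targets_for_stage(total_resources, stage_pct))
        if observed >= threshold:
            break
        completed.append(stage_pct)
    return completed
```

With a 1% SLA error rate the threshold is 0.1%, so a stage that observes a 0.2% error rate ends the expansion even though the SLA itself has not been breached.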
Architecture
The five layers:
| Layer | Function | AWS Services |
|---|---|---|
| 1. Discovery | Map infrastructure + code-level dependencies | Resilience Hub, Bedrock AgentCore, Config |
| 2. Test Generation | AI-generated experiment templates with safety guardrails | Bedrock AgentCore, FIS, Step Functions |
| 3. Experimentation | Execute chaos tests with progressive scope + stop conditions | FIS, CloudWatch |
| 4. Gap Analysis | Correlate results with resilience policies, prioritize remediation | Resilience Hub |
| 5. Continuous Validation | CI/CD integration, drift detection, dashboards | CodePipeline, Config, QuickSight |
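The Test Generation layer ranks candidate experiments by severity × likelihood × business impact. A minimal sketch of that ranking follows; the 1–5 integer scales, the `Candidate` type, and the example scenario names are assumptions made for illustration, since this summary does not state how the post scales each factor.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """A candidate chaos experiment (hypothetical type for this sketch)."""

    name: str
    severity: int         # assumed scale: 1 (minor) to 5 (full outage)
    likelihood: int       # assumed scale: 1 (rare) to 5 (frequent)
    business_impact: int  # assumed scale: 1 (internal) to 5 (customer-facing)

    @property
    def score(self) -> int:
        # The product named in the post: severity x likelihood x business impact.
        return self.severity * self.likelihood * self.business_impact


def prioritize(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


if __name__ == "__main__":
    backlog = [
        Candidate("batch-report-az-loss", 3, 2, 1),
        Candidate("checkout-db-failover", 5, 3, 5),
        Candidate("api-gateway-latency", 4, 4, 4),
    ]
    for candidate in prioritize(backlog):
        print(candidate.score, candidate.name)
```

Because the factors multiply, a low value in any one of them pulls the whole score down, which is what pushes internal or unlikely scenarios to the bottom of the queue.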
Operational numbers
- Initial discovery: 2–4 hours (thousands of resources, single account)
- Policy-as-code check: seconds per pipeline run
- Full resilience assessment: 2–3 minutes per experiment
- Enterprise scale: 100+ Tier 1 applications (weekly experiments), 500+ Tier 2 (monthly), 1000+ Tier 3 (quarterly)
- Pilot implementation: 4–6 hours with 2–3 engineers
- Enterprise rollout timeline: 8–12 weeks
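The tiered policies behind these numbers can be expressed as data and checked against measured recovery figures, which is roughly what the Gap Analysis layer does. The sketch below is an illustration under assumptions: it treats Tier 1, 2, and 3 as the Mission-critical, Business-critical, and Non-critical tiers (the cadences match, but the mapping is inferred), and the `ResiliencePolicy` type and `find_gaps` function are hypothetical. The Non-critical targets are left unset because this summary gives only "longer windows".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ResiliencePolicy:
    """Targets for one tier; None means the summary states no number."""

    rto_minutes: Optional[int]
    rpo_minutes: Optional[int]
    availability_pct: Optional[float]
    experiment_cadence: str


POLICIES: Dict[str, ResiliencePolicy] = {
    "mission-critical": ResiliencePolicy(15, 5, 99.99, "weekly"),
    "business-critical": ResiliencePolicy(60, 15, 99.9, "monthly"),
    "non-critical": ResiliencePolicy(None, None, None, "quarterly"),
}


def find_gaps(tier: str, measured_rto_min: float, measured_rpo_min: float) -> List[str]:
    """List the objectives a measured recovery failed to meet.

    Targets are strict upper bounds (RTO < 15 min), so a measurement
    equal to the target counts as a gap. Unset targets are skipped.
    """
    policy = POLICIES[tier]
    gaps: List[str] = []
    if policy.rto_minutes is not None and measured_rto_min >= policy.rto_minutes:
        gaps.append(f"RTO {measured_rto_min} min is not under {policy.rto_minutes} min")
    if policy.rpo_minutes is not None and measured_rpo_min >= policy.rpo_minutes:
        gaps.append(f"RPO {measured_rpo_min} min is not under {policy.rpo_minutes} min")
    return gaps
```

For example, a mission-critical application that recovers in 12 minutes but loses 7 minutes of data meets its RTO and misses its RPO, so `find_gaps("mission-critical", 12, 7)` returns a single RPO entry.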
Caveats
- This is a prescriptive architecture post, not a production retrospective: it names no customer and offers no concrete incident data demonstrating MTTR improvements firsthand.
- Cost awareness: the framework creates billable resources across ~8 AWS services, and the post provides no cost estimates.
- The claimed "50% MTTR reduction" and "58% cost savings" cite the 2024 IBM Security Services Benchmark Report, not AWS-specific data.
- Bedrock AgentCore Starter Toolkit ships broad dev/test permissions that must be scoped down before production use.
Source
- Original: https://aws.amazon.com/blogs/architecture/architecting-ai-powered-resilience-framework-on-aws/
- Raw markdown: raw/aws/2026-06-22-architecting-ai-powered-resilience-framework-on-aws-edf37bb0.md
Related
- concepts/chaos-engineering
- concepts/blast-radius
- concepts/shift-left-validation
- concepts/dependency-discovery
- concepts/progressive-scope-expansion
- patterns/circuit-breaker
- patterns/canary-deployment
- patterns/progressive-fault-injection
- patterns/two-tiered-resilience-gate
- systems/aws-resilience-hub
- systems/aws-fault-injection-service
- systems/aws-bedrock-agentcore