AWS (Amazon Web Services)¶

The AWS blog family — the AWS News Blog, AWS Open Source Blog, AWS Architecture Blog, AWS Compute Blog, AWS Storage Blog, and others at aws.amazon.com/blogs/* — collectively form one of the canonical Tier-1 system-design sources. AWS blog posts vary widely in signal: at one end, substantive architecture retrospectives with quantified production numbers (Amazon Retail BDT's Spark-to-Ray migration is the canonical recent example); at the other, product PR / feature announcements filtered out per the AGENTS.md scope rules.

For the complementary (and often higher-signal) source, see companies/allthingsdistributed — Werner Vogels' blog republishes primary-source AWS / Amazon architecture content from CTO perspective.

Scope and what we ingest from AWS blogs¶

Ingest eagerly (Tier 1 treatment):

Production architecture retrospectives with concrete scaling numbers (e.g. BDT's exabyte-scale Ray migration).
Team postmortems or incident writeups with named systems.
AWS service-design posts that explain trade-offs, not just features (often cross-posted with companies/allthingsdistributed).
Open-source contribution narratives that expose internal design (DeltaCAT, Firecracker, Aurora DSQL, etc.).

Skip:

Service-GA announcements / feature launches without architectural depth (PR/FAQ posts belong on companies/allthingsdistributed if they have architectural content).
Industry / vertical marketing posts ("AI for X industry").
Pricing announcements, account-opening posts, region-launch announcements.
Customer-case-study puff pieces.
Conference-session recaps without architectural specifics.

Key systems (as surfaced in ingested sources)¶

Document intelligence / contract intelligence (2026-06-02 axis):

systems/doczy-ai — AArete's contract-intelligence SaaS solution on AWS; 2.5 M contracts / 50 M pages / 137 M Bedrock calls / 442 B tokens / 22 months / ~$330 M cumulative client savings / 99% extraction accuracy. Canonical instance of patterns/managed-ai-document-intelligence-pipeline-on-aws. (Source: sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws.)
systems/amazon-textract — managed OCR + metadata extraction service; the canonical OCR layer in AWS document-intelligence pipelines (Lambda → Textract → custom processing → Bedrock). First dedicated wiki disclosure 2026-06-02.

Compute / EC2 instance families:

systems/aws-ec2 — the umbrella virtual-machine compute service; substrate for everything below.
systems/aws-ec2-g7e — GPU-memory-optimised EC2 instance family with NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM); positioned as "a cost-efficient option to serve GPU-memory-intensive generative AI video models". Wiki-attested customer: companies/synthesia for in-house latent-diffusion video generation. (Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.)
systems/nvidia-rtx-pro-6000-blackwell — GPU under the G7e family.

Data platform / Amazon Retail BDT:

systems/ray — open-source distributed compute framework (Berkeley RISELab); specialist successor to Spark for Amazon BDT's exabyte-scale compactor.
systems/apache-spark — the generalist that Ray is displacing in Amazon BDT's specialist compactor workload; still more reliable (99.91% vs 99.15% first-time success in 2024).
systems/deltacat — Ray project for managed-Iceberg/Hudi/Delta compaction; Amazon BDT contributed The Flash Compactor.
systems/amazon-emr — AWS's managed Hadoop/Spark runtime; substrate for the original Amazon BDT compactor.
systems/aws-glue, systems/aws-glue-for-ray — serverless Spark / Ray runtimes on AWS.
systems/anyscale-platform — commercial Ray runtime.
systems/apache-iceberg, systems/apache-hudi, systems/delta-lake — the three canonical open table formats DeltaCAT targets.
systems/apache-parquet, systems/apache-arrow — the disk + in-memory columnar formats.
systems/amazon-redshift, systems/amazon-athena, systems/apache-hive — SQL engines in the post-Oracle BI stack over S3.
systems/amazon-ion — Amazon's richly-typed self-describing data format; the canonical schema wrapper for the 50+ PB Oracle→S3 migration.
systems/daft — Python+Rust DataFrame library with Ray integration; +24% cost-efficiency win on Amazon BDT's Ray compactor.

ML / computer-vision — SageMaker AI subsystems + adjacent:

systems/aws-sagemaker-ground-truth — managed labelling-job substrate; integrates with data-driven curation via Step Functions triggered by EventBridge.
systems/aws-sagemaker-pipelines — managed ML-workflow orchestration; canonical 7-step pipeline (checkpoint → prep+split → train → drift baseline → evaluate → package → register); model- approval EventBridge event triggers Lambda model-promotion code review cleanly decoupling science + application updates.
systems/aws-sagemaker-endpoint — managed model-serving; the Serverless → Serverful pivot at production scale is the canonical wiki lesson (Serverless lacks GPU + has 6 GB memory cap → OOMs → ml.g6 Serverful endpoints with auto-scaling).
systems/aws-sagemaker-batch-transform — batch-inference runtime; canonical substrate for GLIGEN-based synthetic-data generation at 75K-image-per-use-case scale.
systems/aws-sagemaker-hyperpod — SageMaker's large-scale distributed-training/inference compute substrate with EKS orchestration; host cluster for the Inference Operator and the 2025 observability + training-operator capabilities.
systems/sagemaker-hyperpod-inference-operator — Kubernetes controller reconciling InferenceEndpointConfig + JumpStartModel CRDs on a HyperPod EKS cluster; ships as a native EKS add-on as of 2026-04-06 (previously Helm-packaged). Canonical wiki instance of multi-instance-type fallback via node-affinity, managed tiered KV cache, and prefix-aware / KV-aware intelligent routing as platform-level LLM-serving primitives.
systems/amazon-rekognition — managed CV API; face-detection step in PII anonymisation pipelines under patterns/multi-account-isolation.
systems/gligen — grounded diffusion model producing photorealistic images with ground-truth bounding boxes embedded by construction; on AWS runs on SageMaker Batch Transform.
systems/yolo — single-stage real-time object detector family; default CV serving model in PPE + Housekeeping detection pipelines on SageMaker endpoints.

Web application / analytics / BI:

systems/amazon-cloudfront — global CDN distributing web-app static assets.
systems/aws-appsync — managed GraphQL API; Lambda resolvers for CRUD + embedded analytics integration.
systems/aws-waf — managed web-application firewall; common OWASP-class protection.
systems/amazon-quicksight — BI dashboards; embedded in customer applications via AppSync + Lambda resolvers.
systems/amazon-redshift-spectrum — S3-external-table SQL query layer; powers BI dashboards over S3-resident risk / event data without ETL movement.

Relational DB (beyond the Aurora sovereignty/consistency lineage below):

systems/amazon-aurora — cloud-native Postgres / MySQL- compatible relational engine; the parent line of Aurora DSQL + Aurora Limitless. Common application-state backbone for AWS customer architectures; downstream of ML-inference gatekeeping in CV-safety pipelines.

Compute / storage / integration primitives:

systems/aws-ec2 — canonical substrate for Ray and most non-serverless AWS compute.
systems/aws-s3 — foundational object storage; also the storage half of every compute-storage-separated AWS analytics stack.
systems/s3-vectors — S3's vector similarity-search primitive (preview 2025-07-16); storage-tier cost for embeddings, integrates with Bedrock Knowledge Bases and exports to OpenSearch for hot queries.
systems/amazon-bedrock-knowledge-bases — managed RAG service; pluggable vector stores (S3 Vectors, OpenSearch, Aurora, Pinecone, etc.).
systems/amazon-titan-embeddings — AWS first-party text embedding models on Bedrock (titan-embed-text-v2).
systems/amazon-opensearch-service — managed OpenSearch; k-NN is the hot counterpart to S3 Vectors in cold→hot vector tiering.
systems/amazon-sagemaker-unified-studio — unified data+AI dev environment; Knowledge Bases with S3 Vectors selectable as vector store.
systems/dynamodb — durable state store in the Amazon BDT Ray job-management substrate.
systems/aws-parameter-store — SSM's hierarchical-KV config- store subsystem; EventBridge-emitting change events make it the canonical source-of-truth side of patterns/event-driven-config-refresh (paired with DynamoDB on the high-frequency side of patterns/tagged-storage-routing).
systems/aws-sns, systems/aws-sqs — the pub/sub and queue primitives in the same substrate.

Other AWS / Amazon systems referenced across sources:

Most AWS service lineage lives on companies/allthingsdistributed — S3, EBS, Nitro, Lambda, Firecracker, Aurora DSQL, SageMaker, Bedrock Guardrails, Kiro. Cross-reference there.

Relational databases / Postgres family:

systems/aws-rds — managed relational (MySQL / Postgres / MariaDB / SQL Server / Oracle); Multi-AZ cluster Postgres inherits community Postgres's Long Fork visibility anomaly (Jepsen 2025-04-29; AWS response 2025-05-03).
systems/postgresql — the upstream substrate; visibility model (ProcArray scan, asynchronous with WAL commit) is the root cause. AWS's PostgreSQL Contributors Team (formed 2022) is co-developing the proposed CSN upstream fix.
systems/aurora-dsql — ground-up distributed SQL; replaces ProcArray visibility with time-based MVCC, sidestepping Long Fork. Wire-compatible Postgres via public extension API.
systems/aurora-limitless — horizontally-scaled Aurora Postgres; also replaces ProcArray with time-based MVCC.

Service mesh / container networking:

systems/aws-app-mesh — AWS's first-gen Envoy-based sidecar service mesh for ECS/EKS/Fargate. Discontinued 2026-09-30, closed to new customers 2024-09-24. Four-tier abstraction (Mesh / Virtual Service / Virtual Router / Virtual Node) + customer- managed Envoy sidecar per Task.
systems/aws-ecs-service-connect — current managed replacement for ECS. Flat Client/Server role model, AWS-managed Service Connect Proxy (Envoy under the hood), free CloudWatch app-level metrics. Not yet mTLS-capable (2025-01-18).
systems/aws-vpc-lattice — current replacement for EKS. Not a sidecar mesh — VPC-level service-networking managed control + data plane across EKS / EC2 / Lambda / on-prem.
systems/amazon-ecs — compute substrate under both meshes; Service ↔ Task ↔ Task Definition abstraction; exclusive mesh-membership constraint is load-bearing for migration.
systems/aws-cloud-map — shared service-discovery substrate. Cross-account namespace sharing is not supported, forcing single-account deployments in Service Connect.
systems/aws-private-ca — TLS certificate authority under both meshes; App Mesh uses general-purpose certs, Service Connect uses short-lived certs (cheaper).
systems/amazon-route53 — DNS weighted-routing primitive for blue/green mesh migration edge traffic shifting.

Partitions / cross-partition sovereign-failover architecture:

systems/aws-european-sovereign-cloud — the 2026 EU-resident partition; mandatory separate Organization (cannot be paired into a commercial Organization the way GovCloud can be).
systems/aws-govcloud — 2011 US public-sector partition (FedRAMP / ITAR); the cross-partition-architecture precedent for European Sovereign Cloud; GovCloud accounts can optionally be invited into a commercial Organization.
systems/aws-iam — per-partition identity; credentials don't cross partitions; the load-bearing primitive forcing explicit cross-partition auth design.
systems/aws-sts — per-partition regional endpoints; one of the five named cross-partition auth tactics.
systems/aws-organizations — per-partition topology; European-Sovereign-Cloud-strict / GovCloud-optional separation asymmetry.
systems/aws-control-tower — commercial-partition governance; cannot manage GovCloud or European Sovereign Cloud accounts.
systems/aws-direct-connect — dedicated-line cross-partition connectivity; PoP-to-PoP partner connections as the regulated- workload shape.
systems/aws-transit-gateway — per-partition; inter-region peering does not function across partitions.
systems/aws-privatelink — prescribed secure cross-partition communication primitive on top of the network layer.
systems/aws-config, systems/aws-security-hub — per- partition monitoring / posture substrates.
systems/aws-secrets-manager — IAM-user credential storage in the cross-partition auth IAM-user fallback pattern.

PKI:

systems/aws-private-ca — per-partition CA; cross-partition mTLS requires double-signed (cross-signed) root CAs.

Disaster recovery / resilience (within a partition, cross-Region + cross-account):

systems/aws-backup — unified data-protection control plane tying together per-service backup mechanisms (RDS / EBS / S3 / Aurora etc.) behind vaults + policies + schedules; added first- party backup coverage for services that lacked it (EFS, FSx) and cross-Region backup for DynamoDB; canonical backup-and-restore tier primitive on the DR ladder.
systems/aws-backup-logically-air-gapped-vault — the cyber-resilience-specific vault type within AWS Backup; always-Compliance-mode service-enforced deletion protection ("recovery points can't be deleted by any principal, including the account root user or a compromised administrator, within the retention period"); recovery points stored in AWS service-owned accounts (the vault object in the customer's Recovery Account is just the "governance and access boundary"); direct backup target for S3/DynamoDB/EFS, intelligent-orchestration for EBS/Aurora/FSx; AWS RAM cross-account sharing into the IRE; MPA-gated restore authorization; primary AWS-native option for protecting backup storage from deletion in the cyber-resilience reference architecture.
systems/aws-backup-restore-testing — automated recoverability validation with the PutRestoreValidationResult API as the customer integration hook for workload-specific validation; Layer 1 of the multi-layer validation pipeline; routine drills + incident-time per- candidate validation.
systems/aws-multi-party-approval — service primitive for predefined-approver authorization of high-stakes actions; configured via IAM Identity Center for restore authorization on the air-gapped vault; approval action automatically recorded as a CloudTrail management event; "particularly valuable when the source account might no longer be trusted."
systems/amazon-s3-object-lock — S3-layer service-enforced deletion protection; Compliance mode + S3 Versioning provides equivalent guarantee to the air-gapped vault for S3 data outside the vault's supported resource set; the cyber-resilience fallback for S3 workloads.
systems/amazon-guardduty-malware-protection — backup + volume malware scanning; Layer 2 of the multi-layer validation pipeline; complements workload-specific consistency checks for long-dwelling adversaries that don't generate malware signatures.
systems/aws-cloudtrail — control-plane API audit log; primary investigation-timeline source for cyber-event recovery (Stage 1); also the audit substrate for MPA approval events.
systems/amazon-vpc-flow-logs — data-plane network audit log; complementary investigation-timeline source for lateral movement / exfiltration / C2 detection.
systems/aws-security-incident-response (SIR) — coordinated triage and response support; Stage 1 expert-assistance partner.
systems/aws-iam-access-analyzer + systems/aws-config — cross-account dependency mappers for Stage 5 cutover (IAM trust policies, resource-based policies, KMS key grants, service integrations that point at the old Production Account ID).
systems/aws-resource-access-manager-ram — cross-account resource sharing primitive; how the Recovery Account makes air-gapped vault recovery points available to the IRE.
systems/aws-iam-identity-center — workforce identity centre; MPA-approver substrate + session-revocation primitive in the Rebuild-Restore-Rotate framework's rotate leg.
systems/aws-certificate-manager — TLS/SSL certificate issuance + renewal; certificate re-issuance primitive in the Rebuild-Restore-Rotate rotate leg.
systems/aws-elastic-disaster-recovery (AWS DRS) — continuous block-level replication, recovery orchestration, automated server conversion; seconds RPO, 5–20 min RTO typical; target VPC configuration on recovery; the canonical pilot-light / warm-standby enabling primitive.
systems/arpio — AWS Resilience Competency Partner SaaS; full- workload discovery + backup + cross-Region cross-account recovery on top of AWS Backup + AWS DRS + service-native primitives;

140 AWS resource types covered; the named DR config-translation layer via Route 53 private-hosted-zone CNAMEs.

Event-driven architecture / org-scale pub/sub:

systems/amazon-eventbridge — managed serverless event bus; content-based routing rules + schema registry / discovery + cross-account targets via resource policies; the canonical AWS substrate for event-driven architecture at organisation scale. Load-bearing gap vs a strict- validation requirement: no native schema validation.
systems/amazon-key — physical-access-management product family (In-Garage Delivery, apartment-building access); production instance of patterns/single-bus-multi-account on EventBridge plus a custom schema repository + client library + CDK subscriber constructs library. Reported 2,000 events/s / 99.99% success / 80ms p90 / 14M subscriber calls post-migration; integration time for new use cases 5d → 1d.
systems/aws-cdk — IaC substrate for the reusable subscriber constructs pattern — per-subscriber event bus + cross-account IAM + monitoring + alerting packaged behind a ~5-line new Subscription(...) construct.

Multi-account SaaS platform (account-per-tenant):

systems/aws-stacksets — AWS's fan-out deployment primitive: one CloudFormation template, many target accounts/Regions from a central admin account. Load-bearing for account-per-tenant CI/CD at ProGlove's ~6,000-account scale. Named failure modes: partial rollouts, pipeline duration, tooling maturity edge cases.
systems/aws-codepipeline — central orchestration point for fan-out deployment; single execution triggers a single StackSet update that fans out in parallel.
systems/aws-cloudformation — the underlying declarative IaC engine under both StackSets and CDK.
systems/aws-step-functions — account-creation orchestrator in ProGlove's lifecycle; account-retirement deliberately kept as scripts (architectural asymmetry is the signal).
systems/aws-cost-explorer — transparent per-tenant cost attribution by virtue of the account boundary being the billing boundary; key benefit of account-per-tenant for consumption- priced SaaS.
systems/aws-observability-access-manager — AWS-native cross-account CloudWatch observability primitive; ProGlove built its own third-party aggregation before OAM shipped, now the recommended starting point for new platforms.
systems/proglove-insight — ProGlove's SaaS platform; the canonical wiki production reference for concepts/account-per-tenant-isolation on AWS (~6,000 tenant accounts, 3-person platform team, ~1M Lambda functions).

Internal developer platform / platform engineering on EKS:

systems/santander-catalyst — Santander's in-house IDP on AWS EKS — canonical wiki production reference for platform engineering at large-enterprise regulated-industry scale (160M+ customers, 200+ critical systems, billions of daily transactions). Co-built with AWS ProServe via the Platform Strategy Program (PSP). Provisioning cycle 90 days → hours / minutes; PoC prep 90 days → 1 hour; 100+ pipelines consolidated; GenAI agent stack 105 days → 24 hours; ~3,000 monthly data-experimentation tickets eliminated.
systems/crossplane — CNCF universal resource provisioner; every cloud / SaaS resource modeled as a K8s CR reconciled by a controller; XRDs + Compositions as the composability primitive. Catalyst's stacks catalog.
systems/argocd — CNCF GitOps continuous-delivery controller for Kubernetes; Git as the source of truth; continuous-reconcile loop. Catalyst's data-plane claims component.
systems/open-policy-agent — CNCF policy engine (Rego) + Gatekeeper K8s admission controller; enforces compliance + security at admission time. Catalyst's policies catalog; the regulated-bank analogue of SCPs in ProGlove.
systems/aws-eks — also serves as the infrastructure control plane cluster hosting Crossplane + ArgoCD + OPA; a fundamentally different role from app-compute EKS (Figma, Convera).
systems/databricks — named integration target in Catalyst's modern data platform workload (built-in integration).

AI-for-ops / AI-powered incident response:

systems/aws-devops-agent — AWS's fully managed autonomous AI agent for EKS incident investigation and preventive recommendations. Built on Amazon Bedrock; accessed through a purpose-built web UI behind an Agent Space (tenant configuration unit — IAM + IdP + data-source endpoints + scope). AWS vendor peer to Datadog's Bits AI SRE on the same category axis (hosted agent for live-telemetry incident investigation), with a different vendor relationship (AWS managed service scoped to AWS cloud resources). Canonical wiki reference for telemetry-based Kubernetes resource discovery — agent combines a Kubernetes API scan (graph nodes) with OpenTelemetry-derived runtime relationships (graph edges) into a fused dependency graph used for root-cause analysis.
systems/strands-agents-sdk — AWS's open-source Python SDK for agentic systems (multi-agent orchestration, MCP tool calling, session management); used in the self-build alternative to the DevOps Agent — the Strands variant of the 2025-12-11 conversational-observability blueprint — hosting three specialized agents (Orchestrator / Memory / K8s Specialist).
systems/eks-mcp-server — AWS-Labs-published MCP server exposing Kubernetes / EKS operations as standardized MCP tools; the agent-native interface to a cluster in the Strands variant of the 2025-12-11 blueprint.
systems/fluent-bit — CNCF telemetry forwarder running as a cluster DaemonSet; ingest tier of the telemetry-to-RAG pipeline in the RAG variant of the 2025-12-11 blueprint (Fluent Bit → Kinesis → Lambda + Bedrock embeddings → OpenSearch Serverless).
systems/amazon-kinesis-data-streams — AWS's managed durable streaming substrate; ingest-buffer tier of the same telemetry-to-RAG pipeline. Enables Lambda batching as the primary cost lever at the embedding-generation layer.
systems/amazon-bedrock — managed foundation-model runtime underlying the DevOps Agent.
systems/amazon-managed-prometheus — metrics data source (one of four canonical Agent-Space data sources).
systems/aws-x-ray — traces data source (one of four canonical Agent-Space data sources).

Containers — EKS + Auto Mode + peer AWS services:

systems/eks-auto-mode — managed-data-plane variant of EKS; AWS operates Bottlerocket nodes, default add-ons, cluster upgrades; customer retains node-pool policy + disruption-budget- guarded upgrade contract. Canonical Kubernetes-layer instance of concepts/managed-data-plane.
systems/bottlerocket — container-optimised Linux distro; default AMI under EKS Auto Mode; immutable root + A/B transactional updates.
systems/amazon-guardduty — managed threat-detection with EKS protection + runtime monitoring + CloudTrail + malware detection → MITRE ATT&CK-annotated multistage attack findings.
systems/amazon-inspector — managed vulnerability scanner; ECR-image-to-running-container mapping enables runtime vulnerability prioritisation by actual production exposure.
systems/aws-network-firewall — managed stateful firewall; SNI-based egress allow-listing at per-VPC scale is the canonical concepts/egress-sni-filtering pattern; 2025-11-26 EVS post surfaces the centralised-inspection shape (native TGW attachment, Appliance Mode auto-enabled, Domain-list FQDN rule groups) for hub-and-spoke deployment across many VPCs + on-prem via DXGW.
systems/amazon-evs — managed VMware Cloud Foundation (VCF) stack running on EC2 bare-metal inside a customer VPC; target for lift-and-shift VMware migrations; NSX overlay + vSAN + vMotion all integrated with AWS-native networking.
systems/aws-vpc-route-server — BGP-speaking VPC primitive; bridges overlay networks (NSX inside EVS) to AWS-native VPC route tables so TGW / Network Firewall can route to overlay CIDRs.
systems/external-secrets-operator — CNCF K8s operator that syncs from Secrets Manager to native K8s Secret objects (env-var consumption path; no volume mounts or daemonsets).
systems/amazon-managed-grafana — managed Grafana; Generali uses with CloudWatch data source for per-namespace tenant dashboards.
systems/generali-malaysia-eks — Generali Malaysia's EKS platform as a synthesized case study (Malaysian insurance customer): six peer-AWS-service integration surface + stateless- only + immutable pods + Helm + HPA discipline.
systems/karpenter — CNCF open-source Kubernetes node autoscaler, AWS-originated; canonical wiki production reference is Salesforce's 1,000-cluster / 1,180-node-pool migration (2026-01-12). Solves multi-minute scaling latency, subnet-pinned provisioning, poor AZ balance, and rigid node-group boundaries of the predecessor CA / ASG stack.
systems/cluster-autoscaler — CNCF predecessor autoscaler that Karpenter is displacing on AWS; indirection through ASGs produces minutes-scale latency, thousands of rigid node groups, poor AZ balance.
systems/aws-auto-scaling-groups — AWS EC2 capacity primitive underneath Cluster Autoscaler; Karpenter bypasses.
systems/salesforce — customer with the largest known EKS fleet (1,000+ clusters / 1,180+ node pools); canonical wiki Karpenter-at-extreme-scale production reference.

Key patterns / concepts introduced via AWS blog sources¶

Computer vision + GenAI at scale:

patterns/serverless-driver-worker — canonical instance in the AWS safety-monitoring solution; driver orchestrates, per-use-case workers scale + fail independently; each worker chain is SNS → SQS → SageMaker endpoint with its own DLQ. Inference acts as gatekeeper filtering image volume so Aurora isn't overwhelmed.
patterns/multilayered-alarm-validation — four-stage composition (object detection → zone overlap → loiter-time persistence → confidence + RLE-mask validation) that turns per-frame detections into auditable alarms.
patterns/alarm-aggregation-per-entity — per-(entity, use-case) rollup; append new occurrences to open records; scheduled auto-close on resolution; SLA escalation through per-zone preferred channels.
patterns/data-driven-annotation-curation — Athena-driven FP- rate aggregation + below-threshold-confidence sampling + Claude multi-modal analysis of misclassified samples for class imbalance; replaces blanket per-site daily annotation.
patterns/synthetic-data-generation — GLIGEN + SageMaker Batch Transform producing auto-annotated training data at 75K-image scale per use case; YOLOv8 hits 99.5% mAP@50 for PPE without any manually-annotated real images.
patterns/multi-account-isolation — workload-purpose-axis separation (training / ingest / web-app / analytics each in distinct AWS accounts); distinct from [[concepts/account-per- tenant-isolation]] which is tenant-axis. PII containment + blast-radius + compliance alignment.
concepts/alert-fatigue — named failure mode the alarm- aggregation + multilayered-validation stack is designed around.
concepts/copy-on-write-merge — the compaction strategy that Amazon BDT ran at exabyte scale in-house before the open table formats canonicalised the name.
concepts/change-data-capture — the upstream workload shape driving all of this.
concepts/task-and-actor-model — Ray's programming model, the specialist-enabling lower layer vs Spark's dataflow abstraction.
concepts/locality-aware-scheduling, concepts/zero-copy-sharing, concepts/memory-aware-scheduling — the Ray-mechanism concepts that make specialist hand-crafted distributed algorithms beat generalists on specialist workloads.
concepts/managed-data-plane — the operational-ownership-on- the-data-plane primitive that distinguishes Service Connect / VPC Lattice from App Mesh; canonical AWS instance of the control-plane-vs-data-plane orthogonal axis.
concepts/mutual-tls — notable feature gap in Service Connect vs App Mesh at EOL-transition time; blocks regulated workloads from simple lift-and-shift.
patterns/managed-sidecar — AWS-managed Service Connect Proxy vs customer-managed App Mesh Envoy sidecar; narrowed configurability (timeouts only) for full vendor-operated lifecycle.
patterns/blue-green-service-mesh-migration — forced pattern for App Mesh → Service Connect because an ECS Service can't be in both meshes; edge traffic shifting via Route 53 / CloudFront continuous deployment / ALB multi-target-group.
patterns/shadow-migration — the canonical dual-run reconciliation pattern, instantiated across Amazon BDT's multi-year Spark → Ray migration.
patterns/subscriber-switchover — the per-consumer cutover pattern that earns rollback granularity after shadow migration.
patterns/heterogeneous-cluster-provisioning — Amazon BDT's EC2 capacity pattern: discover an instance-type set, provision whichever is most available, keep workloads arch/hardware-agnostic.
patterns/reference-based-copy-optimization — the "don't rewrite files the compaction didn't touch" optimisation that is a named contributor to Amazon BDT's 82% cost-efficiency gain.

Multi-tenant configuration services (tagged-storage pattern):

patterns/tagged-storage-routing — Strategy-Pattern factory dispatches storage requests to the best-fit backend based on the request key's prefix; adding a new backend is one new class
one map entry; canonical AWS pair is DynamoDB (high-frequency per-tenant) + Parameter Store (shared hierarchical).
patterns/event-driven-config-refresh — EventBridge + Lambda + Cloud Map + gRPC pipeline pushes config updates into live service instances' in-memory caches within seconds without polling or restart; escape valve from the TTL-vs-staleness dilemma for shared-config workloads.
patterns/jwt-tenant-claim-extraction — tenant context sourced exclusively from the validated Cognito JWT's immutable custom:tenantId claim; tenantId in request bodies / paths / headers is never read; cross-tenant access via body manipulation structurally impossible.
concepts/cache-ttl-staleness-dilemma — the forcing function the tagged-storage + event-driven-refresh composite resolves; TTL-based caches for rapidly-changing tenant metadata force an unacceptable stale-vs-amplified-load trade-off at multi-tenant scale.

Postgres consistency-model work:

concepts/snapshot-isolation — the formal model Postgres's clustered implementation does not guarantee (surfaced by Jepsen 2025-04-29, acknowledged by AWS 2025-05-03).
concepts/long-fork-anomaly — the specific SI violation Postgres exhibits; concurrent non-conflicting transactions observed in different orders by primary + replica.
concepts/visibility-order-vs-commit-order — the mechanism: Postgres's commit path writes the WAL record, then asynchronously removes the xid from ProcArray.
concepts/commit-sequence-number — the proposed upstream fix; multi-patch effort, PGConf.EU 2024 talk, AWS PostgreSQL Contributors Team participating.

AI trust / automated-reasoning productization:

systems/bedrock-guardrails-automated-reasoning-checks — Bedrock safeguard that formally verifies LLM outputs against a customer-authored policy; preview-launched 2024-12-04 in US West (Oregon).
concepts/autoformalization — natural-language → formal-spec translation pipeline; first public disclosure in the 2024-12-04 preview-launch post (document → concepts → units → logic → logical model); variable descriptions as the load-bearing accuracy-tuning surface.
patterns/post-inference-verification — the canonical pattern Bedrock Guardrails AR checks productizes; three-verdict output (Valid / Invalid / Mixed) with structured suggestions; regenerate-with-feedback loop feeds the reasoner's natural-language rule descriptions back to the LLM as corrective prompts.

Digital sovereignty / cross-partition failover architecture:

concepts/aws-partition — logically isolated group of AWS Regions with its own IAM; hard boundary for credentials, cross-region primitives, and service availability. The central primitive in sovereign-failover design.
concepts/digital-sovereignty — demand-side framing: "managing digital dependencies — deciding how data, technologies, and infrastructure are used, and reducing the risk of loss of access, control, or connectivity." The human-driven-disaster class that pushes you across the partition boundary.
concepts/disaster-recovery-tiers — backup / pilot light / warm standby / active-active canonical AWS DR ladder; same ladder applied across the partition axis, with pilot-light the cross-partition default.
concepts/cross-partition-authentication — because IAM credentials don't cross, auth is explicit: IAM roles with trust + external IDs, STS regional endpoints, resource-based policies, cross-account roles via Organizations, federation from a centralized IdP (best practice).
concepts/cross-signed-certificate-trust — "double-signed certificates" — per-partition root CAs cross-sign each other to enable authenticated cross-partition mTLS without violating partition isolation.
patterns/cross-partition-failover — the overarching pattern: duplicate infrastructure across partitions + one of the DR tiers + per-partition IAM / PKI / Organizations / networking.
patterns/pilot-light-deployment, patterns/warm-standby-deployment — two specific DR tiers endorsed for cross-partition.
patterns/centralized-identity-federation — federate from a single IdP to all partitions; modern best practice for cross-partition auth; avoids per-partition IAM users.

Disaster recovery / resilience (within-partition):

concepts/rpo-rto — the two DR budget dimensions; AWS DRS quantified at seconds RPO / 5–20 min RTO, AWS Backup at hours RPO / RTO; tier choice derived from the business-set RPO/RTO targets.
concepts/crash-consistent-replication — block-level replica equivalent to a crash+reboot of the source; strictly weaker than app-consistent but achievable continuously — the consistency model AWS DRS uses for its seconds-RPO guarantee.
concepts/cross-region-backup — fault-isolation axis (natural/technical disasters); the baseline multi-Region backup-copy primitive unified under AWS Backup.
concepts/cross-account-backup — compromise-isolation axis (ransomware / malware / malicious insider); AWS Backup cross-account copy is the unified primitive; clean-room recovery account is the canonical target.
concepts/clean-room-recovery-account — separate AWS account with distinct credentials as a ransomware/malware isolation boundary; sibling use of the AWS account boundary alongside concepts/account-per-tenant-isolation.
concepts/dr-config-translation — restored resources have new identifiers (endpoints, ARNs); canonical mechanism is Route 53 private-hosted-zone CNAME indirection so applications keep resolving the old name to the new endpoint without config rewrites.
patterns/block-level-continuous-replication — the continuous seconds-scale replication pattern AWS DRS implements; enables pilot-light + warm-standby tiers at seconds RPO.
patterns/backup-and-restore-tier — the lowest DR tier on the ladder; AWS Backup + EventBridge + Lambda automation; hours-scale RPO/RTO, near-zero steady-state cost.

Event-driven architecture / schema governance:

concepts/event-driven-architecture — architectural style where services communicate via asynchronous events on a shared bus; supersedes ad-hoc SNS / SQS pairs at org scale. The canonical AWS substrate is EventBridge.
concepts/service-coupling — framing for the failure mode EDA addresses: tight-coupling cascade deadlocks. Amazon Key pre-migration exhibited exactly this — Service-A issues → timeouts + retries amplifying load → cross-service deadlock; single-device-vendor issues causing fleet-wide degradation.
concepts/schema-registry — versioned contract store for event definitions; single source of truth for publishers and subscribers. EventBridge has a schema registry but no native validation; strict-validation customers build on top.
patterns/single-bus-multi-account — one shared event bus in a central account + per-service-team accounts; DevOps owns bus + rules + targets, service teams own application stacks; logical separation via rules, not buses. AWS reference pattern.
patterns/client-side-schema-validation — validate events in a shared client library rather than a centralized validation service; immediate developer feedback + no runtime network hop; addresses EventBridge's missing native-validation gap.
patterns/reusable-subscriber-constructs — package subscriber infra as a versioned IaC construct library (CDK) — dedicated event bus + cross-account IAM + monitoring + alerting from ~5 lines. Amazon Key reports publisher/subscriber integration time 40h → 8h.

Fine-grained application authorization:

systems/amazon-verified-permissions — managed Cedar policy engine for application authorization; the application-authz counterpart to IAM. IsAuthorized synchronous evaluation at "millisecond-level"; submillisecond end-to-end when fronted by API Gateway's authorizer-decision cache. Per-tenant policy stores are the idiomatic SaaS isolation primitive.
systems/cedar — the policy language, public extraction of AWS's decade of internal policy-semantics work (see systems/aws-policy-interpreter). Analyzable by design. Combines RBAC + ABAC + ReBAC in one language.
systems/amazon-cognito — identity substrate paired with AVP across Convera's four authorization flows; user pool for customers, machine-to-machine user pool for service-to-service, per-tenant pool for multi-tenant. Pre-token-generation Lambda hook enriches JWTs at issue time.
systems/amazon-api-gateway — ingress tier hosting the patterns/lambda-authorizer; built-in authorizer-decision cache delivers submillisecond repeat-request latency.
systems/okta — external enterprise IdP; federated-to by Cognito in Convera's internal-user flow (patterns/centralized-identity-federation).

Fine-grained application authorization — concepts / patterns:

concepts/fine-grained-authorization — per-resource, per-action, context-aware authorization (vs coarse-grained role-to-endpoint); the evaluation model Cedar + AVP deliver.
concepts/attribute-based-access-control — ABAC as the idiomatic fine-grained authz realization; Cedar combines ABAC with RBAC and ReBAC in one language.
concepts/policy-as-data — Cedar policies in a DynamoDB source of truth + DynamoDB Streams continuously sync into AVP policy stores; authorship gated by a regulated IAM role owned by infosec.
concepts/tenant-isolation — five-layer enforcement chain for Convera's multi-tenant SaaS (identity → token → authorization → routing → data); a bug in any one layer can't leak across tenants.
concepts/zero-trust-authorization — every tier that handles a privileged request independently re-verifies; production instance in Convera's backend pods that re-call AVP before hitting RDS.
concepts/authorization-decision-caching — two-level cache (API Gateway authorizer-decision + app-level Cognito token) delivers submillisecond repeat-request latency.
concepts/token-enrichment — push per-user attribute lookup off the hot path by injecting attributes into the JWT at issue time (via the pre-token hook).
patterns/lambda-authorizer — Lambda in front of API Gateway evaluating Cedar via AVP; the hot-path authz compute across all four Convera flows.
patterns/per-tenant-policy-store — AVP idiom for multi-tenant SaaS: one policy store per tenant, tenant_id → policy-store-id lookup from DynamoDB. Chosen for isolation, per-tenant schema/template customization, easy onboarding/offboarding, and per-tenant resource quotas.
patterns/pre-token-generation-hook — Cognito Lambda trigger that enriches the JWT with authorization-relevant attributes from RDS / DynamoDB at login time.
patterns/zero-trust-re-verification — backend re-runs AVP against the tenant's policy store before data access; data layer (RDS) is further configured to accept only tenant-scoped requests.
patterns/machine-to-machine-authz — same Lambda-authorizer shape reused for service-to-service via Cognito's OAuth client-credentials flow; per-service policy stores.

Internal developer platform / platform engineering at enterprise scale:

patterns/platform-engineering-investment — second canonical production instance on the AWS blog (after ProGlove) via Santander Catalyst; large-enterprise regulated-industry counterpart to ProGlove's small-team SaaS instance. Kubernetes- native substrate on EKS instead of AWS-Organizations-native.
patterns/developer-portal-as-interface — Santander's in-house developer portal as the unified self-service surface hiding EKS / Crossplane / ArgoCD / OPA behind one interface; "Platform APIs become the internal product" in concrete form.
patterns/crossplane-composition — XRDs + Compositions as the unit of reuse for the stacks catalog; Kubernetes-native realization of patterns/golden-path-with-escapes at multi-cloud-infrastructure level.
patterns/policy-gate-on-provisioning — OPA Gatekeeper as a K8s admission controller enforcing compliance + security on every Crossplane claim at manifest-submission time; shift-left compliance; the regulated-industry counterpart to SCPs in ProGlove's AWS-Organizations-based shape.
concepts/universal-resource-provisioning — Crossplane's abstraction: every cloud / SaaS resource as a K8s CR reconciled by a controller; uniform API + RBAC + GitOps across clouds.
concepts/gitops — Git as declarative source of truth, continuous-reconcile controllers; ArgoCD the canonical K8s-native realization; Catalyst's application-delivery contract.
concepts/control-plane-data-plane-separation — Catalyst's first wiki instance of the split at infrastructure- provisioning tier: EKS cluster decides, provisioned AWS (and multi-cloud) resources are the data plane.

AI-for-ops / AI-powered incident response:

concepts/telemetry-based-resource-discovery — AWS DevOps Agent's core methodology: combine a Kubernetes API scan (the graph nodes: Pods / Deployments / Services / ConfigMaps / Ingress / NetworkPolicies with their metadata) with OpenTelemetry-derived runtime relationships (the weighted edges: service-mesh traffic, distributed traces, metric attribution) into a fused dependency graph used for investigation. Neither path alone is sufficient — the static API gives you the graph, telemetry tells you which edges are alive and misbehaving.
concepts/agentic-troubleshooting-loop — iterative LLM ↔ tool-assistant investigation cycle; LLM proposes diagnostic queries, tool assistant executes them against live system state, output re-enters LLM context, repeats until the LLM judges enough context for resolution. Canonical wiki reference is the 2025-12-11 AWS conversational-observability blueprint; the 2026-03-18 AWS DevOps Agent post is the managed-service realization of the same primitive with a structured discovery step added on top.
patterns/telemetry-to-rag-pipeline — streaming operational telemetry into a vector store for LLM augmentation; canonical AWS shape is Fluent Bit → Kinesis Data Streams → Lambda (batched) + Bedrock Titan Embeddings v2 → OpenSearch Serverless (hot) or S3 Vectors (cold). Sanitize-before- embedding is named as the vector-store governance boundary.
patterns/allowlisted-read-only-agent-actions — constrain an LLM-driven agent's side effects to a static allowlist of read-only verbs (kubectl get / describe / logs / events) + platform-layer RBAC. Canonical AWS realization is the in- cluster troubleshooting assistant pod in the 2025-12-11 blueprint; defense-in-depth via two-layer enforcement (app allowlist + K8s RBAC).

Agentic AI development (developer-side feedback loops):

concepts/agentic-development — development model where the AI agent "writes, tests, deploys, and refines code through rapid feedback cycles", not just suggests snippets. Inner-loop driver, not outer-loop. The 2026-03-26 AWS post's central reframing: agentic coding is gated on architecture, not prompt quality.
concepts/fast-feedback-loops — the primary architectural constraint of agentic development; each unvalidated change should use the cheapest tier that can falsify it. Five tiers named: local emulation → offline data/ML dev → hybrid cloud → preview env → production deploy.
concepts/local-emulation — umbrella concept over SAM sam local start-api, same-image container run, DynamoDB Local, Glue Docker images. Cheapest feedback tier; API-shape parity with real services.
concepts/contract-first-design — OpenAPI specifications authored upfront so agents validate integrations before sibling services are implemented; pairs with preview environments.
concepts/hexagonal-architecture — codebase layer discipline (/domain no Amazon deps, /application orchestration, /infrastructure adapters). The precondition that makes domain- layer unit tests run without cloud credentials.
concepts/project-rules-steering — architectural constraints / coding conventions as Markdown the agent consults automatically. First AWS source pinning .kiro/steering/*.md + Markdown format — Kiro's concrete surface.
concepts/machine-readable-documentation — AGENT.md / RUNBOOK.md / CONTRIBUTING.md + YAML-over-prose as the broader design principle; project rules as one realization.
patterns/local-emulation-first — prefer local emulator over cloud deployment as the default feedback path; canonical four realizations (SAM / containers / DynamoDB Local / Glue Docker).
patterns/hybrid-cloud-testing — for services without local emulators (SNS / SQS named), define minimal CFN / CDK stacks and invoke via SDK. Cloud is "another test dependency — used sparingly and predictably".
patterns/ephemeral-preview-environments — short-lived IaC- defined stacks, deployed on demand, torn down after E2E validation. The above-hybrid-cloud tier.
patterns/layered-testing-strategy — unit (domain, fast) / contract (interfaces) / smoke (deployed env). Each tier catches a distinct failure class.
patterns/tests-as-executable-specifications — tests do more than catch regressions — they define acceptable behavior; a failing test teaches the agent what's expected. Sibling of patterns/executable-specification at the test-suite tier.
patterns/ci-cd-agent-guardrails — required tests + automated reviews + branch protections + preview-env validation + human gates for high-impact changes; expand agent autonomy as confidence compounds.

Recent articles¶

Most recent first; ingested AWS blog posts, not republications from companies/allthingsdistributed.

2026-06-02 — AWS Architecture Blog, Automating contract intelligence with Doczy.ai on AWS. AArete partner-solution post canonicalising Doczy.ai — a production document-intelligence SaaS on AWS that has processed 2.5 million contract documents (50 million pages) with 137 million Bedrock API calls and 442 billion tokens over 22 months at ~250 000 contracts/week peak with 99% extraction accuracy (vs ~55% for the predecessor rules-based system on the same corpus). Wiki's first canonicalisation of the managed AI document-intelligence pipeline on AWS — nine-service composition Cognito → S3 → Lambda → Textract → custom ECS-hosted processing layer ( smart chunking + dual clustering + file-class routing) → Bedrock LLM extraction → Snowflake structured-data sink → intelligent dashboards + downstream automation (claims-system configuration, vendor-invoice verification). Three new wiki algorithmic primitives canonicalised: concepts/smart-chunking (hierarchy-and-one-to-many-preserving chunking, distinct from metadata-only-embedding and embedding-signal-dilution avoidance); concepts/dual-clustering-document-intelligence (two-lens semantic + structural clustering with projection-based fusion — the load-bearing accuracy contributor; sibling to Netflix Service Topology's multi-source fusion at a different altitude); and concepts/prompt-optimization-feedback-loop (continuous prompt edits on production outputs, distinct from agent self-correction loops, GEPA-style training-time prompt search, and model fine-tuning). New system page systems/amazon-textract as the canonical OCR + metadata layer (first dedicated wiki disclosure). New company page companies/aarete as a partner-shape Tier-3 entry. Sibling cross-refs: patterns/two-pass-classify-then-deep-extract (Databricks Unlocking the Archives 2026-05-11) at the classify-then-extract axis; concepts/multi-source-topology-fusion (Netflix Service Topology 2026-05-29) at the multi-view-fusion axis; concepts/metadata-only-embedding (Yelp CS Chatbot 2026-05-27) at the chunking-strategy axis. Borderline partner-PR scope but passes on operational numbers depth and reusable AWS service-composition pattern. Source: sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws.
2026-05-20 — AWS Architecture Blog, Cyber resilience on AWS: A reference approach for recovery from ransomware and destructive events. First wiki canonicalisation of a complete cyber- resilience reference architecture for recovering AWS workloads after ransomware, data extortion, or other destructive events. Five architectural primitives: (1) three- account isolation topology — Production / Recovery / Isolated Recovery Environment (IRE) inside one AWS Organization, with the IRE having "no trust relationship to the Production Account, no VPC peering to it, and no internet-facing resources" and using PrivateLink for AWS service API access only; (2) AWS Backup logically air-gapped vault in Compliance mode — service- enforced deletion protection where "recovery points can't be deleted by any principal, including the account root user or a compromised administrator, within the retention period"; recovery points stored in AWS service-owned accounts (the vault object in the customer's Recovery Account is just the "governance and access boundary"); direct backup target for S3/DynamoDB/EFS, intelligent-orchestration for EBS/Aurora/FSx; S3 Object Lock in Compliance mode + S3 Versioning as out-of-vault fallback; AWS RAM cross-account sharing into the IRE; (3) multi- layer validation pipeline — five layers combining AWS Backup Restore Testing (with the PutRestoreValidationResult API as the customer hook), GuardDuty Malware Protection, AWS Marketplace partner content scanners, workload-specific consistency / invariant / configuration-diff checks, and CloudTrail-driven log/audit review across the backup window; conjunctive acceptance ("both AWS-native validation and workload-specific validation should pass"); pipeline runs inside the IRE for containment; (4) five-stage workflow with Stages 1+2+4 rebuild running in parallel — investigation timeline construction (Stage 1: CloudTrail + VPC Flow Logs + GuardDuty + Security Hub + workload logs + SIR expert assistance) + candidate validation (Stage 2: reverse-chronological scan past the event boundary) + IaC rebuild (Stage 4: from separate version-controlled repository) — gated by an MPA approval (Stage 3) before validated data restore proceeds and Stage 5 cutover via DNS health checks with cross-account-reference updates (IAM Access Analyzer
AWS Config for inventory); (5) Rebuild-Restore- Rotate framework — "Infrastructure is code. Data is backup. Credentials are new" — three-category sorting of recovery targets with the load-bearing caveat that the IaC source itself must not have been the attack target ("recovery starts further upstream with a trusted copy of source"); rotate leg uses Secrets Manager + IAM Identity Center session revocation + Certificate Manager; two-category services (S3, DynamoDB) split between rebuild (config) and restore (data); derived data stores (search indexes, caches, materialised views) regenerate as a recovery dependency. [[concepts/compromise-boundary-recovery- point-selection|Recovery-point selection]] is reverse- chronological from the most recent backup predating the event boundary (the earliest plausible indicator from the investigation timeline) — verbatim "the most recent working copy is often a better target. If an adversary was present in the environment before detection, backups taken during that window might carry the same issues." Retention has to be sized to detection latency, not just routine RPO. Seven-step starting checklist ends with "Exercise the full workflow, including investigation, validation, rebuild, restore, and cutover, on a regular schedule" — the load-bearing item, because cyber events are rare and recovery muscle memory has to come from drills. First wiki appearance of: cyber resilience as a named recovery posture distinct from prevention/detection; Isolated Recovery Environment (IRE); logically air-gapped vault in Compliance mode; AWS Multi-party approval (MPA) as a service primitive; AWS Backup Restore Testing's PutRestoreValidation Result API; GuardDuty Malware Protection's backup + volume scanning modes; the Rebuild-Restore-Rotate framework; the five- stage parallel recovery workflow; event-boundary-driven recovery- point selection; AWS Security Incident Response (SIR); IAM Identity Center as MPA approver substrate + session-revocation primitive; AWS RAM as the Recovery-Account → IRE vault sharing primitive; IAM Access Analyzer + AWS Config as cutover-time cross-account dependency mappers. Cross-source continuity: direct sequel to 2026-03-31 streamlining-DR (which canonicalised AWS Backup, AWS DRS, cross-Region/cross-account axes, clean-room recovery account); this 2026-05-20 post adds the cyber-event-specific layer on top of those general DR primitives. (Source: sources/2026-05-20-aws-cyber-resilience-on-aws-a-reference-approach-for-recovery-from-ransomware-and-destructive-events.)
2026-05-19 — AWS Architecture Blog, How Synthesia optimizes generative AI video inference on Amazon EC2 G7e instances (co-authored with Synthesia Research Engineering). First wiki canonicalisation of GPU-pipeline overlap optimisation at video-frame altitude. Customer-as- co-author disclosure: Synthesia hosts in-house latent-diffusion video generation models on EC2 G7e (NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM, $3.36 / GPU-hour in us-east-2) for the VRAM headroom on GPU-memory-intensive generative video models. The bottleneck is the VAE decoder stage: chunked decoding (one latent frame → ~4 pixel frames per chunk) forces per-chunk D2H + storage-write between consecutive kernel launches, leaving the GPU idle ~18% of wall-clock waiting for the CPU thread to drain frames to disk. The fix — the patterns/asynchronous-frame-generation-pipeline — uses three primitives in concert: (1) two CUDA streams — Compute Stream
dedicated Copy Stream — to overlap kernels with D2H on the GPU's separate compute and copy engines; (2) double buffering on both VRAM and pinned host RAM, with CUDA events as cross-stream barriers; and (3) a dedicated worker CPU thread that drains pinned host buffers to disk while the main Python thread keeps launching kernels. Wiki-attested gain on g7e.2xlarge with the unoptimised Hugging Face Diffusers Wan 2.2 14B VAE decoder (10 consecutive 41-latent-frame decode runs after warmup): GPU kernel utilisation 82% → 99.9% (steady state, two consecutive chunks); decode latency mean 21.99 → 20.17 s/video (−8.2%); P99 22.01 → 20.20 (−1.79 s); Real Time Factor 3.21 → 2.95. Theoretical saving of ~$896 per 1,000 hours of decoded video per single GPU. The post is explicit that the technique generalises: "not specific to the Wan architecture, nor to the specific GPU utilized" — any chunked video-generation pipeline that transfers frames to host can apply it. AWS published a PyTorch reference implementation (aws-samples/sample-asynchronous-video-decoding). Cross-source continuity: same overlap shape as the 2026-05-08 Databricks / Superhuman post canonicalised at LLM-batch-serving altitude (concepts/async-cpu-gpu-pipelined-scheduling) — both cases overlap GPU compute with CPU post-processing, but at different granularities (LLM batches vs video frames) and with different scheduling primitives (LLM scheduler batch dispatch vs explicit dual CUDA streams + pinned buffers + worker thread). First wiki appearance of: G7e instance family, NVIDIA RTX PRO 6000 Blackwell, Wan 2.2 14B model, VAE decoder as a named architectural component, GPU kernel utilisation as a distinct metric (vs MFU), CUDA stream as a distinct primitive, pinned/page-locked memory as a distinct primitive, D2H transfer as a distinct concept, latent- diffusion video generation as an algorithmic shape, Real Time Factor as a video-decode performance metric, and Synthesia as a wiki-tracked company. (Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.)
2026-05-13 — AWS Architecture Blog, Streaming CloudWatch metrics to VPC-based OpenTelemetry collectors using Lambda. Reference architecture + design rationale for an unnamed-customer push-based observability pipeline that replaces a Prometheus + CloudWatch-exporter pull pipeline (which hit API throttling and "caused metric loss and created gaps in observability data for business-critical systems") with a five-primitive push pipeline: CloudWatch Metric Streams (JSON format) → Amazon Data Firehose (with S3 zero-cost fallback) → Lambda transform (synchronously invoked per record batch) → internal NLB → OpenTelemetry collector container fleet on EC2 inside the customer's VPC, with the collector exporting onward to Grafana Cloud / AWS X-Ray / Honeycomb / Lightstep. Load-bearing architectural property: Firehose's HTTP endpoint destination is public-only — "these endpoints must be public — they cannot be private endpoints inside a VPC" — so the Lambda transform doubles as the bridge from public Firehose delivery to a private VPC endpoint. Customer's data-privacy requirement was that the metric data and the OTel collector remain VPC-internal. Canonicalises three new wiki primitives: systems/amazon-cloudwatch-metric-streams (the push-export primitive on top of CloudWatch — supports OTel 0.7 / 1.0 / JSON output formats), systems/amazon-data-firehose (the managed capture-buffer-transform-deliver pipe; sync-Lambda-transform contract; S3 fallback), systems/aws-distro-for-opentelemetry (AWS's OTel distro, named on-ramp). Two new concept pages: concepts/push-vs-pull-monitoring (observability-altitude pull-vs-push trade-off; sibling to concepts/pull-vs-push-streams at the JS-streams altitude), concepts/opentelemetry-collector-three-stage-pipeline (receivers → processors → exporters as the load-bearing vendor-neutral fan-out abstraction). Two new patterns: patterns/firehose-lambda-transform-as-vpc-bridge (re-purpose Firehose's data-transformation hook as a VPC-private-endpoint bridge), patterns/cloudwatch-metric-stream-to-vpc-otel-collector (the composite reference architecture). Extends systems/aws-cloudwatch (Metric Streams sub-service + push-vs-pull contrast), systems/aws-lambda (Firehose synchronous-transform invocation contract — Lambda's third structurally distinct event-source role), systems/aws-nlb (internal-subnet ingress for a horizontally scaled OTel collector fleet), systems/opentelemetry (collector internals + VPC-self-host shape). No operational numbers disclosed — "sub-minute latency for real-time alerting" is asserted as the requirement and as the satisfied outcome but no measured p50 / p99 is given. The post is a reference architecture + CloudFormation walkthrough, not a production retrospective. → sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda
2026-05-12 — AWS Architecture Blog, Building hybrid multi-tenant architecture for stateful services on AWS. Production architecture retrospective from an AWS-internal stateful ad-serving platform ("millions of requests per second", "billions of dollars in annual advertising revenue") migrating off a per-tenant-AWS-account cellular architecture to a hybrid multi-tenant architecture with cluster-level isolation inside shared AWS accounts. Names the fourth architectural- decision axis on AWS multi-tenancy (alongside single-vs- multiple-organizations / account-per-tenant-within-org / cross-partition): isolation grain within a shared account, with dedicated ECS cluster as the per-tenant compute boundary and shared VPC + ALB
IAM + PrivateLink at tier level. The forcing function is in-memory tenant state — verbatim: "When two tenants share a cluster, their in-memory data competes for the same heap. A tenant with a large dataset can trigger out-of-memory conditions that affect its neighbors." Three-level scaling hierarchy canonicalised: tier → cell → infra group — with "two independent scaling levers" (add infra groups to scale within a cell / account; add cells to scale across accounts). Load-bearing architectural decision: "we pre-wire downstream service dependencies at tier creation, not at tenant onboarding" — canonicalised as concepts/pre-integration-at-tier-creation and cited verbatim as "the primary reason for the 80 percent reduction in infrastructure setup steps." Measured outcomes: tenant onboarding 52 days → 7 days (−86%); infrastructure setup steps −80%; engineering effort per onboarding −80%; feature release 2–3 days → 1 day; up to 100 tenants per AWS account. Capacity math disclosed: ALB quotas (100 target groups per LB, 5 TGs per rule, 20 rules → 100 TG capacity → ~50 tenants per infra group); up to 5 ECS clusters per tenant; up to 100 ECS clusters per infra group; 10,000 Route 53 weighted records per hosted zone; 5,000 ECS tasks per service (applies per tenant because cluster is single-tenant). Routing: Route 53 weighted DNS at the stable tier endpoint absorbs both horizontal-scale-out levers transparently (adding infra groups within a cell / adding cells across accounts). ALB listener rules route per tenant (path or header) to per-tenant target groups (patterns/alb-path-routing-per-tenant). Observability shape is two-altitude: tenant-level ( CloudWatch dimensions per ECS service: memory 70/85 warn/crit, TargetResponseTime 100–200 ms baseline + 2× alert, 5XX per target group) + tier-level (ALB ActiveConnectionCount, ProcessedBytes, Route 53 health checks). Structured CloudWatch Logs with tenant_id / tier_id / region fields in every entry; single log group per tier. Before-state quantification: per-tenant AWS account cellular architecture ran at 3% avg CPU / 19% avg memory / 98% wait across 181 targets for 18 clients × 4 regions; 52-day onboarding breakdown (2 wks AWS account + 3 wks VPC/networking + 1 wk IAM + 2 wks integration/testing). Introduces concepts/hybrid-multi-tenant-architecture (central new concept), concepts/cluster-level-tenant-isolation (new isolation-grain concept — sixth shape on the wiki's tenant- isolation spectrum), concepts/in-memory-tenant-state (the stateful-service property driving the grain decision), concepts/pre-integration-at-tier-creation (the load-bearing structural-decoupling concept), concepts/tenant-onboarding-time (new first-class business metric with 52d → 7d before/after disclosure), patterns/hybrid-multi-tenant-architecture (the central pattern), patterns/tier-cell-infra-group-hierarchy (three-level scaling structure), patterns/dedicated-ecs-cluster-per-tenant, patterns/alb-path-routing-per-tenant, patterns/shared-privatelink-at-tier-level, and patterns/configuration-driven-tenant-onboarding. Extends systems/amazon-ecs (cluster-per-tenant isolation boundary), systems/amazon-route53 (horizontal-scale-out via weighted records, not just migration-phase blue/green), systems/aws-alb (per-tenant listener rules for production SaaS, sibling to Deloitte's vCluster test-environment variant), systems/aws-privatelink (tier-level shared endpoints with cost amortisation: ~$7.30/mo / 50 tenants ≈ $0.15/tenant/mo), systems/aws-iam (tier-level shared IAM roles), systems/aws-cloudwatch (tenant-dimensioned metrics + logs), concepts/noisy-neighbor (sixth response axis on the wiki: cluster-level isolation in shared accounts), concepts/cell-based-architecture (AWS-account-as-cell in three-level hierarchy), and concepts/tenant-isolation (sixth shape on the isolation spectrum). → sources/2026-05-12-aws-building-hybrid-multi-tenant-architecture-for-stateful-services
2026-05-11 — AWS Architecture Blog, Choosing between single or multiple organizations in AWS Organizations (Gwendolyn Du). Architectural decision-framework post distilling a cloud- migration advisor's customer conversations into an explicit rubric for the single-vs-multiple AWS Organizations decision. Names the third architectural-decision axis on AWS Organizations topology (alongside the cross-partition shape canonicalised by the 2026-01-30 Sovereign Failover post and the within-one-org account-per-tenant shape canonicalised by the 2026-02-25 ProGlove post): one Organization with OU hierarchy vs multiple Organizations as parallel governance domains. Frames the decision as operational efficiency vs risk isolation and comes down on a specific default: "I recommend starting with a single organization and only considering multiple organizations when the isolation requirements outweigh the operational benefits of centralization" — canonicalised as patterns/start-single-organization-default. Discloses a ten-criterion trade-off table (billing / governance / security isolation / access management / scalability / resource sharing / operational overhead / compliance-and-audit / risk isolation / flexibility) and names four use-case shapes per direction: single pushes on centralized visibility + shared corporate security policy + consolidated billing with volume discounts + resource sharing; multiple pushes on independent business units / regulated-industry entity segmentation (banking / insurance / healthcare) / M&A / maximum blast-radius containment. Names sandbox- organization-for-SCP-testing explicitly as a non- regulatory reason to run a second organization — a distinct pattern from the regulatory segmentation shape. Worked examples disclosed: global retailer with regional teams (single-org), and multinational financial services with retail banking + investment banking + insurance divisions under separate regulatory bodies (multiple-org — patterns/separate-organization-per-regulated-division). Four-step governance model stated explicitly (add accounts → group into OUs → apply policies → enable services) — first explicit wiki disclosure of this canonical flow. Prescriptive- decision post, not retrospective — no production numbers, no customer case-study arc, no quantification of volume-discount magnitude, no migration mechanics between shapes. Introduces concepts/single-vs-multiple-organizations-decision (new canonical decision axis), concepts/operational-efficiency-vs-risk-isolation-tradeoff (new canonical framing of the core design tension), concepts/consolidated-billing-volume-discount (new canonical concept; AWS-specific economic pull toward single-organization), concepts/organizational-unit (new first-class OU concept page), patterns/start-single-organization-default (AWS's recommended posture as named pattern), patterns/sandbox-organization-for-scp-testing (non-regulatory multi-org use case), and patterns/separate-organization-per-regulated-division (regulated-industry / M&A / subsidiary deviation shape). Extends systems/aws-organizations (third topology-axis: single-vs- multiple), concepts/service-control-policy (sandbox- testing preflight use case), and concepts/blast-radius (multiple-organizations as largest non- partition blast-radius containment shape on AWS). → sources/2026-05-11-aws-choosing-between-single-or-multiple-organizations
2026-04-27 — AWS Architecture Blog, Deloitte optimizes EKS environment provisioning and achieves 89% faster testing environments using Amazon EKS and vCluster. Reference- architecture customer case study canonicalising the shared- host-cluster-with-virtual-clusters topology. Deloitte replaced one-dedicated-EKS-cluster-per-QA-testing- need with one shared EKS host cluster running EKS Auto Mode, partitioned into 50+ lightweight vCluster virtual Kubernetes clusters. Essential platform services (ALB controller, CSI drivers, monitoring agents) deploy once on the host and are shared across all virtual clusters via the vCluster syncer. Provisioning time 45 min → <5 min (89% reduction); ~500 QA engineer-hours / year reclaimed; >50 vCPU + >200 GB RAM saved at peak from non-duplicated controllers; up to 70% additional savings from EC2 Spot via Auto Mode; >10 separate tool deployments per environment collapsed into one shared stack. Single internet-facing ALB with ACM-terminated HTTPS serves applications across all 50+ virtual clusters via path-based routing (/app1, /app2, …), using the alb.ingress.kubernetes.io/group.name: vcluster IngressClassParams to aggregate Ingress objects from every virtual cluster into one ALB listener — see patterns/shared-alb-path-based-multi-cluster-routing. The self-service property (QA engineers provision their own environments without platform-team involvement) is delivered by the commercial vcluster/vcluster-platform v4.0.1 Helm chart's UI; this is the canonical case study for concepts/self-service-infrastructure dissolving the concepts/platform-team-bottleneck at the testing- environment altitude. Canonical wiki anchor for concepts/virtual-kubernetes-cluster (new), concepts/platform-team-bottleneck (new), systems/vcluster (new), and the three shared-cluster-with- vclusters patterns (new). → sources/2026-04-27-aws-deloitte-optimizes-eks-environment-provisioning-with-vcluster
2026-04-23 — AWS Architecture Blog, Modernizing KYC with AWS serverless solutions and agentic AI for financial services (Jayanth Kolli, Andrew Black — IBM + AWS). Reference architecture for a cloud-native, event-driven, multi-agent Know Your Customer (KYC) orchestration framework. One Supervisor Agent dynamically planning execution across five domain sub-agents (Identity Verification, Document Analysis, Fraud Detection, Compliance & Risk, Customer Experience) running on Amazon Bedrock AgentCore, with Amazon MSK as the async streaming backbone (inbound / outbound topic pairing
event-listener preprocessing), Lambda as the MSK → AgentCore async-invocation shim, OpenSearch Serverless as the vector store for regulatory-corpus RAG over S3-stored regulations (BSA, USA PATRIOT Act, AMLD, MAS, FATF), and DynamoDB as the sub-millisecond real-time decision state. Three load-bearing pattern disclosures: (1) the Supervisor + sub-agent orchestration with dynamic parallel-or- sequential execution planning + three-band confidence-tiered routing (>95 % → auto-approve, 75-95 % → additional verification, <75 % → human review with full context); (2) the MSK → Lambda → AgentCore async-invoke loop that compresses KYC latency from 3-5 days → sub-5 minutes via parallel sub-agent execution of Doc Analysis + Identity Verify + Fraud Detect; (3) the OpenAPI-declared tool contract invoked via AgentCore Gateway + Lambda targets for five on-prem financial-system classes (Customer Management, Transaction Monitoring, Case Management, Risk/AML, Core Banking) bridged over AWS Direct Connect / Site-to-Site VPN. Thickest on-wiki disclosure of AgentCore as a platform — names four sub-surfaces (Runtime, Memory, Gateway, Identity) in one architecture. Opens AWS's regulated-financial- services agentic-AI axis and canonicalises KYC as an on-wiki concept. Reference-architecture genre: IBM is AWS's Premier Tier Services Partner; no production numbers — the 3-5-day → sub-5-minute target is a design goal, not a measurement. → sources/2026-04-23-aws-modernizing-kyc-with-aws-serverless-solutions-and-agentic-ai
2026-04-22 — AWS Architecture Blog, PACIFIC enables multi- tenant, sovereign product carbon footprint exchange on the Catena-X data space using AWS (Anil Akarsu, Dr. Renè Holschuh — BASF/CircularTree + AWS). Customer case study for PACIFIC, a BASF + CircularTree joint product running on Amazon ECS on AWS Fargate that lets automotive-supply-chain companies exchange Product Carbon Footprint (PCF) data across the Catena-X data space while keeping per-company data sovereignty. Three load-bearing architectural moves: (1) the Cognito-group → IAM-role mapping (Cognito user-pool group → identity pool → per-tenant IAM role → STS temporary credentials) delivers tenant-isolated access to Secrets Manager inside a single AWS account — no account-per-tenant overhead; (2) the per-tenant pcf-exchange-module publishes each supplier's cross- company data-exchange API as an independently-addressable URL that accepts only OAuth2 tokens issued via a successful Eclipse Dataspace Connector (EDC) policy-negotiation handshake — two independent authorization gates (protocol + token); (3) the integration-module decouples messy per-supplier PCF ingest (OAuth2/cert/API-key auth flows, all credentials in Secrets Manager) from the Catena-X exchange path, landing PCF data in S3 under company-specific prefixes guarded by the same IAM role chain. Extends concepts/tenant-isolation with the fifth canonical shape (IAM-role-per-tenant-via-Cognito) and concepts/digital-sovereignty with the cross-tenant-within- data-space variant. Operational numbers are business-scale: manual PCF exchange of a cached dataset takes up to ~7 days → PACIFIC responds in seconds (up to 75% time savings); +80% newly onboarded companies 2024→2025; +55% PCF exchange growth over the same period. Customer case study genre — no p99/TPS/cost-per-tenant numbers. → sources/2026-04-22-aws-pacific-multi-tenant-sovereign-pcf-exchange-catena-x
2026-04-21 — AWS Architecture Blog, Real-time analytics: Oldcastle integrates Infor with Amazon Aurora and Amazon Quick Sight (Kiran Reid, Sankar Das, Avdhesh Paliwal). Customer case study from Oldcastle APG — one of North America's largest construction-materials suppliers, 150+ facilities. Oldcastle migrated from on-prem to Infor Cloud ERP on AWS and hit a reporting gap (Infor's configuration-based reporting covered "minimal reports of our operational needs" vs. hundreds of on-prem real-time reports). The solution is a real-time streaming analytics stack that pulls CDC from Infor's Data Fabric Stream Pipelines into Aurora PostgreSQL and powers QuickSight dashboards embedded inside Infor OS. Three load-bearing architectural moves: (1) the static-EIP NLB + EC2 iptables NAT router + RDS Proxy + private Aurora ingress shape solves external SaaS → private-VPC database without Direct Connect or VPN, and insulates the external-facing allowlist from Aurora failover-driven private-IP churn; (2) API Gateway + Lambda + GenerateEmbedUrlForRegisteredUser delivers embedded QuickSight inside Infor OS with per-viewer signed URLs carrying row-level-security context; (3) JSONB as streaming-ingest buffer on Aurora PostgreSQL pairs schemaless CDC ingest with selective indexing for query performance. Also introduces NDJSON as the CDC wire format, SPICE as the QuickSight hot-aggregate cache, and concepts/static-ip-allowlisting + concepts/cross-vpc-private-connectivity + concepts/signed-embed-url + concepts/row-level-security + concepts/jsonb-streaming-buffer + concepts/embedded-analytics as canonical vocabulary. Numbers: 50+ dashboards in 8 months, 100+ concurrent users, millions of transactions/day in real time, "subsecond" SPICE response. Customer-case-study genre — no p99 latency, IOPS, or embed-rate distribution disclosed. Introduces the Network Load Balancer and Infor Data Fabric Stream Pipelines as first-class wiki entries. → sources/2026-04-21-aws-oldcastle-infor-aurora-quicksight-real-time-analytics
2026-04-07 — AWS News Blog, Launching S3 Files, making S3 buckets accessible as file systems (Sébastien Stormacq). The customer-visible launch announcement for Amazon S3 Files — operational/product-launch companion to Warfield's same-day design essay on companies/allthingsdistributed. Key concrete numbers this post owns (the design essay omitted them): NFS v4.1+ as the wire protocol, ~1 ms latency for active data on the high-performance storage tier, and Amazon EFS confirmed explicitly as the under-the-covers backing for that tier. Introduces the Mount Target as a new deployment primitive — a VPC network endpoint between compute (EC2 / ECS / EKS / Lambda) and the S3 file system; console auto-creates mount targets, CLI requires two commands (create-file-system + create-mount-target). Mount syntax is two shell lines (mount -t s3files fs-<id>:/ <mount-point>). Performance-tier split inside S3 Files: files that benefit from low-latency access are stored/served from the high-performance (EFS-backed) tier; files needing large sequential reads are served directly from Amazon S3 to keep the full S3 throughput envelope without leaving read(2). Byte-range reads transfer only requested bytes for random-access patterns. Intelligent pre-fetching with per-file customer control over load-full-data-or-metadata-only (the concepts/lazy-hydration surface from the design side). Bidirectional sync cadence: file → S3 within minutes; S3 → file within seconds, occasionally a minute+ (matches concepts/stage-and-commit's ~60 s commit interval from the design side). Amazon FSx positioning made explicit — S3 Files targets interactive shared access to S3-resident data; [[systems/ aws-fsx|FSx]] remains the path for on-prem NAS migration, HPC/GPU cluster storage (Lustre), and NetApp ONTAP / OpenZFS / Windows File Server-specific capabilities. Agentic AI called out as a first-class target workload — "building agentic AI systems that rely on file-based Python libraries and shell scripts" — matching Warfield's "coding agent reasoning over a dataset" framing. Pricing is four line items: file-system storage, small-file reads, all writes, and S3 requests during sync (large-file direct-S3 reads fall under normal S3 GET pricing). GA in all commercial AWS Regions at launch. Customer-facing launch — no SLOs, no breakdown of the high-performance-vs-direct-S3 per-file routing heuristic, no 50 M-object mount warning or 30-day eviction mentioned (those are in Warfield's essay). Extends systems/s3-files, systems/aws-efs, systems/aws-fsx, systems/aws-s3; extends concepts/file-vs-object-semantics, concepts/boundary-as-feature, concepts/stage-and-commit, concepts/lazy-hydration; extends [[patterns/presentation-layer- over-storage]], patterns/explicit-boundary-translation. Source: sources/2026-04-07-aws-s3-files-mount-any-s3-bucket-as-a-nfs-file-system-on-ec2-ecs.
2026-03-26 — AWS Architecture Blog, Architecting for agentic AI development on AWS. Prescriptive essay on how to architect AWS systems so AI coding agents can operate effectively. Thesis reframes "better AI coding" from a prompt problem to an architectural problem: "The solution is not better prompts, it's an architecture that treats fast feedback and clear boundaries as first-class concerns." Two co-equal axes: (1) system architecture for fast agentic feedback — local emulation as default via SAM sam local start-api (Lambda + API Gateway), same-image container run for ECS / Fargate, DynamoDB Local for CRUD tests, AWS Glue Docker images for data pipelines; offline development for data/ML workloads as the same shape for reduced-data iteration; hybrid cloud testing for services without emulators (SNS/SQS named) — minimal CloudFormation / CDK stacks invoked via SDK, torn down after; preview environments as short-lived IaC stacks for E2E validation; contract-first design via OpenAPI so agents can validate integrations before all services ship. (2) Codebase architecture for AI-friendly development — hexagonal architecture with explicit /domain + /application + /infrastructure layers (domain has no Amazon deps); project rules / steering files at .kiro/steering/*.md (Kiro's concrete surface — first AWS source to pin the path + Markdown format; example rule: "database access must go through repository classes in the infrastructure layer"); tests as executable specifications across unit + contract + smoke tiers (patterns/layered-testing-strategy); concepts/monorepo + machine-readable docs (AGENT.md / RUNBOOK.md / CONTRIBUTING.md + YAML-over-prose); CI/CD guardrails that scale agent autonomy over time while keeping humans in the loop for high-impact decisions. Prescriptive — no production numbers, no customer case, no retrospective, no steering-file schema / rule precedence / conflict resolution, no positioning vs AWS Well- Architected, no failure-mode taxonomy for emulator-vs-real drift, no cost model for preview environments, no many-agent scaling story. Introduces systems/aws-sam, systems/dynamodb-local, systems/aws-fargate; concepts/agentic-development, concepts/fast-feedback-loops, concepts/local-emulation, concepts/contract-first-design, concepts/hexagonal-architecture, concepts/project-rules-steering, concepts/machine-readable-documentation; patterns/local-emulation-first, patterns/ephemeral-preview-environments, patterns/hybrid-cloud-testing, patterns/layered-testing-strategy, patterns/tests-as-executable-specifications, patterns/ci-cd-agent-guardrails; extends systems/aws-lambda (canonical local-emulation target via SAM), systems/amazon-api-gateway (SAM local-emulation front), systems/amazon-ecs (same-image local-run discipline), systems/aws-glue (data-workload instance of local-emulation- first via Docker images), systems/aws-sns / systems/aws-sqs (canonical no-local-emulator services driving hybrid-cloud- testing), systems/aws-cloudformation / systems/aws-cdk (IaC substrate for hybrid dev stacks + preview environments), systems/aws-iam (runtime-only config surface smoke tests exist to catch), systems/kiro (first AWS source pinning .kiro/steering/ path + Markdown format), concepts/monorepo (agent-context enabler), concepts/specification-driven-development (project rules as codebase-hosted corner of the spec-driven stack). → sources/2026-03-26-aws-architecting-for-agentic-ai-development-on-aws
2025-11-26 — AWS Architecture Blog, Secure Amazon Elastic VMware Service (Amazon EVS) with AWS Network Firewall. Reference-architecture post on deploying centralised network inspection for Amazon EVS (managed VMware Cloud Foundation on EC2 bare-metal in a customer VPC) using AWS Network Firewall + AWS Transit Gateway. Load- bearing architectural ideas: (1) Network Firewall as a bump-in-the-wire middlebox inserted into the traffic path by route-table updates (VPC + TGW), not application changes; (2) the native TGW ↔ Network Firewall integration (GA July 2025) auto-provisions the inspection-VPC subnets / route tables / endpoints and creates a TGW attachment of resource type Network Function with Appliance Mode automatically enabled — closing the historic stateful-inspection-across-AZs landmine; (3) traffic is forced through the firewall by the [[patterns/pre- inspection-post-inspection-route-tables|pre-inspection / post-inspection TGW route-table split]]: all VPC + Direct Connect Gateway attachments associate with the pre-inspection RT (0.0.0.0/0 → firewall attach); the firewall attachment associates with the post-inspection RT, which holds per-destination static routes back to each VPC / on-prem CIDR; (4) one topology inspects east-west (EVS↔VPC, VPC↔VPC), north-south (VPC↔on-prem via DXGW, VPC↔internet via dedicated egress VPC with NAT, internet→VPC via dedicated ingress VPC with ALB), and on-prem↔internet; (5) EVS's NSX overlay segments (192.168.0.0/19 summarised) are propagated into AWS-native VPC route tables by Amazon VPC Route Server speaking BGP with the NSX edge — without this, TGW has no route to the VM CIDRs and east-west inspection blackholes for VM traffic. FQDN-based egress filtering via Network Firewall's Domain-list stateful rule group demonstrated (matches SNI for HTTPS, Host header for HTTP) — sibling primitive to the SNI-based egress filtering already documented at the VPC-egress scale. East-west ICMP drop + ingress HTTP alert demonstrate standard stateful 5-tuple rule shapes. Two CloudWatch log groups (alert + flow) as the canonical logging convention. Default route-table association + propagation explicitly deselected on TGW so new attachments can't accidentally bypass inspection. No production scale numbers (RPS, bandwidth, cost) — reference architecture with demo CIDRs. Introduces systems/amazon-evs, systems/aws-vpc-route-server; concepts/centralized-network-inspection, concepts/bump-in-the-wire-middlebox, concepts/tgw-appliance-mode; patterns/pre-inspection-post-inspection-route-tables; extends systems/aws-network-firewall (new centralised- inspection-via-TGW-native-attachment section + FQDN Domain-list rule-group section + logging conventions), systems/aws-transit-gateway (new centralised-inspection hub section + Appliance Mode automatic-enablement), systems/aws-direct-connect (new DXGW-as-TGW-attachment-in- inspection-path section), concepts/egress-sni-filtering (FQDN / Host-header matching as sibling primitive to SNI, centralised-inspection scale-out).
2026-04-08 — AWS Architecture Blog, Build a multi-tenant configuration system with tagged storage patterns. Reference- architecture post on a multi-tenant configuration service composed of three architectural primitives worth extracting: (1) tagged-storage routing — a Strategy-Pattern factory keyed on the request's key prefix dispatches each call to the appropriate storage backend (tenant_config_* → DynamoDB for high- frequency per-tenant reads; param_config_* → Parameter Store for shared hierarchical config); adding a third backend (Secrets Manager, S3) is one new strategy class plus a map entry; (2) event-driven config refresh — Parameter Store writes emit EventBridge events, a Lambda receives them, queries Cloud Map for healthy Config Service instances, and pushes fresh values via gRPC refresh RPCs into in-place in-memory caches — escape valve from the TTL-vs- staleness dilemma without polling or service restarts; (3) JWT tenant-claim extraction — service never accepts tenantId from request parameters; extracted exclusively from the validated Cognito JWT's immutable custom:tenantId claim, making cross-tenant access structurally impossible via request-body manipulation. Stack: NestJS gRPC services on ECS Fargate in private subnets behind ALB / VPC Link / API Gateway / WAF / Cognito; shared IAM execution role (TVM/STS-based per-tenant credentials pitched as next-step for compliance-grade isolation at 50-100ms-per-op cost). DynamoDB composite key pk = TENANT#{id} / sk = CONFIG#{type} provides data-layer tenant isolation. Multi- dimensional extension pk = TENANT#{id}|SERVICE#{name} for service-level isolation in a shared account. DAX named for microsecond-latency acceleration at 1000+ RPS (5-10× vs single- digit-ms DynamoDB baseline). Cache-security discipline: tenant- prefixed keys, credentials + PII never cached, JWT validation + DynamoDB composite keys as final enforcement boundary. Positioned as the lightest shape on the wiki's tenant-isolation spectrum (single shared account, app + JWT enforcement; sits between Convera's in-account multi-layer and ProGlove's account-per- tenant). No production scale numbers disclosed — reference architecture with CloudFormation sample on GitHub. Introduces systems/aws-parameter-store; concepts/cache-ttl-staleness-dilemma; patterns/tagged-storage-routing, patterns/event-driven-config-refresh, patterns/jwt-tenant-claim-extraction; extends systems/dynamodb (composite-key tenant isolation + multi-dimensional extension), systems/amazon-cognito (immutable custom attribute detail), systems/amazon-eventbridge (Parameter Store change-event integration), systems/aws-lambda (invalidator compute role), systems/grpc (refresh-RPC transport + internal service-to-service), systems/aws-cloud-map (refresh-time service discovery), systems/amazon-ecs, systems/aws-fargate, systems/aws-secrets-manager (future-extension backend), systems/aws-systems-manager (Parameter Store as sub-service), concepts/tenant-isolation (shape-spectrum lightest-entry row), concepts/event-driven-architecture (config-refresh instance). → sources/2026-04-08-aws-build-a-multi-tenant-configuration-system-with-tagged-storage-patterns
2026-04-06 — AWS Architecture Blog, Unlock efficient model deployment: Simplified Inference Operator setup on Amazon SageMaker HyperPod. Product-announcement framing for the HyperPod Inference Operator shipping as a native EKS add-on (replacing the prior Helm-chart install path) with a built-in helm_to_addon.sh migration script. Four architectural primitives worth extracting sit inside the one-click-install marketing frame: (1) multi-instance-type fallback via Kubernetes node affinity — InferenceEndpointConfig.spec.instanceTypes takes a priority-ordered list (["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]); compiles to requiredDuringSchedulingIgnoredDuringExecution to restrict placement + preferredDuringSchedulingIgnoredDuringExecution with descending weights for priority; scheduler silently falls back to the next-preferred type on capacity miss — structural answer to GPU-capacity-constrained placement. Raw Kubernetes nodeAffinity also exposed directly on the CRD for custom scheduling (Spot exclusion / AZ preference / custom labels). (2) Managed tiered KV cache as a platform capability (implicit memory hierarchy across GPU HBM / host DRAM / SSD); claimed up to 40% inference-latency reduction for long-context workloads (methodology not disclosed). First wiki instance of managed KV cache as an add-on feature rather than a model-server-library concern. (3) Intelligent routing — three strategies (prefix-aware / KV-aware / round-robin) selected at install time; prefix-aware routes requests sharing a common prompt prefix to the same replica so the KV cache hits on the second request; KV-aware reads live cache-occupancy telemetry before routing. Specialisation of concepts/workload-aware-routing for LLM inference. (4) EKS add-on as lifecycle-packaging — converts the previously-Helm Kubernetes operator + its dependency add-ons (S3 Mountpoint CSI driver / FSx CSI driver / cert-manager / metrics-server) + IAM scaffolding
S3/VPC-endpoint prereqs into a native EKS add-on with managed version bumps, rollback, and a one-shot helm_to_addon.sh migration script (auto-discovery of existing Helm config, OVERWRITE flag, backup files in /tmp/hyperpod-migration-backup-<timestamp>/, tags migrated ALBs / ACM certs / S3 objects with CreatedBy: HyperPodInference). Four IAM roles carved on install (Execution Role + JumpStart Gated Model Role + ALB Controller Role + KEDA Operator Role) as the least-privilege default posture the managed installer produces. Two first-class CRDs under inference.sagemaker.aws.amazon.com/v1: InferenceEndpointConfig (bring-your-own-model from S3) and JumpStartModel (managed catalog, modelId + instanceType). Observability via Amazon Managed Grafana — time-to-first-token (TTFT), end-to-end latency, GPU utilization, cache performance, routing efficiency. Quantitative disclosure minimal / un-methodologied: "hours before a single model can serve predictions" → "within minutes of cluster creation" for install; 40% KV-cache latency reduction hedged as "up to" with no baseline. Borderline Tier-1 ingest — product-PR genre, four transferable architectural primitives keep it above the scope filter. Extends concepts/managed-data-plane (Kubernetes-operator- layer instance — sits on same spectrum as App Mesh → Service Connect and EKS Auto Mode), concepts/shared-responsibility-model (responsibility line moves one layer into previously-customer- operated Helm/IAM/dependency scaffolding). → sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod
2026-04-01 — AWS Architecture Blog, Automate safety monitoring with computer vision and generative AI. Serverless + event-driven CV + GenAI safety-monitoring solution at scale. Target fleet 10,000+ cameras; end-to-end image-capture-to-notification up to 37 s. Core architectural content: (1) serverless driver-worker pattern with one-worker-per-use-case for independent scaling + fault isolation (each use case has its own SNS/ SQS/SageMaker endpoint chain + DLQ); (2) real SageMaker Serverless → Serverful inference pivot at production scale — Serverless inference's no-GPU-support + 6 GB memory ceiling caused OOMs at hundreds of sites → migration to ml.g6 Serverful endpoints + auto scaling + raised Lambda concurrent-execution limits + SQS batch-size tuning; (3) multi-account AWS isolation — training, ingest, web-app, analytics environments each in distinct accounts; raw PII-bearing images purged within days after Rekognition-based anonymisation; (4) four-stage intelligent alarm detection: object detection → zone overlap against "digital tape" (configurable 50% threshold) → loiter-time algorithm tracking same-object persistence via mask similarity across consecutive minute intervals (per-zone acceptable-loiter-time tuning) → multilayered validation (confidence thresholds + RLE mask comparison for cross-interval consistency); (5) per-camera-per-use-case risk aggregation to avoid alert fatigue — append new occurrences to open records; scheduled auto-close on resolution; SLA escalation through per-zone preferred channels (Slack / email / ticket); (6) [[patterns/ data-driven-annotation-curation|data-driven ground-truth curation]] — Athena aggregates false-positive rates across camera types + deployment conditions for prioritised retraining + surfaces below-confidence-threshold inferences + Claude multimodal on Bedrock detects underrepresented object classes (replaces blanket per-site daily annotation jobs that became untenable); (7) GLIGEN-based synthetic-data generation on SageMaker Batch Transform producing a 75,000-image PPE dataset (3 classes)
75,000-image Housekeeping dataset (7 classes) at 512×512 with ground-truth bounding boxes auto-embedded; YOLOv8 trained on PyTorch 2.1 + cosine LR + AdamW reached 99.5% mAP@50 / 100% precision / 100% recall for PPE + 94.3% mAP@50 for Housekeeping without a single manually-annotated real image; (8) training-pipeline / model-promotion decoupling — GT job creation via Step Functions triggered by EventBridge cadence + 7-step SageMaker AI Pipelines (load checkpoint → prep+split → train → drift baseline → evaluate → package → register); model-approval EventBridge event triggers a Lambda to open a code review updating the endpoint's S3 URI — so scientists approve on metrics, engineers merge + deploy via CI/CD; (9) tape-labeling synthetic-composite preparation — hourly Step Functions workflow stitches clear portions of multiple time-shifted camera frames using a voting mechanism (pixel regions with no detected objects) into a composite image where floor tapes are fully visible, solving the occluded-tape annotation problem on newly-onboarded cameras. Introduces systems/gligen, systems/yolo, systems/aws-sagemaker-ground-truth, systems/aws-sagemaker-pipelines, systems/aws-sagemaker-endpoint, systems/aws-sagemaker-batch-transform, systems/amazon-rekognition, systems/amazon-aurora, systems/amazon-cloudfront, systems/aws-appsync, systems/aws-waf, systems/amazon-quicksight, systems/amazon-redshift-spectrum; concepts/alert-fatigue; patterns/serverless-driver-worker, patterns/multilayered-alarm-validation, patterns/alarm-aggregation-per-entity, patterns/data-driven-annotation-curation, patterns/synthetic-data-generation, patterns/multi-account-isolation; extends systems/aws-sagemaker-ai, systems/aws-lambda, systems/aws-sns, systems/aws-sqs, systems/aws-s3, systems/aws-step-functions, systems/amazon-eventbridge, systems/dynamodb, systems/amazon-bedrock, systems/amazon-athena, systems/amazon-route53, concepts/event-driven-architecture, concepts/serverless-compute, concepts/tenant-isolation, concepts/blast-radius. → sources/2026-04-01-aws-automate-safety-monitoring-with-computer-vision-and-generative-ai
2026-03-31 — AWS Architecture Blog, Streamlining access to powerful disaster recovery capabilities of AWS (co-authored with Arpio, AWS Resilience Competency Partner). Survey-style layered DR architecture: data protection via AWS Backup → compute DR via AWS DRS → whole-workload recovery via partner orchestration. Canonical wiki reference for: (1) the resilience dimension of Shared Responsibility — AWS provides primitives, customer owns orchestration / testing / config translation; (2) cross-Region vs cross-account as orthogonal isolation axes (cross-Region = fault isolation; cross-account = clean-room recovery for ransomware); (3) AWS Backup as unified backup control plane — vaults, policies, schedules; closed native-service gaps for EFS, FSx, and cross-Region backup for DynamoDB; (4) AWS DRS quantified numbers: "crash-consistent RPO of seconds, RTO typically 5–20 min" via continuous block-level replication (concepts/crash-consistent-replication); (5) the DR configuration translation problem — restored resources get new endpoints; canonical mechanism is a Route 53 private hosted zone CNAME mapping old-endpoint → new-endpoint in the recovered VPC so applications don't need config rewrite + redeploy on failover; (6) least-privilege cross-account DR agent pattern (Arpio's IAM role explicitly denied from mutating source OR reading/exfiltrating data). Partner-post classification — ~40% body is Arpio product positioning; architectural signal concentrated in the AWS Backup / AWS DRS / Route 53 CNAME / shared-responsibility sections. Only quantified numbers: seconds RPO, 5–20 min RTO (DRS), >140 AWS resource types (Arpio coverage). No multi- Region active-active discussion, no cost numbers, no DR testing/drill cadence, no cross-partition axis (covered separately by Sovereign Failover). Companion to the Generali EKS Auto Mode customer-case study in the partner/customer marketing-leaning subset of AWS Architecture Blog. → sources/2026-03-31-aws-streamlining-access-to-dr-capabilities.md
2026-03-23 — AWS Architecture Blog, How Generali Malaysia optimizes operations with Amazon EKS. Customer-case-study for Generali Malaysia's adoption of EKS Auto Mode (AWS-managed K8s data plane on Bottlerocket nodes with weekly AMI replacement, default-add-on upgrades, cluster-version upgrades). Canonical wiki reference for the peer-AWS-service integration surface of EKS — six managed services wired into one cluster: GuardDuty (EKS protection + runtime monitoring, MITRE ATT&CK-annotated findings), Inspector (ECR-image-to- running-container vulnerability prioritisation), Network Firewall (SNI-based egress allow-list with the private → firewall-public → NAT-protected topology), Secrets Manager + External Secrets Operator (env-var-only secret injection, no volume mounts — aligns with stateless-only discipline), Amazon Managed Grafana (per-namespace dashboards with CloudWatch data source), AWS Billing split cost allocation data for EKS (cluster / namespace / deployment / node native cost-allocation tags). Compound K8s operating discipline stated as platform-wide rules: stateless- only micro-services + immutable pods + Helm charts as standardised deployment mechanism + HPA traffic-driven auto-scaling. Customer-retained safety contract under Auto Mode's platform-driven node churn: Pod Disruption Budgets + Node Disruption Budgets + off-peak maintenance windows. AWS Well- Architected Framework is the explicit organising structure of the post; shared- responsibility model is extended into the K8s data plane. No quantified outcomes published (cluster sizes / cost deltas / MTTR numbers all absent); value is in the integration topology. Introduces systems/eks-auto-mode, systems/amazon-guardduty, systems/amazon-inspector, systems/aws-network-firewall, systems/external-secrets-operator, systems/amazon-managed-grafana, systems/bottlerocket, systems/generali-malaysia-eks; concepts/well-architected-framework, concepts/shared-responsibility-model, concepts/pod-disruption-budget, concepts/egress-sni-filtering; patterns/runtime-vulnerability-prioritization, patterns/eks-cost-allocation-tags, patterns/disruption-budget-guarded-upgrades; extends systems/aws-eks (Auto Mode role), systems/kubernetes (compound stateless/immutable/Helm/HPA discipline), systems/aws-secrets-manager (EKS-native source- of-record role), systems/helm (platform-wide packaging standard), concepts/managed-data-plane (Kubernetes-layer instance), concepts/observability (namespace-per-tenant CloudWatch→AMG shape), concepts/stateless-compute (enterprise- wide platform discipline), companies/aws (Tier 1, 2026-03-23).
2026-03-18 — AWS Architecture Blog, AI-powered event response for Amazon EKS. Product-launch post for AWS DevOps Agent (systems/aws-devops-agent), a Bedrock-hosted autonomous AI agent for EKS incident investigation — AWS-managed-service peer to Datadog's Bits AI SRE. Canonical wiki reference for telemetry-based resource discovery: two-path K8s discovery (Kubernetes API for static resource state + OpenTelemetry for runtime relationships — service-mesh traffic, distributed traces, metric attribution) fused into a unified dependency graph the agent reasons over. Four data sources per Agent Space: Managed Prometheus (metrics), Amazon CloudWatch Logs (logs), X-Ray (traces), EKS topology (K8s API). Investigation workflow: scenario-template trigger → data collection → ML/statistical pattern analysis against a learned baseline → confidence-scored root-cause ranking → mitigation recommendations. Separate Prevention surface runs weekly (~15h compute budget) over past investigations for code / observability / infrastructure / governance recommendations. Tutorial-heavy post — kubectl port-forwards + Python traffic-generator + Agent-Space-creation UI walkthrough; 1,806 discovered resources in one demo topology view is the only quantitative number; no eval / correctness / latency / cost numbers disclosed. Preview-product status (demo screenshot shows empty Prevention recommendations).
2026-02-26 — AWS Architecture Blog, Digital transformation at Santander: How platform engineering is revolutionizing cloud infrastructure (cowritten with Julio Bando, Santander F1RST). Canonical production reference for platform-engineering at large-enterprise regulated-industry scale: Santander is a global bank (>10 countries, 160M+ customers, 200+ critical systems, billions of daily transactions). Pre-platform, provisioning new infrastructure took up to 90 days and routinely deviated from architecture standards. Solution: Catalyst (systems/santander-catalyst), an in-house internal developer platform co-built with AWS Professional Services through the Platform Strategy Program (PSP). Two load-bearing layers: an in-house developer portal (patterns/developer-portal-as-interface) as the unified self-service surface, and a single EKS control plane cluster hosting three sub-components — data-plane claims managed by ArgoCD for GitOps continuous sync of application stacks; policies catalog using Open Policy Agent (Gatekeeper) as a central repository of compliance + security policies (patterns/policy-gate-on-provisioning, the regulated-industry K8s-admission-time counterpart to SCPs in ProGlove's AWS-Organizations-native shape); and stacks catalog of Crossplane [Composite Resource Definitions
Compositions](<../patterns/crossplane-composition.md>) used "as a universal resource provisioner" (concepts/universal-resource-provisioning) "to manage resources across multiple cloud providers consistently and declaratively." This is the first wiki instance of concepts/control-plane-data-plane-separation applied at infrastructure-provisioning tier: the EKS cluster decides, the provisioned AWS (and multi-cloud) resources are the data plane. Reported outcomes: full provisioning cycle 90 days → hours (best case: minutes); standard provisioning 30 days → 2 days; proof-of-concept preparation 90 days → 1 hour; 100+ pipelines consolidated into one control plane; generative AI agent stack implementation 105 days → 24 hours, eliminating "dozens of provisioning tickets per environment". Three workloads cited as evidence Catalyst is a universal platform: (1) generative AI agents stack, the first success case; (2) modern data platform with built-in Databricks integration + data lakes + automated ETL + centralized data catalog + segregated experimentation environments — ~3,000 monthly data-experimentation provisioning tickets eliminated; (3) cloud process orchestration migrating legacy workflows to AWS Step Functions + retry patterns + error handling + centralized process monitoring. Catalyst + ProGlove Insight pin the patterns/platform-engineering-investment spectrum at both ends (regulated bank with K8s-native substrate vs SaaS multi-tenant with AWS-native substrate; same pattern, different substrate). Cultural outcome framed as co-equal with the technical outcome: "Catalyst also catalyzed a cultural change within Santander, promoting an automation and self-service mindset among development teams." Marketing-leaning AWS- Architecture-Blog format — architectural shape stated in full (EKS + Crossplane + ArgoCD + OPA + portal + XRDs) with strong quantified outcomes but no p50/p99 distribution shape, no EKS cluster sizing, no Crossplane Composition examples, no OPA policy examples, no incident retrospective, no post-PSP in-house team size. Introduces systems/santander-catalyst, systems/crossplane, systems/argocd, systems/open-policy-agent, systems/databricks; concepts/universal-resource-provisioning, concepts/gitops; patterns/developer-portal-as-interface, patterns/crossplane-composition, patterns/policy-gate-on-provisioning; extends systems/aws-eks (new role as infrastructure control plane beyond app compute), systems/aws-step-functions (legacy- workflow modernization target), patterns/platform-engineering-investment (second canonical production instance), concepts/policy-as-data (OPA/Rego as the third wiki realization alongside Cedar/AVP and AWS SCPs), concepts/control-plane-data-plane-separation (infrastructure-provisioning tier as new layer), patterns/golden-path-with-escapes (Catalyst's stacks catalog as the multi-cloud-infrastructure- level sibling to Figma's K8s-service-def instance). → sources/2026-02-26-aws-santander-catalyst-platform-engineering
2026-02-25 — AWS Architecture Blog, 6,000 AWS accounts, three people, one platform: Lessons learned (cowritten with Julius Blank, ProGlove). Canonical production reference for account-per-tenant isolation on AWS at SaaS-tenant scale. ProGlove's Insight platform (systems/proglove-insight): ~6,000 production AWS accounts, 3-person platform team, >120,000 deployed service instances, ~1,000,000 Lambda functions in production. The AWS account boundary is the sole structural isolation mechanism — no shared compute / storage / network / IAM across tenants — which delivers blast-radius containment, a simplified developer mental model ("a deployed service instance always belongs to exactly one tenant"), per-tenant customization, and transparent cost attribution via Cost Explorer on linked accounts. Well-Architected review simplification — many Operational Excellence / Security pillar items "didn't even apply" because isolation is structural rather than implemented in code. Cost framing: services that scale linearly with account count (smallest EC2 ≈ $3/mo → $3,000/mo at 1,000 accounts) must be avoided; Lambda + DynamoDB scale-to-zero is what makes the architecture economically viable. Deployment: single monorepo → single CodePipeline execution → single StackSet update op → parallel fan-out to all tenant accounts from a central Infrastructure account (patterns/fan-out-stackset-deployment). Named failure modes: partial rollouts (retry/rollback must be defined and tested), pipeline duration (large-scale updates take significant time to propagate), tooling maturity (StackSets "powerful but still evolving"). Account lifecycle asymmetry: creation is fully automated via Step Functions, retirement is manual scripts run regularly — architectural signal that criterion for automation is overhead introduced, not dogma (patterns/automate-account-lifecycle). Baseline guardrails = SCPs + AWS Organizations + strict IAM management. Observability: third-party aggregation application forwarded from all tenant accounts, with multi-alerts defined once and applied across tenant accounts individually (patterns/central-telemetry-aggregation); engineers see a single pane of glass while telemetry originates per-account. Key discipline: don't replicate per-account alarms blindly (use streaming/aggregation), tag everything including source account ID (consider Organizations tag policies), CloudWatch OAM called out as the AWS-native primitive that has since shipped. Per-account quotas become a distributed-monitoring problem — Lambda concurrent-execution quota named canonical case (concepts/per-account-quotas); a "single pane of glass quota tracker" is essential. The three-person team only works because of deliberate platform-engineering investment — complexity shifted from application code to platform tooling; "the team size stays constant, and efficiency grows with every account added." Candid on gaps: "multi-account strategies are common at the enterprise level, adopting them at the SaaS tenant level is less common. Patterns, tooling, and reference architectures are still evolving, which means building custom solutions becomes necessary." No latency / throughput numbers, no incident retrospective; positioned as prescriptive- retrospective co-authored post. Introduces concepts/account-per-tenant-isolation, concepts/blast-radius, concepts/per-account-quotas, concepts/service-control-policy; patterns/fan-out-stackset-deployment, patterns/automate-account-lifecycle, patterns/central-telemetry-aggregation, patterns/platform-engineering-investment; systems/aws-stacksets, systems/aws-codepipeline, systems/aws-step-functions, systems/aws-cost-explorer, systems/aws-observability-access-manager, systems/aws-cloudformation, systems/proglove-insight; extends systems/aws-organizations, systems/aws-lambda, systems/dynamodb, systems/aws-iam, concepts/tenant-isolation (the architectural opposite of Convera's in-account multi-layer shape). → sources/2026-02-25-aws-6000-accounts-three-people-one-platform
2026-02-05 — AWS Architecture Blog, How Convera built fine-grained API authorization with Amazon Verified Permissions. Cross-border-payments platform rolling out Amazon Verified Permissions (managed Cedar engine) across four authorization flows on a single shared Lambda-authorizer + API Gateway shape: (1) customer UI + API, (2) internal customer-service apps federated from Okta through Cognito, (3) service-to-service machine-to-machine (patterns/machine-to-machine-authz via Cognito client-credentials), and (4) multi-tenant SaaS via per-tenant policy stores with DynamoDB tenant_id → policy-store-id mapping + backend zero-trust re-verification + RDS-side tenant-context enforcement as the last line of defense. Attribute sourcing via pre-token-generation Lambda hook (concepts/token-enrichment) reads roles from RDS (customer flow) or attributes from DynamoDB (internal / multi-tenant flows) and signs them into the Cognito access token; downstream authorizer evaluates Cedar policies against JWT claims only — no second round-trip. concepts/policy-as-data governance: Cedar policies live in DynamoDB (source of truth) + DynamoDB Streams sync pipeline continuously propagates changes into AVP; authorship gated by a strictly-regulated IAM role owned by infosec. Submillisecond end-to-end latency is a product of a two-level cache — API Gateway (authorizer-decision cache) plus app-level Cognito token cache — not AVP alone (AVP is "millisecond-level"). Reported outcomes: thousands of authorization requests per second, submillisecond latency, ~60% reduction in time spent on access-management tasks. Subtle correctness property called out: the same Cedar policy must be evaluated at both the UI level (to gate affordance visibility) and the API level (to gate enforcement) — skipping the API-side check on the assumption that the UI is the only client is the canonical fine-grained-auth anti-pattern. Marketing-leaning AWS Architecture Blog format — architectural signal is dense (the policy-store-per-tenant tradeoff enumeration, the zero-trust re-verification step, the DynamoDB-Streams policy sync, the access-token-vs-ID-token distinction, the one-shape-four-flows reuse) but with no latency distribution, no cost baseline, no Cedar policy volume, no incident postmortem, and no discussion of policy-store resource-quota limits. Introduces systems/amazon-verified-permissions, systems/cedar, systems/amazon-cognito, systems/amazon-api-gateway, systems/okta; concepts/fine-grained-authorization, concepts/attribute-based-access-control, concepts/policy-as-data, concepts/tenant-isolation, concepts/zero-trust-authorization, concepts/authorization-decision-caching, concepts/token-enrichment; patterns/lambda-authorizer, patterns/per-tenant-policy-store, patterns/pre-token-generation-hook, patterns/zero-trust-re-verification, patterns/machine-to-machine-authz; extends systems/aws-iam, systems/aws-lambda, systems/dynamodb, systems/aws-rds, systems/aws-eks, systems/aws-policy-interpreter, concepts/least-privileged-access. → sources/2026-02-05-aws-convera-verified-permissions-fine-grained-authorization
2026-02-04 — AWS Architecture Blog, Mastering millisecond latency and millions of events: The event-driven architecture behind the Amazon Key Suite. Amazon Key team's retrospective on modernising their access-management platform from a tightly-coupled monolithic design (+ ad-hoc SNS/SQS pairs) to EventBridge-centric event-driven architecture. Core organisational move is patterns/single-bus-multi-account (AWS reference pattern): central DevOps-owned event bus + rules + targets, per-service-team accounts owning application stacks, logical separation via rules. Three custom components built on top of EventBridge to close its gaps: a custom schema repository (JSON-Schema Draft-04; versioned; build-time code bindings; chosen because EventBridge has a schema registry but no native validation — "EventBridge provides developers with tools to implement validation using external solutions or custom application code, it currently does not include native schema validation capabilities"); a client library with client-side validation (evaluated vs centralized validation service + rejected on extra network hop + own-scaling overhead) handling code bindings + pre- publish validation + serde + publish/subscribe abstractions; and a CDK subscriber constructs library (patterns/reusable-subscriber-constructs) provisioning per- subscriber event bus + cross-account IAM + monitoring + alerting from ~5 lines. Named pre-migration failure modes: cross-service cascade deadlocks ("an issue in Service-A triggered a cascade of failures across many upstream services, with increased timeouts leading to retry attempts and ultimately resulting in service deadlocks") + single-device-vendor fleet-wide blast radius + loose schemas blocking safe schema evolution + ad-hoc SNS/SQS pairs without standardisation. Reported post-migration numbers: 2,000 events/s, 99.99% success rate, 80ms p90 ingestion→target invocation across 14M subscriber calls, integration time for new use cases 5d → 1d (80% improvement), new-event onboarding 48h → 4h, publisher/subscriber integration 40h → 8h, standardized client library addressed 90% of common integration errors, 100% single-control-plane governance of event-bus infrastructure, 100% automated unauthorized-data-exchange detection. Marketing-leaning AWS-Architecture-Blog format — architectural-signal dense (ownership split + schema-repository vs registry distinction + CDK construct shape + the specific failure modes that motivated the move) but no comparison baselines / distribution shapes / cost breakdown / DLQ-and-poison-pill design. Introduces systems/amazon-eventbridge, systems/amazon-key, systems/aws-cdk; concepts/event-driven-architecture, concepts/service-coupling, concepts/schema-registry; patterns/single-bus-multi-account, patterns/client-side-schema-validation, patterns/reusable-subscriber-constructs; extends systems/aws-sns, systems/aws-sqs. → sources/2026-02-04-aws-amazon-key-eventbridge-event-driven-architecture
2026-01-30 — AWS Architecture Blog, Sovereign failover: Design for digital sovereignty using the AWS European Sovereign Cloud. Architectural companion to the skipped 2026-01-16 AWS European Sovereign Cloud GA launch announcement. Codifies cross-partition failover as the response to human-driven disasters (regulatory / geopolitical / sovereignty shifts that regional redundancy inside a single partition cannot address). Names the four AWS partitions (standard aws / GovCloud aws-us-gov since 2011 / AWS China aws-cn / European Sovereign Cloud aws-eusc since 2026) and the three hard-boundary consequences: IAM credentials don't carry, S3 Cross-Region Replication / Transit Gateway inter-region peering / other cross-region primitives don't work across partitions, and service availability differs per partition. Applies the canonical backup / pilot-light / warm- standby / multi-site active-active DR ladder to the partition axis, with pilot-light the recommended cross-partition default ("only built up when needed"). Names exactly three cross-partition network-connectivity options: internet-over-TLS, IPsec Site-to- Site VPN, and Direct Connect gateway / PoP-to-PoP partner connections. Enumerates five cross-partition auth tactics (IAM roles with trust + external IDs, STS regional endpoints, resource-based policies, cross-account roles via Organizations, and federation from a centralized IdP — called out as modern best practice: patterns/centralized-identity-federation); IAM-user fallback uses Secrets Manager + Lambda + backup-user-for-availability. Introduces "double- signed certificates" — per-partition Private CA root CAs cross-sign each other to enable cross-partition authenticated mTLS while preserving partition isolation; operational complexity (cross-signing agreements, trust-store management, validation / revocation, audit trails) is named. Prescribed Organizations topology: completely separate Organization mandatory for European Sovereign Cloud; paired-optional for GovCloud (but separate still recommended if sovereign-standalone is the goal). Per-partition isolated Transit Gateways / separate Route 53 zones / PrivateLink for secure cross- partition communication; per-partition Config aggregators and Security Hub instances; Control Tower manages commercial side but cannot directly manage GovCloud or European Sovereign Cloud accounts. Vendor-independence against geopolitical risk framed as cheaper via cross-partition (IaC reuse) than cross-cloud. Design-pattern article; no RTO / RPO / cost / latency numbers, no data-synchronization recipe (custom tooling only), no partition-internal architecture detail. Introduces concepts/aws-partition, concepts/digital-sovereignty, concepts/disaster-recovery-tiers, concepts/cross-partition-authentication, concepts/cross-signed-certificate-trust; patterns/cross-partition-failover, patterns/pilot-light-deployment, patterns/warm-standby-deployment, patterns/centralized-identity-federation; systems/aws-european-sovereign-cloud, systems/aws-govcloud, systems/aws-iam, systems/aws-sts, systems/aws-organizations, systems/aws-control-tower, systems/aws-direct-connect, systems/aws-transit-gateway, systems/aws-privatelink, systems/aws-config, systems/aws-security-hub, systems/aws-secrets-manager. → sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty
2026-01-12 — AWS Architecture Blog, How Salesforce migrated from Cluster Autoscaler to Karpenter across their fleet of 1,000 EKS clusters. Customer-case-study of Salesforce's mid-2025 → early-2026 migration of 1,000+ EKS clusters / 1,180+ node pools / thousands of internal tenants from the Kubernetes Cluster Autoscaler + Auto Scaling groups to Karpenter. Canonical wiki reference for Karpenter at extreme scale. Motivations: thousands of rigid node groups, multi-minute scaling latency, poor AZ balance
memory-workload performance bottlenecks, inefficient bin-packing with stranded capacity. Tooling: in-house Karpenter transition tool with three first-class design principles — zero-disruption (PDB-respecting drain), rollback- capable (reverse-transition to ASG first-class), CI/CD-integrated; plus a Karpenter patching check tool for AMI validation. Config translation: automated mapping from ASG fields (instance type, root-volume size/IOPS/type/ throughput, node labels) to Karpenter's NodePool / EC2NodeClass across all 1,180+ pools. Rollout: phased with soak times under risk-based sequencing (low-risk environments first, prod last). Five operational lessons — each a generalisable principle: (1) PDB hygiene as governance — overly restrictive / misconfigured PDBs block node replacement; fix with audit + app-owner partnership + OPA admission-time validation; (2) sequential node cordoning with verification checkpoints beats parallel — parallel destabilised clusters; (3) [[concepts/kubernetes-label-length-limit|63-character label limit]] is a migration-blocker that hides in human-friendly naming conventions (analytics-bigdata-spark-executor-pool-m6a- 32xlarge-az-a-b-c = 67 chars); (4) Singleton protection under bin-packing consolidation via guaranteed-pod-lifetime + workload- aware disruption policies (PDBs structurally can't protect 1-replica pods); (5) 1:1 ephemeral-storage translation — not defaulting — required for I/O-intensive workloads. Outcomes: scaling latency minutes → seconds; 80% manual-ops reduction; 5% FY2026 cost savings with another 5-10% projected FY2027; eliminated thousands of node groups; heterogeneous GPU / ARM / x86 in single NodePool; improved IP efficiency via subnet-decoupled provisioning; true self-service infrastructure via developer-authored NodePool CRDs. Industry context: Datadog reports +22% Karpenter-provisioned node share in the last 2 years. Introduces systems/salesforce, systems/cluster-autoscaler, systems/aws-auto-scaling-groups; concepts/scaling-latency, concepts/singleton-workload, concepts/availability-zone-balance, concepts/ip-address-fragmentation, concepts/kubernetes-label-length-limit, concepts/self-service-infrastructure; patterns/automated-configuration-mapping, patterns/phased-migration-with-soak-times, patterns/rollback-capable-migration-tool, patterns/sequential-node-cordoning, patterns/risk-based-sequencing; extends systems/karpenter (canonical largest-scale production reference), systems/aws-eks (1,000-cluster column in the role-axis table), systems/open-policy-agent (OPA as operational-correctness enforcement, not just security), concepts/bin-packing (solver-side at production scale), concepts/pod-disruption-budget (OPA-enforced admission governance pattern), patterns/disruption-budget-guarded-upgrades (customer-managed-autoscaler variant alongside the Generali managed-data-plane variant). → sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters
2025-12-11 — AWS Architecture Blog, Architecting conversational observability for cloud applications. Reference architecture for a generative-AI-powered Kubernetes troubleshooting assistant on EKS, with a companion GitHub sample. Canonical wiki companion to the later 2026-03-18 AWS DevOps Agent product launch — this is the self-build blueprint, that is the AWS-managed-service shape, of the same problem. Core architectural content: (1) two deployment options selected by a single Terraform variable — a RAG-based chatbot (default) and a Strands Agents + MCP variant; (2) telemetry-to-RAG pipeline: Fluent Bit DaemonSet → Kinesis Data Streams buffer → Lambda normalize + batch-embed via Bedrock's amazon.titan-embed-text-v2:0 → OpenSearch Serverless k-NN index (hot tier); the Strands variant stores 1024-dim embeddings in S3 Vectors instead (cold tier); (3) an in-cluster troubleshooting assistant pod with a read-only RBAC service account and a static kubectl allowlist — canonical read-only agent action allowlisting; (4) an iterative agentic troubleshooting loop — retrieve → LLM proposes kubectl → assistant runs → output back to LLM → LLM decides continue or conclude — combining historical telemetry + real-time cluster state in one context; (5) the agentic variant uses three specialized agents (Agent Orchestrator + Memory Agent + K8s Specialist — patterns/specialized-agent-decomposition) calling EKS MCP Server over MCP; Slack bot as the UI; Pod Identity for AWS service access; (6) security discipline — sanitize logs before embedding (vectors inherit source data governance), KMS encryption Kinesis-in-transit + OpenSearch-at-rest, private subnets + VPC endpoints + prompt- injection input validation. MTTR framing cites 2024 Observability Pulse Report: 48% of orgs name team- knowledge gaps as their biggest observability challenge, 82% say issue resolution takes >1h. Explicit "Pro tip": Lambda should batch Kinesis consumption + embedding generation
OpenSearch writes for cost. No evaluation / MTTR-delta / cost / prompt-injection-resistance numbers disclosed — architecture + sample repo only. Post asserts the approach extends to ECS and Lambda but only EKS is demonstrated. Introduces systems/strands-agents-sdk, systems/eks-mcp-server, systems/fluent-bit, systems/amazon-kinesis-data-streams; concepts/agentic-troubleshooting-loop; patterns/allowlisted-read-only-agent-actions, patterns/telemetry-to-rag-pipeline; extends systems/aws-eks (AI-troubleshooting target — self-build variant), systems/amazon-bedrock (embedding model + LLM substrate), systems/amazon-opensearch-service (hot-tier telemetry vector store), systems/s3-vectors (cold-tier telemetry vector store in Strands variant), systems/amazon-titan-embeddings (telemetry domain), systems/aws-lambda (telemetry-to-RAG compute tier), systems/model-context-protocol (cluster-operations tool surface), systems/aws-devops-agent (managed-service sibling), concepts/observability (self-build shape under agent-assisted debugging layer), patterns/specialized-agent-decomposition (three-agent Kubernetes split). → sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications
2025-07-16 — AWS News Blog, Introducing Amazon S3 Vectors: First cloud storage with native vector support at scale (preview) (Channy Yun). Launches S3 Vectors as a first-class S3 data primitive for vector similarity indices. Resource model: vector bucket → vector index → vectors + metadata; launch limits 10,000 indexes/bucket × tens-of-millions vectors/index; Cosine or Euclidean distance (per index); float32 data; key-value metadata usable as query filters. Separate s3vectors API client (put_vectors, query_vectors, ...). SSE-S3 (default) or SSE-KMS encryption. Claims up to 90% TCO reduction vs DRAM/SSD vector-cluster storage; subsecond query performance at scale. Two integrations at launch: Bedrock Knowledge Bases selects an S3 vector bucket as the vector store for RAG apps (exposed in Bedrock console and SageMaker Unified Studio), and a one-click Advanced search export → Export to OpenSearch flow migrates a vector index to an OpenSearch Serverless k-NN collection — canonical patterns/cold-to-hot-vector-tiering (AWS's stated hot uses: "product recommendations or fraud detection"). Paved embedding path: Amazon Titan Text Embeddings V2 via bedrock.invoke_model. Launch regions: IAD, CMH, PDX, FRA, SYD. Internal index architecture (HNSW / IVF / hybrid), concrete latency numbers, filter-with-ANN semantics, and competitive recall comparisons are not disclosed. Marketing-leaning post; primary architectural signal is API shape + capacity ceiling + tiering story. Introduces concepts/vector-embedding, concepts/vector-similarity-search, and concepts/hybrid-vector-tiering to the wiki. → sources/2025-07-16-aws-amazon-s3-vectors-preview-launch
2025-05-03 — AWS Database Blog, Understanding transaction visibility in PostgreSQL clusters with read replicas (Sergey Melnik). AWS's response to Jepsen's 2025-04-29 report on Amazon RDS for PostgreSQL Multi-AZ cluster transaction- visibility behavior. Confirms Jepsen's empirical finding but re-situates the anomaly as inherent to community PostgreSQL (discussed on pgsql-hackers since 2013), not RDS-specific. Mechanism: Postgres's commit path writes the WAL commit record (durable) then asynchronously removes the xid from the in-memory ProcArray (visible); two concurrent non-conflicting commits can flip ProcArray removal order relative to WAL LSN, admitting the Long Fork anomaly — a violation of concepts/snapshot-isolation's atomic-visibility property (concepts/visibility-order-vs-commit-order). Affects all Postgres isolation levels (Read Committed / Repeatable Read / Serializable) because all take snapshots via ProcArray. Absent in Single-AZ Postgres, systems/aurora-limitless, and systems/aurora-dsql (both replace ProcArray with time-based MVCC via Postgres-extension surgery — see patterns/postgres-extension-over-fork). Worked Alice-and-Bob illustration: primary says #1, replica says #2, commit log says #2 — both observers correct under SI on their own node, jointly incompatible under formal SI. Proposed upstream fix: Commit Sequence Numbers (CSN) — stamp a monotonic CSN on each commit and snapshot by watermark comparison; multi-patch effort presented at PGConf.EU 2024; AWS PostgreSQL Contributors Team (formed 2022) participating. Practical impact on end-user apps is low (most apps serialize via row conflicts / app-level ordering) but load-bearing against five enterprise capabilities: distributed-SQL consistent snapshots, read-write splitting, snapshot-then-WAL-replay data sync, PITR to LSN, and tuple-xid-to-logical-commit-time replacement. Also names a CPU-cost angle — ProcArray scanning is "a measurable fraction of CPU" at thousands of connections on large Postgres servers. AWS-recommended workarounds: never rely on implicit commit ordering, introduce explicit synchronization (shared counters, timestamps, database constraints). Vendor response to a third-party analysis; no throughput/latency/cost numbers beyond the CPU-fraction qualitative claim. → sources/2025-05-03-aws-postgresql-transaction-visibility-read-replicas
2025-01-18 — AWS Containers Blog, Migrating from AWS App Mesh to Amazon ECS Service Connect (also doubles as the AWS App Mesh discontinuation announcement). Canonical EOL-migration article: App Mesh closed to new customers 2024-09-24, fully discontinued 2026-09-30. Architectural comparison between App Mesh's four-tier Envoy-sidecar abstraction (Mesh / Virtual Service / Virtual Router / Virtual Node + self-managed systems/envoy sidecar per ECS Task) and Service Connect's flat Client/Server role model + single Cloud Map namespace + AWS-managed Service Connect Proxy. The load-bearing architectural shift is concepts/managed-data-plane — same Envoy, different operational contract. Explicit 5-feature delta: retry/outlier tuning (full vs timeouts-only), version-weighted routing (yes vs no), observability (DIY vs free CloudWatch), mTLS (yes vs not yet), cross-account mesh sharing (AWS RAM vs single-account only). EKS customers directed to Amazon VPC Lattice separately (not a sidecar mesh). Prescribes patterns/blue-green-service-mesh-migration with Route 53 weighted records / CloudFront continuous deployment / ALB multi-target-group — because an ECS Service cannot be in both meshes simultaneously and the two meshes have no cross-environment networking. No quantitative numbers (architecture + migration-pattern post, not retrospective). → sources/2025-01-18-aws-app-mesh-discontinuation-service-connect-migration
2024-12-04 — AWS News Blog, Prevent factual errors from LLM hallucinations with mathematically sound Automated Reasoning checks (preview) (Antje Barth). Preview-launch of systems/bedrock-guardrails-automated-reasoning-checks in US West (Oregon): the first productized AWS neurosymbolic safeguard. First public disclosure of the end-to-end concepts/autoformalization pipeline (document upload → identify concepts → decompose units → translate to formal logic → validate → combine into logical model → review rules + typed variables in UI). Three-verdict validation output (Valid / Invalid / Mixed results) with structured suggestions (unstated-assumptions vs variable-assignments). Canonical regenerate-with-feedback Python snippet — the reasoner's rule descriptions are wrapped in <feedback> tags and re-prompted to the LLM (natural language, never the formal form). Variable-description-as-tuning-knob (is_full_time worked example). Inventories AWS's pre-LLM automated-reasoning portfolio across five service areas: storage, networking, virtualization, identity, cryptography. Positioned as complementary to prompt engineering / RAG / contextual grounding — only major-cloud safeguard combining safety + privacy + truthfulness. The concrete substrate for the more- abstract 2026-02 Cook thesis piece. → sources/2024-12-04-aws-automated-reasoning-to-remove-llm-hallucinations
2024-07-29 — AWS Open Source Blog, Amazon's Exabyte-Scale Migration from Apache Spark to Ray on Amazon EC2. Amazon Retail BDT's multi-year migration of their copy-on-write compactor off Spark (on EMR) onto a hand-crafted Ray application on EC2. Q1 2024: 1.5 EiB Parquet input, 4 EiB Arrow in-memory, >10k vCPU-years/quarter, 82% better cost efficiency per GiB vs Spark, 100% on-time delivery, Ray still trailing Spark on first-time reliability (99.15% vs 99.91%). Contributed The Flash Compactor to Ray's systems/deltacat. ~$120M/yr typical-EC2- customer-equivalent annual saving. Open-source extensions target Iceberg/Hudi/Delta. +24% additional win via Daft-on-Ray I/O. → sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2