AWS (Amazon Web Services)¶
The AWS blog family — the AWS News Blog, AWS Open Source Blog,
AWS Architecture Blog, AWS Compute Blog, AWS Storage Blog,
and others at aws.amazon.com/blogs/* — collectively form one of the
canonical Tier-1 system-design sources. AWS blog posts vary widely in
signal: at one end, substantive architecture retrospectives with
quantified production numbers (Amazon Retail BDT's Spark-to-Ray
migration is the canonical recent example); at the other, product
PR / feature announcements filtered out per the AGENTS.md scope
rules.
For the complementary (and often higher-signal) source, see companies/allthingsdistributed — Werner Vogels' blog republishes primary-source AWS / Amazon architecture content from the CTO's perspective.
Scope and what we ingest from AWS blogs¶
Ingest eagerly (Tier 1 treatment):
- Production architecture retrospectives with concrete scaling numbers (e.g. BDT's exabyte-scale Ray migration).
- Team postmortems or incident writeups with named systems.
- AWS service-design posts that explain trade-offs, not just features (often cross-posted with companies/allthingsdistributed).
- Open-source contribution narratives that expose internal design (DeltaCAT, Firecracker, Aurora DSQL, etc.).
Skip:
- Service-GA announcements / feature launches without architectural depth (PR/FAQ posts belong on companies/allthingsdistributed if they have architectural content).
- Industry / vertical marketing posts ("AI for X industry").
- Pricing announcements, account-opening posts, region-launch announcements.
- Customer-case-study puff pieces.
- Conference-session recaps without architectural specifics.
Key systems (as surfaced in ingested sources)¶
Data platform / Amazon Retail BDT:
- systems/ray — open-source distributed compute framework (Berkeley RISELab); specialist successor to Spark for Amazon BDT's exabyte-scale compactor.
- systems/apache-spark — the generalist that Ray is displacing in Amazon BDT's specialist compactor workload; still more reliable (99.91% vs 99.15% first-time success in 2024).
- systems/deltacat — Ray project for managed-Iceberg/Hudi/Delta compaction; Amazon BDT contributed The Flash Compactor.
- systems/amazon-emr — AWS's managed Hadoop/Spark runtime; substrate for the original Amazon BDT compactor.
- systems/aws-glue, systems/aws-glue-for-ray — serverless Spark / Ray runtimes on AWS.
- systems/anyscale-platform — commercial Ray runtime.
- systems/apache-iceberg, systems/apache-hudi, systems/delta-lake — the three canonical open table formats DeltaCAT targets.
- systems/apache-parquet, systems/apache-arrow — the disk + in-memory columnar formats.
- systems/amazon-redshift, systems/amazon-athena, systems/apache-hive — SQL engines in the post-Oracle BI stack over S3.
- systems/amazon-ion — Amazon's richly-typed self-describing data format; the canonical schema wrapper for the 50+ PB Oracle→S3 migration.
- systems/daft — Python+Rust DataFrame library with Ray integration; +24% cost-efficiency win on Amazon BDT's Ray compactor.
ML / computer-vision — SageMaker AI subsystems + adjacent:
- systems/aws-sagemaker-ground-truth — managed labelling-job substrate; integrates with data-driven curation via Step Functions triggered by EventBridge.
- systems/aws-sagemaker-pipelines — managed ML-workflow orchestration; canonical 7-step pipeline (checkpoint → prep+split → train → drift baseline → evaluate → package → register); a model-approval EventBridge event triggers Lambda model-promotion code review, cleanly decoupling science + application updates.
- systems/aws-sagemaker-endpoint — managed model-serving; the Serverless → Serverful pivot at production scale is the canonical wiki lesson (Serverless lacks GPU + has 6 GB memory cap → OOMs → ml.g6 Serverful endpoints with auto-scaling).
- systems/aws-sagemaker-batch-transform — batch-inference runtime; canonical substrate for GLIGEN-based synthetic-data generation at 75K-image-per-use-case scale.
- systems/aws-sagemaker-hyperpod — SageMaker's large-scale distributed-training/inference compute substrate with EKS orchestration; host cluster for the Inference Operator and the 2025 observability + training-operator capabilities.
- systems/sagemaker-hyperpod-inference-operator — Kubernetes controller reconciling `InferenceEndpointConfig` + `JumpStartModel` CRDs on a HyperPod EKS cluster; ships as a native EKS add-on as of 2026-04-06 (previously Helm-packaged). Canonical wiki instance of multi-instance-type fallback via node-affinity, managed tiered KV cache, and prefix-aware / KV-aware intelligent routing as platform-level LLM-serving primitives.
- systems/amazon-rekognition — managed CV API; face-detection step in PII anonymisation pipelines under patterns/multi-account-isolation.
- systems/gligen — grounded diffusion model producing photorealistic images with ground-truth bounding boxes embedded by construction; on AWS runs on SageMaker Batch Transform.
- systems/yolo — single-stage real-time object detector family; default CV serving model in PPE + Housekeeping detection pipelines on SageMaker endpoints.
Web application / analytics / BI:
- systems/amazon-cloudfront — global CDN distributing web-app static assets.
- systems/aws-appsync — managed GraphQL API; Lambda resolvers for CRUD + embedded analytics integration.
- systems/aws-waf — managed web-application firewall; common OWASP-class protection.
- systems/amazon-quicksight — BI dashboards; embedded in customer applications via AppSync + Lambda resolvers.
- systems/amazon-redshift-spectrum — S3-external-table SQL query layer; powers BI dashboards over S3-resident risk / event data without ETL movement.
Relational DB (beyond the Aurora sovereignty/consistency lineage below):
- systems/amazon-aurora — cloud-native Postgres / MySQL-compatible relational engine; the parent line of Aurora DSQL + Aurora Limitless. Common application-state backbone for AWS customer architectures; downstream of ML-inference gatekeeping in CV-safety pipelines.
Compute / storage / integration primitives:
- systems/aws-ec2 — canonical substrate for Ray and most non-serverless AWS compute.
- systems/aws-s3 — foundational object storage; also the storage half of every compute-storage-separated AWS analytics stack.
- systems/s3-vectors — S3's vector similarity-search primitive (preview 2025-07-16); storage-tier cost for embeddings, integrates with Bedrock Knowledge Bases and exports to OpenSearch for hot queries.
- systems/amazon-bedrock-knowledge-bases — managed RAG service; pluggable vector stores (S3 Vectors, OpenSearch, Aurora, Pinecone, etc.).
- systems/amazon-titan-embeddings — AWS first-party text embedding models on Bedrock (`titan-embed-text-v2`).
- systems/amazon-opensearch-service — managed OpenSearch; k-NN is the hot counterpart to S3 Vectors in cold→hot vector tiering.
- systems/amazon-sagemaker-unified-studio — unified data+AI dev environment; Knowledge Bases with S3 Vectors selectable as vector store.
- systems/dynamodb — durable state store in the Amazon BDT Ray job-management substrate.
- systems/aws-parameter-store — SSM's hierarchical-KV config-store subsystem; EventBridge-emitting change events make it the canonical source-of-truth side of patterns/event-driven-config-refresh (paired with DynamoDB on the high-frequency side of patterns/tagged-storage-routing).
- systems/aws-sns, systems/aws-sqs — the pub/sub and queue primitives in the same substrate.
Other AWS / Amazon systems referenced across sources:
- Most AWS service lineage lives on companies/allthingsdistributed — S3, EBS, Nitro, Lambda, Firecracker, Aurora DSQL, SageMaker, Bedrock Guardrails, Kiro. Cross-reference there.
Relational databases / Postgres family:
- systems/aws-rds — managed relational (MySQL / Postgres / MariaDB / SQL Server / Oracle); Multi-AZ cluster Postgres inherits community Postgres's Long Fork visibility anomaly (Jepsen 2025-04-29; AWS response 2025-05-03).
- systems/postgresql — the upstream substrate; the visibility model (`ProcArray` scan, asynchronous with WAL commit) is the root cause. AWS's PostgreSQL Contributors Team (formed 2022) is co-developing the proposed CSN upstream fix.
- systems/aurora-dsql — ground-up distributed SQL; replaces `ProcArray` visibility with time-based MVCC, sidestepping Long Fork. Wire-compatible Postgres via public extension API.
- systems/aurora-limitless — horizontally-scaled Aurora Postgres; also replaces `ProcArray` with time-based MVCC.
Service mesh / container networking:
- systems/aws-app-mesh — AWS's first-gen Envoy-based sidecar service mesh for ECS/EKS/Fargate. Discontinued 2026-09-30, closed to new customers 2024-09-24. Four-tier abstraction (Mesh / Virtual Service / Virtual Router / Virtual Node) + customer-managed Envoy sidecar per Task.
- systems/aws-ecs-service-connect — current managed replacement for ECS. Flat Client/Server role model, AWS-managed Service Connect Proxy (Envoy under the hood), free CloudWatch app-level metrics. Not yet mTLS-capable (2025-01-18).
- systems/aws-vpc-lattice — current replacement for EKS. Not a sidecar mesh — VPC-level service-networking managed control + data plane across EKS / EC2 / Lambda / on-prem.
- systems/amazon-ecs — compute substrate under both meshes; Service ↔ Task ↔ Task Definition abstraction; exclusive mesh-membership constraint is load-bearing for migration.
- systems/aws-cloud-map — shared service-discovery substrate. Cross-account namespace sharing is not supported, forcing single-account deployments in Service Connect.
- systems/aws-private-ca — TLS certificate authority under both meshes; App Mesh uses general-purpose certs, Service Connect uses short-lived certs (cheaper).
- systems/amazon-route53 — DNS weighted-routing primitive for blue/green mesh migration edge traffic shifting.
Partitions / cross-partition sovereign-failover architecture:
- systems/aws-european-sovereign-cloud — the 2026 EU-resident partition; mandatory separate Organization (cannot be paired into a commercial Organization the way GovCloud can be).
- systems/aws-govcloud — 2011 US public-sector partition (FedRAMP / ITAR); the cross-partition-architecture precedent for European Sovereign Cloud; GovCloud accounts can optionally be invited into a commercial Organization.
- systems/aws-iam — per-partition identity; credentials don't cross partitions; the load-bearing primitive forcing explicit cross-partition auth design.
- systems/aws-sts — per-partition regional endpoints; one of the five named cross-partition auth tactics.
- systems/aws-organizations — per-partition topology; European-Sovereign-Cloud-strict / GovCloud-optional separation asymmetry.
- systems/aws-control-tower — commercial-partition governance; cannot manage GovCloud or European Sovereign Cloud accounts.
- systems/aws-direct-connect — dedicated-line cross-partition connectivity; PoP-to-PoP partner connections as the regulated-workload shape.
- systems/aws-transit-gateway — per-partition; inter-region peering does not function across partitions.
- systems/aws-privatelink — prescribed secure cross-partition communication primitive on top of the network layer.
- systems/aws-config, systems/aws-security-hub — per-partition monitoring / posture substrates.
- systems/aws-secrets-manager — IAM-user credential storage in the cross-partition auth IAM-user fallback pattern.
PKI:
- systems/aws-private-ca — per-partition CA; cross-partition mTLS requires double-signed (cross-signed) root CAs.
Disaster recovery / resilience (within a partition, cross-Region + cross-account):
- systems/aws-backup — unified data-protection control plane tying together per-service backup mechanisms (RDS / EBS / S3 / Aurora etc.) behind vaults + policies + schedules; added first-party backup coverage for services that lacked it (EFS, FSx) and cross-Region backup for DynamoDB; canonical backup-and-restore tier primitive on the DR ladder.
- systems/aws-elastic-disaster-recovery (AWS DRS) — continuous block-level replication, recovery orchestration, automated server conversion; seconds RPO, 5–20 min RTO typical; target VPC configuration on recovery; the canonical pilot-light / warm-standby enabling primitive.
- systems/arpio — AWS Resilience Competency Partner SaaS; full-workload discovery + backup + cross-Region cross-account recovery on top of AWS Backup + AWS DRS + service-native primitives; 140 AWS resource types covered; the named DR config-translation layer via Route 53 private-hosted-zone CNAMEs.
Event-driven architecture / org-scale pub/sub:
- systems/amazon-eventbridge — managed serverless event bus; content-based routing rules + schema registry / discovery + cross-account targets via resource policies; the canonical AWS substrate for event-driven architecture at organisation scale. Load-bearing gap vs a strict-validation requirement: no native schema validation.
- systems/amazon-key — physical-access-management product family (In-Garage Delivery, apartment-building access); production instance of patterns/single-bus-multi-account on EventBridge plus a custom schema repository + client library + CDK subscriber constructs library. Reported 2,000 events/s / 99.99% success / 80ms p90 / 14M subscriber calls post-migration; integration time for new use cases 5d → 1d.
- systems/aws-cdk — IaC substrate for the reusable subscriber constructs pattern — per-subscriber event bus + cross-account IAM + monitoring + alerting packaged behind a ~5-line `new Subscription(...)` construct.
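The "~5 lines buys the whole subscriber stack" idea can be illustrated with a pure-Python schematic (no aws-cdk dependency; the class shape, resource names, and fields are hypothetical stand-ins for the real CDK construct library, not its API):

```python
class Subscription:
    """Schematic of a reusable subscriber construct: one constructor call
    packages the per-subscriber event bus + cross-account IAM +
    monitoring + alerting that a service team would otherwise hand-build."""

    def __init__(self, service: str, account_id: str, event_pattern: dict):
        bus_name = f"{service}-subscriber-bus"
        self.resources = {
            # Dedicated per-subscriber event bus.
            "event_bus": bus_name,
            # Rule on the central bus forwarding matching events here.
            "rule": {"pattern": event_pattern, "target_bus": bus_name},
            # Cross-account permission for the central account to deliver.
            "iam_policy": {
                "Effect": "Allow",
                "Action": "events:PutEvents",
                "Principal": {"AWS": account_id},
            },
            # Monitoring + alerting wired up by default.
            "alarms": [f"{service}-delivery-failures", f"{service}-dlq-depth"],
        }


# The "~5 lines" a service team actually writes:
sub = Subscription(
    service="delivery-status",
    account_id="111122223333",
    event_pattern={"source": ["amazon.key"], "detail-type": ["LockStateChanged"]},
)
```

Everything below the constructor call is what the construct generates; versioning the library is what lets the platform team evolve the packaged resources without touching subscriber code.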
Multi-account SaaS platform (account-per-tenant):
- systems/aws-stacksets — AWS's fan-out deployment primitive: one CloudFormation template, many target accounts/Regions from a central admin account. Load-bearing for account-per-tenant CI/CD at ProGlove's ~6,000-account scale. Named failure modes: partial rollouts, pipeline duration, tooling maturity edge cases.
- systems/aws-codepipeline — central orchestration point for fan-out deployment; single execution triggers a single StackSet update that fans out in parallel.
- systems/aws-cloudformation — the underlying declarative IaC engine under both StackSets and CDK.
- systems/aws-step-functions — account-creation orchestrator in ProGlove's lifecycle; account-retirement deliberately kept as scripts (architectural asymmetry is the signal).
- systems/aws-cost-explorer — transparent per-tenant cost attribution by virtue of the account boundary being the billing boundary; key benefit of account-per-tenant for consumption-priced SaaS.
- systems/aws-observability-access-manager — AWS-native cross-account CloudWatch observability primitive; ProGlove built its own third-party aggregation before OAM shipped, now the recommended starting point for new platforms.
- systems/proglove-insight — ProGlove's SaaS platform; the canonical wiki production reference for concepts/account-per-tenant-isolation on AWS (~6,000 tenant accounts, 3-person platform team, ~1M Lambda functions).
Internal developer platform / platform engineering on EKS:
- systems/santander-catalyst — Santander's in-house IDP on AWS EKS — canonical wiki production reference for platform engineering at large-enterprise regulated-industry scale (160M+ customers, 200+ critical systems, billions of daily transactions). Co-built with AWS ProServe via the Platform Strategy Program (PSP). Provisioning cycle 90 days → hours / minutes; PoC prep 90 days → 1 hour; 100+ pipelines consolidated; GenAI agent stack 105 days → 24 hours; ~3,000 monthly data-experimentation tickets eliminated.
- systems/crossplane — CNCF universal resource provisioner; every cloud / SaaS resource modeled as a K8s CR reconciled by a controller; XRDs + Compositions as the composability primitive. Catalyst's stacks catalog.
- systems/argocd — CNCF GitOps continuous-delivery controller for Kubernetes; Git as the source of truth; continuous-reconcile loop. Catalyst's data-plane claims component.
- systems/open-policy-agent — CNCF policy engine (Rego) + Gatekeeper K8s admission controller; enforces compliance + security at admission time. Catalyst's policies catalog; the regulated-bank analogue of SCPs in ProGlove.
- systems/aws-eks — also serves as the infrastructure control plane cluster hosting Crossplane + ArgoCD + OPA; a fundamentally different role from app-compute EKS (Figma, Convera).
- systems/databricks — named integration target in Catalyst's modern data platform workload (built-in integration).
AI-for-ops / AI-powered incident response:
- systems/aws-devops-agent — AWS's fully managed autonomous AI agent for EKS incident investigation and preventive recommendations. Built on Amazon Bedrock; accessed through a purpose-built web UI behind an Agent Space (tenant configuration unit — IAM + IdP + data-source endpoints + scope). AWS vendor peer to Datadog's Bits AI SRE on the same category axis (hosted agent for live-telemetry incident investigation), with a different vendor relationship (AWS managed service scoped to AWS cloud resources). Canonical wiki reference for telemetry-based Kubernetes resource discovery — agent combines a Kubernetes API scan (graph nodes) with OpenTelemetry-derived runtime relationships (graph edges) into a fused dependency graph used for root-cause analysis.
- systems/strands-agents-sdk — AWS's open-source Python SDK for agentic systems (multi-agent orchestration, MCP tool calling, session management); used in the self-build alternative to the DevOps Agent — the Strands variant of the 2025-12-11 conversational-observability blueprint — hosting three specialized agents (Orchestrator / Memory / K8s Specialist).
- systems/eks-mcp-server — AWS-Labs-published MCP server exposing Kubernetes / EKS operations as standardized MCP tools; the agent-native interface to a cluster in the Strands variant of the 2025-12-11 blueprint.
- systems/fluent-bit — CNCF telemetry forwarder running as a cluster DaemonSet; ingest tier of the telemetry-to-RAG pipeline in the RAG variant of the 2025-12-11 blueprint (Fluent Bit → Kinesis → Lambda + Bedrock embeddings → OpenSearch Serverless).
- systems/amazon-kinesis-data-streams — AWS's managed durable streaming substrate; ingest-buffer tier of the same telemetry-to-RAG pipeline. Enables Lambda batching as the primary cost lever at the embedding-generation layer.
- systems/amazon-bedrock — managed foundation-model runtime underlying the DevOps Agent.
- systems/amazon-managed-prometheus — metrics data source (one of four canonical Agent-Space data sources).
- systems/aws-x-ray — traces data source (one of four canonical Agent-Space data sources).
Containers — EKS + Auto Mode + peer AWS services:
- systems/eks-auto-mode — managed-data-plane variant of EKS; AWS operates Bottlerocket nodes, default add-ons, cluster upgrades; customer retains node-pool policy + disruption-budget-guarded upgrade contract. Canonical Kubernetes-layer instance of concepts/managed-data-plane.
- systems/bottlerocket — container-optimised Linux distro; default AMI under EKS Auto Mode; immutable root + A/B transactional updates.
- systems/amazon-guardduty — managed threat-detection with EKS protection + runtime monitoring + CloudTrail + malware detection → MITRE ATT&CK-annotated multistage attack findings.
- systems/amazon-inspector — managed vulnerability scanner; ECR-image-to-running-container mapping enables runtime vulnerability prioritisation by actual production exposure.
- systems/aws-network-firewall — managed stateful firewall; SNI-based egress allow-listing at per-VPC scale is the canonical concepts/egress-sni-filtering pattern; 2025-11-26 EVS post surfaces the centralised-inspection shape (native TGW attachment, Appliance Mode auto-enabled, Domain-list FQDN rule groups) for hub-and-spoke deployment across many VPCs + on-prem via DXGW.
- systems/amazon-evs — managed VMware Cloud Foundation (VCF) stack running on EC2 bare-metal inside a customer VPC; target for lift-and-shift VMware migrations; NSX overlay + vSAN + vMotion all integrated with AWS-native networking.
- systems/aws-vpc-route-server — BGP-speaking VPC primitive; bridges overlay networks (NSX inside EVS) to AWS-native VPC route tables so TGW / Network Firewall can route to overlay CIDRs.
- systems/external-secrets-operator — CNCF K8s operator that syncs from Secrets Manager to native K8s `Secret` objects (env-var consumption path; no volume mounts or daemonsets).
- systems/amazon-managed-grafana — managed Grafana; Generali uses it with a CloudWatch data source for per-namespace tenant dashboards.
- systems/generali-malaysia-eks — Generali Malaysia's EKS platform as a synthesized case study (Malaysian insurance customer): a six-peer-AWS-service integration surface + stateless-only + immutable pods + Helm + HPA discipline.
- systems/karpenter — CNCF open-source Kubernetes node autoscaler, AWS-originated; canonical wiki production reference is Salesforce's 1,000-cluster / 1,180-node-pool migration (2026-01-12). Solves multi-minute scaling latency, subnet-pinned provisioning, poor AZ balance, and rigid node-group boundaries of the predecessor CA / ASG stack.
- systems/cluster-autoscaler — CNCF predecessor autoscaler that Karpenter is displacing on AWS; indirection through ASGs produces minutes-scale latency, thousands of rigid node groups, poor AZ balance.
- systems/aws-auto-scaling-groups — AWS EC2 capacity primitive underneath Cluster Autoscaler; Karpenter bypasses.
- systems/salesforce — customer with the largest known EKS fleet (1,000+ clusters / 1,180+ node pools); canonical wiki Karpenter-at-extreme-scale production reference.
Key patterns / concepts introduced via AWS blog sources¶
Computer vision + GenAI at scale:
- patterns/serverless-driver-worker — canonical instance in the AWS safety-monitoring solution; driver orchestrates, per-use-case workers scale + fail independently; each worker chain is SNS → SQS → SageMaker endpoint with its own DLQ. Inference acts as gatekeeper filtering image volume so Aurora isn't overwhelmed.
- patterns/multilayered-alarm-validation — four-stage composition (object detection → zone overlap → loiter-time persistence → confidence + RLE-mask validation) that turns per-frame detections into auditable alarms.
- patterns/alarm-aggregation-per-entity — per-(entity, use-case) rollup; append new occurrences to open records; scheduled auto-close on resolution; SLA escalation through per-zone preferred channels.
- patterns/data-driven-annotation-curation — Athena-driven FP-rate aggregation + below-threshold-confidence sampling + Claude multi-modal analysis of misclassified samples for class imbalance; replaces blanket per-site daily annotation.
- patterns/synthetic-data-generation — GLIGEN + SageMaker Batch Transform producing auto-annotated training data at 75K-image scale per use case; YOLOv8 hits 99.5% mAP@50 for PPE without any manually-annotated real images.
- patterns/multi-account-isolation — workload-purpose-axis separation (training / ingest / web-app / analytics each in distinct AWS accounts); distinct from concepts/account-per-tenant-isolation which is tenant-axis. PII containment + blast-radius + compliance alignment.
- concepts/alert-fatigue — named failure mode the alarm-aggregation + multilayered-validation stack is designed around.
- concepts/copy-on-write-merge — the compaction strategy that Amazon BDT ran at exabyte scale in-house before the open table formats canonicalised the name.
- concepts/change-data-capture — the upstream workload shape driving all of this.
- concepts/task-and-actor-model — Ray's programming model, the specialist-enabling lower layer vs Spark's dataflow abstraction.
- concepts/locality-aware-scheduling, concepts/zero-copy-sharing, concepts/memory-aware-scheduling — the Ray-mechanism concepts that make specialist hand-crafted distributed algorithms beat generalists on specialist workloads.
- concepts/managed-data-plane — the operational-ownership-on-the-data-plane primitive that distinguishes Service Connect / VPC Lattice from App Mesh; canonical AWS instance of the control-plane-vs-data-plane orthogonal axis.
- concepts/mutual-tls — notable feature gap in Service Connect vs App Mesh at EOL-transition time; blocks regulated workloads from simple lift-and-shift.
- patterns/managed-sidecar — AWS-managed Service Connect Proxy vs customer-managed App Mesh Envoy sidecar; narrowed configurability (timeouts only) for full vendor-operated lifecycle.
- patterns/blue-green-service-mesh-migration — forced pattern for App Mesh → Service Connect because an ECS Service can't be in both meshes; edge traffic shifting via Route 53 / CloudFront continuous deployment / ALB multi-target-group.
- patterns/shadow-migration — the canonical dual-run reconciliation pattern, instantiated across Amazon BDT's multi-year Spark → Ray migration.
- patterns/subscriber-switchover — the per-consumer cutover pattern that earns rollback granularity after shadow migration.
- patterns/heterogeneous-cluster-provisioning — Amazon BDT's EC2 capacity pattern: discover an instance-type set, provision whichever is most available, keep workloads arch/hardware-agnostic.
- patterns/reference-based-copy-optimization — the "don't rewrite files the compaction didn't touch" optimisation that is a named contributor to Amazon BDT's 82% cost-efficiency gain.
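The four-stage multilayered-alarm-validation composition can be sketched as a chain of predicates over tracked detections. This is a minimal illustration, not the production pipeline: the thresholds, the `mask_area_ratio` stand-in for RLE-mask validation, and the field names are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple[int, int, int, int]  # x1, y1, x2, y2 in pixels
    seconds_in_zone: float          # persistence accumulated by the tracker
    mask_area_ratio: float          # hypothetical stand-in for RLE-mask validation


def in_zone(d: Detection, zone: tuple[int, int, int, int]) -> bool:
    """Stage 2: does the detection box overlap the configured zone?"""
    x1, y1, x2, y2 = d.box
    zx1, zy1, zx2, zy2 = zone
    return x1 < zx2 and zx1 < x2 and y1 < zy2 and zy1 < y2


def validate_alarm(d: Detection, zone, min_loiter=5.0, min_conf=0.6, min_mask=0.2) -> bool:
    """Compose the four stages: a per-frame detection becomes an alarm
    only if every stage passes, which is what makes the alarm auditable."""
    stages = [
        d.label == "person",                                         # 1: object detection
        in_zone(d, zone),                                            # 2: zone overlap
        d.seconds_in_zone >= min_loiter,                             # 3: loiter-time persistence
        d.confidence >= min_conf and d.mask_area_ratio >= min_mask,  # 4: confidence + mask check
    ]
    return all(stages)


zone = (50, 50, 200, 200)
loiterer = Detection("person", 0.9, (60, 60, 120, 180), 8.0, 0.4)  # passes all four
passerby = Detection("person", 0.9, (60, 60, 120, 180), 1.0, 0.4)  # fails loiter stage
```

Each stage cheaply discards a class of false positives before the next, which is the mechanism by which per-frame noise is kept away from operators (the alert-fatigue failure mode above).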
Multi-tenant configuration services (tagged-storage pattern):
- patterns/tagged-storage-routing — Strategy-Pattern factory dispatches storage requests to the best-fit backend based on the request key's prefix; adding a new backend is one new class + one map entry; canonical AWS pair is DynamoDB (high-frequency per-tenant) + Parameter Store (shared hierarchical).
- patterns/event-driven-config-refresh — EventBridge + Lambda + Cloud Map + gRPC pipeline pushes config updates into live service instances' in-memory caches within seconds without polling or restart; escape valve from the TTL-vs-staleness dilemma for shared-config workloads.
- patterns/jwt-tenant-claim-extraction — tenant context sourced exclusively from the validated Cognito JWT's immutable `custom:tenantId` claim; `tenantId` in request bodies / paths / headers is never read; cross-tenant access via body manipulation is structurally impossible.
- concepts/cache-ttl-staleness-dilemma — the forcing function the tagged-storage + event-driven-refresh composite resolves; TTL-based caches for rapidly-changing tenant metadata force an unacceptable stale-vs-amplified-load trade-off at multi-tenant scale.
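The tagged-storage-routing dispatch can be sketched as a Strategy-Pattern factory keyed on request-key prefixes. The backend classes here are in-memory stubs standing in for the real DynamoDB / Parameter Store clients, and the prefixes are invented for illustration:

```python
from abc import ABC, abstractmethod


class ConfigBackend(ABC):
    """Strategy interface: one concrete class per storage backend."""

    @abstractmethod
    def get(self, key: str) -> str: ...


class DynamoBackend(ConfigBackend):
    """High-frequency per-tenant config (stub; real code would call DynamoDB)."""

    def __init__(self, table: dict):
        self.table = table

    def get(self, key: str) -> str:
        return self.table[key]


class ParameterStoreBackend(ConfigBackend):
    """Shared hierarchical config (stub; real code would call SSM)."""

    def __init__(self, params: dict):
        self.params = params

    def get(self, key: str) -> str:
        return self.params[key]


class TaggedStorageRouter:
    """Dispatch by key prefix; adding a backend = one new class + one map entry."""

    def __init__(self):
        self._routes: dict[str, ConfigBackend] = {}

    def register(self, prefix: str, backend: ConfigBackend) -> None:
        self._routes[prefix] = backend

    def get(self, key: str) -> str:
        # Longest-prefix match so more specific routes shadow general ones.
        for prefix in sorted(self._routes, key=len, reverse=True):
            if key.startswith(prefix):
                return self._routes[prefix].get(key)
        raise KeyError(f"no backend registered for {key!r}")


router = TaggedStorageRouter()
router.register("tenant/", DynamoBackend({"tenant/acme/limit": "100"}))
router.register("shared/", ParameterStoreBackend({"shared/feature-flags/beta": "on"}))
```

The point of the shape is that callers never name a backend: retiring Parameter Store for some prefix, or adding a third store, touches only the registration map.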
Postgres consistency-model work:
- concepts/snapshot-isolation — the formal model Postgres's clustered implementation does not guarantee (surfaced by Jepsen 2025-04-29, acknowledged by AWS 2025-05-03).
- concepts/long-fork-anomaly — the specific SI violation Postgres exhibits; concurrent non-conflicting transactions observed in different orders by primary + replica.
- concepts/visibility-order-vs-commit-order — the mechanism: Postgres's commit path writes the WAL record, then asynchronously removes the xid from `ProcArray`.
- concepts/commit-sequence-number — the proposed upstream fix; multi-patch effort, PGConf.EU 2024 talk, AWS PostgreSQL Contributors Team participating.
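The visibility-order-vs-commit-order mechanism can be illustrated with a toy model (not Postgres source code): a commit is durable in the WAL before the xid leaves the `ProcArray`, so in that gap a primary-side snapshot and a WAL-replaying replica disagree about whether the transaction is visible:

```python
# Toy model of the two-step commit path: step 1 writes the WAL record,
# step 2 asynchronously removes the xid from the in-memory ProcArray.

wal: list[int] = []           # durable commit order
proc_array: set[int] = set()  # xids the primary still treats as in-progress


def begin(xid: int) -> None:
    proc_array.add(xid)


def wal_commit(xid: int) -> None:
    wal.append(xid)           # step 1: durable commit record


def finish_commit(xid: int) -> None:
    proc_array.discard(xid)   # step 2: asynchronous visibility flip


def primary_sees(xid: int) -> bool:
    # Primary snapshots consult the ProcArray, not the WAL.
    return xid in wal and xid not in proc_array


def replica_sees(xid: int) -> bool:
    # Replica replay follows strict WAL commit order.
    return xid in wal


begin(100)
wal_commit(100)                    # committed in the WAL...
gap_primary = primary_sees(100)    # ...but not yet visible on the primary
gap_replica = replica_sees(100)    # already visible via WAL replay
finish_commit(100)
```

With two such concurrent transactions, the primary and a replica can flip them visible in different orders, which is exactly the Long Fork anomaly; a commit-sequence number would make both sides derive visibility from the same WAL-ordered counter.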
AI trust / automated-reasoning productization:
- systems/bedrock-guardrails-automated-reasoning-checks — Bedrock safeguard that formally verifies LLM outputs against a customer-authored policy; preview-launched 2024-12-04 in US West (Oregon).
- concepts/autoformalization — natural-language → formal-spec translation pipeline; first public disclosure in the 2024-12-04 preview-launch post (document → concepts → units → logic → logical model); variable descriptions as the load-bearing accuracy-tuning surface.
- patterns/post-inference-verification — the canonical pattern Bedrock Guardrails AR checks productizes; three-verdict output (Valid / Invalid / Mixed) with structured suggestions; regenerate-with-feedback loop feeds the reasoner's natural-language rule descriptions back to the LLM as corrective prompts.
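The regenerate-with-feedback loop described above might look like the following sketch, where `llm` and `verify` are stub stand-ins (not Bedrock APIs) and only the three verdict strings come from the source:

```python
def llm(prompt: str) -> str:
    """Stub model: 'corrects' itself once rule feedback appears in the prompt."""
    return "approved" if "rule:" in prompt else "denied"


def verify(answer: str) -> tuple[str, list[str]]:
    """Stub automated-reasoning check: verdict (Valid / Invalid / Mixed)
    plus natural-language rule descriptions as structured suggestions."""
    if answer == "approved":
        return "Valid", []
    return "Invalid", ["rule: employees with 2+ years tenure are eligible"]


def answer_with_verification(prompt: str, max_retries: int = 2) -> tuple[str, str]:
    """Post-inference verification: generate, verify, and on a non-Valid
    verdict feed the reasoner's suggestions back as corrective prompt text."""
    answer = llm(prompt)
    for _ in range(max_retries):
        verdict, suggestions = verify(answer)
        if verdict == "Valid":
            return answer, verdict
        prompt = prompt + "\n" + "\n".join(suggestions)  # regenerate-with-feedback
        answer = llm(prompt)
    return answer, verify(answer)[0]


final, verdict = answer_with_verification("Is this employee eligible for leave?")
```

The structural point is that verification sits after inference and outside the model, so the same loop works for any generator; only the policy in the verifier changes.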
Digital sovereignty / cross-partition failover architecture:
- concepts/aws-partition — logically isolated group of AWS Regions with its own IAM; hard boundary for credentials, cross-region primitives, and service availability. The central primitive in sovereign-failover design.
- concepts/digital-sovereignty — demand-side framing: "managing digital dependencies — deciding how data, technologies, and infrastructure are used, and reducing the risk of loss of access, control, or connectivity." The human-driven-disaster class that pushes you across the partition boundary.
- concepts/disaster-recovery-tiers — backup / pilot light / warm standby / active-active canonical AWS DR ladder; same ladder applied across the partition axis, with pilot-light the cross-partition default.
- concepts/cross-partition-authentication — because IAM credentials don't cross, auth is explicit: IAM roles with trust + external IDs, STS regional endpoints, resource-based policies, cross-account roles via Organizations, federation from a centralized IdP (best practice).
- concepts/cross-signed-certificate-trust — "double-signed certificates" — per-partition root CAs cross-sign each other to enable authenticated cross-partition mTLS without violating partition isolation.
- patterns/cross-partition-failover — the overarching pattern: duplicate infrastructure across partitions + one of the DR tiers + per-partition IAM / PKI / Organizations / networking.
- patterns/pilot-light-deployment, patterns/warm-standby-deployment — two specific DR tiers endorsed for cross-partition.
- patterns/centralized-identity-federation — federate from a single IdP to all partitions; modern best practice for cross-partition auth; avoids per-partition IAM users.
Disaster recovery / resilience (within-partition):
- concepts/rpo-rto — the two DR budget dimensions; AWS DRS quantified at seconds RPO / 5–20 min RTO, AWS Backup at hours RPO / RTO; tier choice derived from the business-set RPO/RTO targets.
- concepts/crash-consistent-replication — block-level replica equivalent to a crash+reboot of the source; strictly weaker than app-consistent but achievable continuously — the consistency model AWS DRS uses for its seconds-RPO guarantee.
- concepts/cross-region-backup — fault-isolation axis (natural/technical disasters); the baseline multi-Region backup-copy primitive unified under AWS Backup.
- concepts/cross-account-backup — compromise-isolation axis (ransomware / malware / malicious insider); AWS Backup cross-account copy is the unified primitive; clean-room recovery account is the canonical target.
- concepts/clean-room-recovery-account — separate AWS account with distinct credentials as a ransomware/malware isolation boundary; sibling use of the AWS account boundary alongside concepts/account-per-tenant-isolation.
- concepts/dr-config-translation — restored resources have new identifiers (endpoints, ARNs); canonical mechanism is Route 53 private-hosted-zone CNAME indirection so applications keep resolving the old name to the new endpoint without config rewrites.
- patterns/block-level-continuous-replication — the continuous seconds-scale replication pattern AWS DRS implements; enables pilot-light + warm-standby tiers at seconds RPO.
- patterns/backup-and-restore-tier — the lowest DR tier on the ladder; AWS Backup + EventBridge + Lambda automation; hours-scale RPO/RTO, near-zero steady-state cost.
Event-driven architecture / schema governance:
- concepts/event-driven-architecture — architectural style where services communicate via asynchronous events on a shared bus; supersedes ad-hoc SNS / SQS pairs at org scale. The canonical AWS substrate is EventBridge.
- concepts/service-coupling — framing for the failure mode EDA addresses: tight-coupling cascade deadlocks. Amazon Key pre-migration exhibited exactly this — Service-A issues → timeouts + retries amplifying load → cross-service deadlock; single-device-vendor issues causing fleet-wide degradation.
- concepts/schema-registry — versioned contract store for event definitions; single source of truth for publishers and subscribers. EventBridge has a schema registry but no native validation; strict-validation customers build on top.
- patterns/single-bus-multi-account — one shared event bus in a central account + per-service-team accounts; DevOps owns bus + rules + targets, service teams own application stacks; logical separation via rules, not buses. AWS reference pattern.
- patterns/client-side-schema-validation — validate events in a shared client library rather than a centralized validation service; immediate developer feedback + no runtime network hop; addresses EventBridge's missing native-validation gap.
- patterns/reusable-subscriber-constructs — package subscriber infra as a versioned IaC construct library (CDK) — dedicated event bus + cross-account IAM + monitoring + alerting from ~5 lines. Amazon Key reports publisher/subscriber integration time 40h → 8h.
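The client-side validation idea above can be sketched in a few lines — a hand-rolled check inside a shared publish wrapper, standing in for the native validation EventBridge lacks; the contract shape and field names are invented for illustration:

```python
ORDER_CREATED_V1 = {  # hypothetical contract, vendored from the schema registry
    "required": {"orderId": str, "tenantId": str, "totalCents": int},
}

def validate_event(detail: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the event may be published."""
    errors = []
    for field, expected in contract["required"].items():
        if field not in detail:
            errors.append("missing field: " + field)
        elif not isinstance(detail[field], expected):
            errors.append(field + ": expected " + expected.__name__)
    return errors

def publish(detail: dict) -> None:
    errors = validate_event(detail, ORDER_CREATED_V1)
    if errors:
        # Immediate developer feedback, no network hop, no invalid event on the bus.
        raise ValueError("not published: " + "; ".join(errors))
    # events.put_events(Entries=[...])  # EventBridge call elided in this sketch
```

Shipping this wrapper as the shared client library is what makes validation uniform across publisher teams without a centralized validation service in the hot path.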
Fine-grained application authorization:
- systems/amazon-verified-permissions — managed Cedar policy engine for application authorization; the application-authz counterpart to IAM. IsAuthorized synchronous evaluation at "millisecond-level"; submillisecond end-to-end when fronted by API Gateway's authorizer-decision cache. Per-tenant policy stores are the idiomatic SaaS isolation primitive.
- systems/cedar — the policy language, public extraction of AWS's decade of internal policy-semantics work (see systems/aws-policy-interpreter). Analyzable by design. Combines RBAC + ABAC + ReBAC in one language.
- systems/amazon-cognito — identity substrate paired with AVP across Convera's four authorization flows; user pool for customers, machine-to-machine user pool for service-to-service, per-tenant pool for multi-tenant. Pre-token-generation Lambda hook enriches JWTs at issue time.
- systems/amazon-api-gateway — ingress tier hosting the patterns/lambda-authorizer; built-in authorizer-decision cache delivers submillisecond repeat-request latency.
- systems/okta — external enterprise IdP; federated-to by Cognito in Convera's internal-user flow (patterns/centralized-identity-federation).
Fine-grained application authorization — concepts / patterns:
- concepts/fine-grained-authorization — per-resource, per-action, context-aware authorization (vs coarse-grained role-to-endpoint); the evaluation model Cedar + AVP deliver.
- concepts/attribute-based-access-control — ABAC as the idiomatic fine-grained authz realization; Cedar combines ABAC with RBAC and ReBAC in one language.
- concepts/policy-as-data — Cedar policies in a DynamoDB source of truth + DynamoDB Streams continuously sync into AVP policy stores; authorship gated by a regulated IAM role owned by infosec.
- concepts/tenant-isolation — five-layer enforcement chain for Convera's multi-tenant SaaS (identity → token → authorization → routing → data); a bug in any one layer can't leak across tenants.
- concepts/zero-trust-authorization — every tier that handles a privileged request independently re-verifies; production instance in Convera's backend pods that re-call AVP before hitting RDS.
- concepts/authorization-decision-caching — two-level cache (API Gateway authorizer-decision + app-level Cognito token) delivers submillisecond repeat-request latency.
- concepts/token-enrichment — push per-user attribute lookup off the hot path by injecting attributes into the JWT at issue time (via the pre-token hook).
- patterns/lambda-authorizer — Lambda in front of API Gateway evaluating Cedar via AVP; the hot-path authz compute across all four Convera flows.
- patterns/per-tenant-policy-store — AVP idiom for multi-tenant SaaS: one policy store per tenant, tenant_id → policy-store-id lookup from DynamoDB. Chosen for isolation, per-tenant schema/template customization, easy onboarding/offboarding, and per-tenant resource quotas.
- patterns/pre-token-generation-hook — Cognito Lambda trigger that enriches the JWT with authorization-relevant attributes from RDS / DynamoDB at login time.
- patterns/zero-trust-re-verification — backend re-runs AVP against the tenant's policy store before data access; data layer (RDS) is further configured to accept only tenant-scoped requests.
- patterns/machine-to-machine-authz — same Lambda-authorizer shape reused for service-to-service via Cognito's OAuth client-credentials flow; per-service policy stores.
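How the per-tenant-policy-store lookup and the Lambda-authorizer hot path compose can be sketched as follows — the tenant table and store IDs are hypothetical, and the request dict mirrors (but does not guarantee) the shape AVP's IsAuthorized API expects:

```python
TENANT_POLICY_STORES = {  # stand-in for the DynamoDB tenant_id -> policy-store-id table
    "tenant-a": "ps-1111",
    "tenant-b": "ps-2222",
}

def build_avp_request(jwt_claims: dict, action: str, resource: str) -> dict:
    """Assemble the per-tenant IsAuthorized request inside the Lambda authorizer.
    Tenant identity comes only from the validated token, never from request params."""
    tenant_id = jwt_claims["custom:tenantId"]
    store_id = TENANT_POLICY_STORES[tenant_id]  # unknown tenant -> KeyError -> deny
    return {
        "policyStoreId": store_id,
        "principal": {"entityType": "User", "entityId": jwt_claims["sub"]},
        "action": {"actionType": "Action", "actionId": action},
        "resource": {"entityType": "Resource", "entityId": resource},
    }

# The authorizer would pass this to avp.is_authorized(**request) and emit an
# allow policy for API Gateway only when the returned decision is "ALLOW".
```

The same shape serves patterns/zero-trust-re-verification: backend pods rebuild the request against the tenant's store before touching RDS.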
Internal developer platform / platform engineering at enterprise scale:
- patterns/platform-engineering-investment — second canonical production instance on the AWS blog (after ProGlove) via Santander Catalyst; large-enterprise regulated-industry counterpart to ProGlove's small-team SaaS instance. Kubernetes-native substrate on EKS instead of AWS-Organizations-native.
- patterns/developer-portal-as-interface — Santander's in-house developer portal as the unified self-service surface hiding EKS / Crossplane / ArgoCD / OPA behind one interface; "Platform APIs become the internal product" in concrete form.
- patterns/crossplane-composition — XRDs + Compositions as the unit of reuse for the stacks catalog; Kubernetes-native realization of patterns/golden-path-with-escapes at multi-cloud-infrastructure level.
- patterns/policy-gate-on-provisioning — OPA Gatekeeper as a K8s admission controller enforcing compliance + security on every Crossplane claim at manifest-submission time; shift-left compliance; the regulated-industry counterpart to SCPs in ProGlove's AWS-Organizations-based shape.
- concepts/universal-resource-provisioning — Crossplane's abstraction: every cloud / SaaS resource as a K8s CR reconciled by a controller; uniform API + RBAC + GitOps across clouds.
- concepts/gitops — Git as declarative source of truth, continuous-reconcile controllers; ArgoCD the canonical K8s-native realization; Catalyst's application-delivery contract.
- concepts/control-plane-data-plane-separation — Catalyst's first wiki instance of the split at the infrastructure-provisioning tier: the EKS cluster decides, provisioned AWS (and multi-cloud) resources are the data plane.
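The shift-left gate in patterns/policy-gate-on-provisioning is, at its core, a predicate over the submitted manifest. A toy Python rendering — the real enforcement is an OPA Gatekeeper admission constraint, and these two rules are invented for illustration:

```python
def admit(claim: dict) -> tuple:
    """Reject a Crossplane claim at manifest-submission time, before anything
    provisions. Two invented rules standing in for real Gatekeeper constraints."""
    labels = claim.get("metadata", {}).get("labels", {})
    if "cost-center" not in labels:
        return False, "denied: missing cost-center label"
    if claim.get("spec", {}).get("parameters", {}).get("encrypted") is not True:
        return False, "denied: storage must be encrypted at rest"
    return True, "allowed"
```

Because the check runs in the admission path, a non-compliant claim never reaches the Crossplane controller — compliance feedback arrives in seconds, not after a failed audit.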
AI-for-ops / AI-powered incident response:
- concepts/telemetry-based-resource-discovery — AWS DevOps Agent's core methodology: combine a Kubernetes API scan (the graph nodes: Pods / Deployments / Services / ConfigMaps / Ingress / NetworkPolicies with their metadata) with OpenTelemetry-derived runtime relationships (the weighted edges: service-mesh traffic, distributed traces, metric attribution) into a fused dependency graph used for investigation. Neither path alone is sufficient — the static API gives you the graph, telemetry tells you which edges are alive and misbehaving.
- concepts/agentic-troubleshooting-loop — iterative LLM ↔ tool-assistant investigation cycle; LLM proposes diagnostic queries, tool assistant executes them against live system state, output re-enters LLM context, repeats until the LLM judges enough context for resolution. Canonical wiki reference is the 2025-12-11 AWS conversational-observability blueprint; the 2026-03-18 AWS DevOps Agent post is the managed-service realization of the same primitive with a structured discovery step added on top.
- patterns/telemetry-to-rag-pipeline — streaming operational telemetry into a vector store for LLM augmentation; canonical AWS shape is Fluent Bit → Kinesis Data Streams → Lambda (batched) + Bedrock Titan Embeddings v2 → OpenSearch Serverless (hot) or S3 Vectors (cold). Sanitize-before-embedding is named as the vector-store governance boundary.
- patterns/allowlisted-read-only-agent-actions — constrain an LLM-driven agent's side effects to a static allowlist of read-only verbs (kubectl get / describe / logs / events) + platform-layer RBAC. Canonical AWS realization is the in-cluster troubleshooting assistant pod in the 2025-12-11 blueprint; defense-in-depth via two-layer enforcement (app allowlist + K8s RBAC).
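The two-layer enforcement above starts with an app-level verb allowlist; a minimal sketch of that first layer (the K8s RBAC layer is not modeled here):

```python
ALLOWED_VERBS = {"get", "describe", "logs", "events"}  # the blueprint's read-only set

def vet(cmd: list) -> bool:
    """App-layer allowlist, layer one of two (layer two is the pod's K8s RBAC).
    Only bare `kubectl <read-only-verb> ...` forms pass; mutating verbs and
    flag-first invocations are rejected outright rather than parsed."""
    return len(cmd) >= 2 and cmd[0] == "kubectl" and cmd[1] in ALLOWED_VERBS
```

The defense-in-depth point: even if the allowlist is bypassed, the pod's RBAC role grants no write verbs, so a mutating call still fails at the API server.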
Agentic AI development (developer-side feedback loops):
- concepts/agentic-development — development model where the AI agent "writes, tests, deploys, and refines code through rapid feedback cycles", not just suggests snippets. Inner-loop driver, not outer-loop. The 2026-03-26 AWS post's central reframing: agentic coding is gated on architecture, not prompt quality.
- concepts/fast-feedback-loops — the primary architectural constraint of agentic development; each unvalidated change should use the cheapest tier that can falsify it. Five tiers named: local emulation → offline data/ML dev → hybrid cloud → preview env → production deploy.
- concepts/local-emulation — umbrella concept over SAM's sam local start-api, same-image container run, DynamoDB Local, Glue Docker images. Cheapest feedback tier; API-shape parity with real services.
- concepts/contract-first-design — OpenAPI specifications authored upfront so agents validate integrations before sibling services are implemented; pairs with preview environments.
- concepts/hexagonal-architecture — codebase layer discipline (/domain no Amazon deps, /application orchestration, /infrastructure adapters). The precondition that makes domain-layer unit tests run without cloud credentials.
- concepts/project-rules-steering — architectural constraints / coding conventions as Markdown the agent consults automatically. First AWS source pinning .kiro/steering/*.md + Markdown format — Kiro's concrete surface.
- concepts/machine-readable-documentation — AGENT.md / RUNBOOK.md / CONTRIBUTING.md + YAML-over-prose as the broader design principle; project rules as one realization.
- patterns/local-emulation-first — prefer local emulator over cloud deployment as the default feedback path; canonical four realizations (SAM / containers / DynamoDB Local / Glue Docker).
- patterns/hybrid-cloud-testing — for services without local emulators (SNS / SQS named), define minimal CFN / CDK stacks and invoke via SDK. Cloud is "another test dependency — used sparingly and predictably".
- patterns/ephemeral-preview-environments — short-lived IaC-defined stacks, deployed on demand, torn down after E2E validation. The above-hybrid-cloud tier.
- patterns/layered-testing-strategy — unit (domain, fast) / contract (interfaces) / smoke (deployed env). Each tier catches a distinct failure class.
- patterns/tests-as-executable-specifications — tests do more than catch regressions — they define acceptable behavior; a failing test teaches the agent what's expected. Sibling of patterns/executable-specification at the test-suite tier.
- patterns/ci-cd-agent-guardrails — required tests + automated reviews + branch protections + preview-env validation + human gates for high-impact changes; expand agent autonomy as confidence compounds.
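The hexagonal split several of these patterns assume can be shown in miniature — the discount rule and repository are invented; the point is that the domain function imports nothing from AWS and therefore tests offline, which is what gives the agent its fast feedback tier:

```python
# domain/ — pure business rule; no Amazon dependencies, unit-testable offline.
def apply_discount(total_cents: int, loyalty_years: int) -> int:
    rate_pct = min(loyalty_years, 5) * 2        # invented rule: 2%/year, capped at 10%
    return total_cents - total_cents * rate_pct // 100

# infrastructure/ — the only layer allowed to know about DynamoDB.
class DynamoOrderRepository:
    def __init__(self, table):                  # `table` would be a boto3 Table resource
        self._table = table
    def save(self, order_id: str, total_cents: int) -> None:
        self._table.put_item(Item={"pk": "ORDER#" + order_id, "total": total_cents})

# application/ — orchestration: wires the domain rule to an adapter. Any object
# with .save works, so tests inject a fake instead of a cloud client.
def checkout(repo, order_id: str, total_cents: int, loyalty_years: int) -> int:
    final = apply_discount(total_cents, loyalty_years)
    repo.save(order_id, final)
    return final
```

An agent iterating on the discount rule never needs cloud credentials; only changes touching the infrastructure adapter climb to the hybrid-cloud or preview-environment tiers.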
Recent articles¶
Most recent first; ingested AWS blog posts, not republications from companies/allthingsdistributed.
- 2026-04-07 — AWS News Blog, Launching S3 Files, making S3 buckets accessible as file systems (Sébastien Stormacq). The customer-visible launch announcement for Amazon S3 Files — operational/product-launch companion to Warfield's same-day design essay on companies/allthingsdistributed. Key concrete numbers this post owns (the design essay omitted them): NFS v4.1+ as the wire protocol, ~1 ms latency for active data on the high-performance storage tier, and Amazon EFS confirmed explicitly as the under-the-covers backing for that tier. Introduces the Mount Target as a new deployment primitive — a VPC network endpoint between compute (EC2 / ECS / EKS / Lambda) and the S3 file system; console auto-creates mount targets, CLI requires two commands (create-file-system + create-mount-target). Mount syntax is two shell lines (mount -t s3files fs-<id>:/ <mount-point>). Performance-tier split inside S3 Files: files that benefit from low-latency access are stored/served from the high-performance (EFS-backed) tier; files needing large sequential reads are served directly from Amazon S3 to keep the full S3 throughput envelope without leaving read(2). Byte-range reads transfer only requested bytes for random-access patterns. Intelligent pre-fetching with per-file customer control over load-full-data-or-metadata-only (the concepts/lazy-hydration surface from the design side). Bidirectional sync cadence: file → S3 within minutes; S3 → file within seconds, occasionally a minute+ (matches concepts/stage-and-commit's ~60 s commit interval from the design side). Amazon FSx positioning made explicit — S3 Files targets interactive shared access to S3-resident data; [[systems/aws-fsx|FSx]] remains the path for on-prem NAS migration, HPC/GPU cluster storage (Lustre), and NetApp ONTAP / OpenZFS / Windows File Server-specific capabilities. Agentic AI called out as a first-class target workload — "building agentic AI systems that rely on file-based Python libraries and shell scripts" — matching Warfield's "coding agent reasoning over a dataset" framing. Pricing is four line items: file-system storage, small-file reads, all writes, and S3 requests during sync (large-file direct-S3 reads fall under normal S3 GET pricing). GA in all commercial AWS Regions at launch. Customer-facing launch — no SLOs, no breakdown of the high-performance-vs-direct-S3 per-file routing heuristic, no 50 M-object mount warning or 30-day eviction mentioned (those are in Warfield's essay). Extends systems/s3-files, systems/aws-efs, systems/aws-fsx, systems/aws-s3; extends concepts/file-vs-object-semantics, concepts/boundary-as-feature, concepts/stage-and-commit, concepts/lazy-hydration; extends [[patterns/presentation-layer-over-storage]], patterns/explicit-boundary-translation. Source: sources/2026-04-07-aws-s3-files-mount-any-s3-bucket-as-a-nfs-file-system-on-ec2-ecs.
- 2026-03-26 — AWS Architecture Blog, Architecting for agentic AI development on AWS. Prescriptive essay on how to architect AWS systems so AI coding agents can operate effectively. Thesis reframes "better AI coding" from a prompt problem to an architectural problem: "The solution is not better prompts, it's an architecture that treats fast feedback and clear boundaries as first-class concerns." Two co-equal axes: (1) system architecture for fast agentic feedback — local emulation as default via SAM's sam local start-api (Lambda + API Gateway), same-image container run for ECS / Fargate, DynamoDB Local for CRUD tests, AWS Glue Docker images for data pipelines; offline development for data/ML workloads as the same shape for reduced-data iteration; hybrid cloud testing for services without emulators (SNS/SQS named) — minimal CloudFormation / CDK stacks invoked via SDK, torn down after; preview environments as short-lived IaC stacks for E2E validation; contract-first design via OpenAPI so agents can validate integrations before all services ship. (2) Codebase architecture for AI-friendly development — hexagonal architecture with explicit /domain + /application + /infrastructure layers (domain has no Amazon deps); project rules / steering files at .kiro/steering/*.md (Kiro's concrete surface — first AWS source to pin the path + Markdown format; example rule: "database access must go through repository classes in the infrastructure layer"); tests as executable specifications across unit + contract + smoke tiers (patterns/layered-testing-strategy); concepts/monorepo + machine-readable docs (AGENT.md / RUNBOOK.md / CONTRIBUTING.md + YAML-over-prose); CI/CD guardrails that scale agent autonomy over time while keeping humans in the loop for high-impact decisions. Prescriptive — no production numbers, no customer case, no retrospective, no steering-file schema / rule precedence / conflict resolution, no positioning vs AWS Well-Architected, no failure-mode taxonomy for emulator-vs-real drift, no cost model for preview environments, no many-agent scaling story. Introduces systems/aws-sam, systems/dynamodb-local, systems/aws-fargate; concepts/agentic-development, concepts/fast-feedback-loops, concepts/local-emulation, concepts/contract-first-design, concepts/hexagonal-architecture, concepts/project-rules-steering, concepts/machine-readable-documentation; patterns/local-emulation-first, patterns/ephemeral-preview-environments, patterns/hybrid-cloud-testing, patterns/layered-testing-strategy, patterns/tests-as-executable-specifications, patterns/ci-cd-agent-guardrails; extends systems/aws-lambda (canonical local-emulation target via SAM), systems/amazon-api-gateway (SAM local-emulation front), systems/amazon-ecs (same-image local-run discipline), systems/aws-glue (data-workload instance of local-emulation-first via Docker images), systems/aws-sns / systems/aws-sqs (canonical no-local-emulator services driving hybrid-cloud-testing), systems/aws-cloudformation / systems/aws-cdk (IaC substrate for hybrid dev stacks + preview environments), systems/aws-iam (runtime-only config surface smoke tests exist to catch), systems/kiro (first AWS source pinning .kiro/steering/ path + Markdown format), concepts/monorepo (agent-context enabler), concepts/specification-driven-development (project rules as codebase-hosted corner of the spec-driven stack). → sources/2026-03-26-aws-architecting-for-agentic-ai-development-on-aws
- 2025-11-26 — AWS Architecture Blog, Secure Amazon Elastic VMware Service (Amazon EVS) with AWS Network Firewall. Reference-architecture post on deploying centralised network inspection for Amazon EVS (managed VMware Cloud Foundation on EC2 bare-metal in a customer VPC) using AWS Network Firewall + AWS Transit Gateway. Load-bearing architectural ideas: (1) Network Firewall as a bump-in-the-wire middlebox inserted into the traffic path by route-table updates (VPC + TGW), not application changes; (2) the native TGW ↔ Network Firewall integration (GA July 2025) auto-provisions the inspection-VPC subnets / route tables / endpoints and creates a TGW attachment of resource type Network Function with Appliance Mode automatically enabled — closing the historic stateful-inspection-across-AZs landmine; (3) traffic is forced through the firewall by the [[patterns/pre-inspection-post-inspection-route-tables|pre-inspection / post-inspection TGW route-table split]]: all VPC + Direct Connect Gateway attachments associate with the pre-inspection RT (0.0.0.0/0 → firewall attach); the firewall attachment associates with the post-inspection RT, which holds per-destination static routes back to each VPC / on-prem CIDR; (4) one topology inspects east-west (EVS↔VPC, VPC↔VPC), north-south (VPC↔on-prem via DXGW, VPC↔internet via dedicated egress VPC with NAT, internet→VPC via dedicated ingress VPC with ALB), and on-prem↔internet; (5) EVS's NSX overlay segments (192.168.0.0/19 summarised) are propagated into AWS-native VPC route tables by Amazon VPC Route Server speaking BGP with the NSX edge — without this, TGW has no route to the VM CIDRs and east-west inspection blackholes for VM traffic. FQDN-based egress filtering via Network Firewall's Domain-list stateful rule group demonstrated (matches SNI for HTTPS, Host header for HTTP) — sibling primitive to the SNI-based egress filtering already documented at the VPC-egress scale. East-west ICMP drop + ingress HTTP alert demonstrate standard stateful 5-tuple rule shapes. Two CloudWatch log groups (alert + flow) as the canonical logging convention. Default route-table association + propagation explicitly deselected on TGW so new attachments can't accidentally bypass inspection. No production scale numbers (RPS, bandwidth, cost) — reference architecture with demo CIDRs. Introduces systems/amazon-evs, systems/aws-vpc-route-server; concepts/centralized-network-inspection, concepts/bump-in-the-wire-middlebox, concepts/tgw-appliance-mode; patterns/pre-inspection-post-inspection-route-tables; extends systems/aws-network-firewall (new centralised-inspection-via-TGW-native-attachment section + FQDN Domain-list rule-group section + logging conventions), systems/aws-transit-gateway (new centralised-inspection hub section + Appliance Mode automatic-enablement), systems/aws-direct-connect (new DXGW-as-TGW-attachment-in-inspection-path section), concepts/egress-sni-filtering (FQDN / Host-header matching as sibling primitive to SNI, centralised-inspection scale-out).
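The pre-inspection / post-inspection route-table split is easiest to see as data: every non-firewall attachment sees only a default route to the firewall, and only the firewall's table knows real destinations. A toy lookup with hypothetical attachment IDs and the post's demo CIDRs:

```python
FIREWALL = "tgw-attach-firewall"                 # hypothetical attachment IDs throughout
PRE_INSPECTION_RT = {"0.0.0.0/0": FIREWALL}      # associated with every VPC/DXGW attachment
POST_INSPECTION_RT = {                           # associated with the firewall attachment only
    "10.1.0.0/16": "tgw-attach-vpc-a",
    "10.2.0.0/16": "tgw-attach-evs",
    "192.168.0.0/19": "tgw-attach-evs",          # NSX overlay summary, via VPC Route Server
}

def next_hop(src_attachment: str, dst_cidr: str) -> str:
    """First hop the TGW picks: non-firewall sources only ever see the firewall,
    so no attachment can route around inspection."""
    table = POST_INSPECTION_RT if src_attachment == FIREWALL else PRE_INSPECTION_RT
    return table.get(dst_cidr, table.get("0.0.0.0/0", "blackhole"))
```

This also shows why the Route Server BGP propagation matters: drop the 192.168.0.0/19 entry from the post-inspection table and return traffic to EVS VMs blackholes.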
- 2026-04-08 — AWS Architecture Blog, Build a multi-tenant configuration system with tagged storage patterns. Reference-architecture post on a multi-tenant configuration service composed of three architectural primitives worth extracting: (1) tagged-storage routing — a Strategy-Pattern factory keyed on the request's key prefix dispatches each call to the appropriate storage backend (tenant_config_* → DynamoDB for high-frequency per-tenant reads; param_config_* → Parameter Store for shared hierarchical config); adding a third backend (Secrets Manager, S3) is one new strategy class plus a map entry; (2) event-driven config refresh — Parameter Store writes emit EventBridge events, a Lambda receives them, queries Cloud Map for healthy Config Service instances, and pushes fresh values via gRPC refresh RPCs into in-place in-memory caches — escape valve from the TTL-vs-staleness dilemma without polling or service restarts; (3) JWT tenant-claim extraction — service never accepts tenantId from request parameters; extracted exclusively from the validated Cognito JWT's immutable custom:tenantId claim, making cross-tenant access structurally impossible via request-body manipulation. Stack: NestJS gRPC services on ECS Fargate in private subnets behind ALB / VPC Link / API Gateway / WAF / Cognito; shared IAM execution role (TVM/STS-based per-tenant credentials pitched as next-step for compliance-grade isolation at 50-100 ms-per-op cost). DynamoDB composite key pk = TENANT#{id} / sk = CONFIG#{type} provides data-layer tenant isolation. Multi-dimensional extension pk = TENANT#{id}|SERVICE#{name} for service-level isolation in a shared account. DAX named for microsecond-latency acceleration at 1000+ RPS (5-10× vs single-digit-ms DynamoDB baseline). Cache-security discipline: tenant-prefixed keys, credentials + PII never cached, JWT validation + DynamoDB composite keys as final enforcement boundary. Positioned as the lightest shape on the wiki's tenant-isolation spectrum (single shared account, app + JWT enforcement; sits between Convera's in-account multi-layer and ProGlove's account-per-tenant). No production scale numbers disclosed — reference architecture with CloudFormation sample on GitHub. Introduces systems/aws-parameter-store; concepts/cache-ttl-staleness-dilemma; patterns/tagged-storage-routing, patterns/event-driven-config-refresh, patterns/jwt-tenant-claim-extraction; extends systems/dynamodb (composite-key tenant isolation + multi-dimensional extension), systems/amazon-cognito (immutable custom attribute detail), systems/amazon-eventbridge (Parameter Store change-event integration), systems/aws-lambda (invalidator compute role), systems/grpc (refresh-RPC transport + internal service-to-service), systems/aws-cloud-map (refresh-time service discovery), systems/amazon-ecs, systems/aws-fargate, systems/aws-secrets-manager (future-extension backend), systems/aws-systems-manager (Parameter Store as sub-service), concepts/tenant-isolation (shape-spectrum lightest-entry row), concepts/event-driven-architecture (config-refresh instance). → sources/2026-04-08-aws-build-a-multi-tenant-configuration-system-with-tagged-storage-patterns
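The tagged-storage routing factory from the 2026-04-08 post can be sketched in a few lines — the strategy classes are stand-ins for the real DynamoDB / Parameter Store clients:

```python
class DynamoStrategy:                       # stand-in for the DynamoDB-backed store
    def get(self, key: str) -> str:
        return "dynamodb:" + key            # high-frequency per-tenant reads

class ParameterStoreStrategy:               # stand-in for SSM Parameter Store
    def get(self, key: str) -> str:
        return "ssm:" + key                 # shared hierarchical config

# Adding Secrets Manager or S3 later = one new class + one new map entry.
STRATEGIES = {
    "tenant_config_": DynamoStrategy(),
    "param_config_": ParameterStoreStrategy(),
}

def route(key: str):
    """Strategy-Pattern factory keyed on the request's key prefix."""
    for prefix, strategy in STRATEGIES.items():
        if key.startswith(prefix):
            return strategy
    raise KeyError("no storage strategy for " + repr(key))
```

Callers never name a backend; the prefix carries the routing decision, which is what keeps backend additions a local change.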
- 2026-04-06 — AWS Architecture Blog, Unlock efficient model deployment: Simplified Inference Operator setup on Amazon SageMaker HyperPod. Product-announcement framing for the HyperPod Inference Operator shipping as a native EKS add-on (replacing the prior Helm-chart install path) with a built-in helm_to_addon.sh migration script. Four architectural primitives worth extracting sit inside the one-click-install marketing frame: (1) multi-instance-type fallback via Kubernetes node affinity — InferenceEndpointConfig.spec.instanceTypes takes a priority-ordered list (["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]); compiles to requiredDuringSchedulingIgnoredDuringExecution to restrict placement + preferredDuringSchedulingIgnoredDuringExecution with descending weights for priority; scheduler silently falls back to the next-preferred type on capacity miss — structural answer to GPU-capacity-constrained placement. Raw Kubernetes nodeAffinity also exposed directly on the CRD for custom scheduling (Spot exclusion / AZ preference / custom labels). (2) Managed tiered KV cache as a platform capability (implicit memory hierarchy across GPU HBM / host DRAM / SSD); claimed up to 40% inference-latency reduction for long-context workloads (methodology not disclosed). First wiki instance of managed KV cache as an add-on feature rather than a model-server-library concern. (3) Intelligent routing — three strategies (prefix-aware / KV-aware / round-robin) selected at install time; prefix-aware routes requests sharing a common prompt prefix to the same replica so the KV cache hits on the second request; KV-aware reads live cache-occupancy telemetry before routing. Specialisation of concepts/workload-aware-routing for LLM inference. (4) EKS add-on as lifecycle-packaging — converts the previously-Helm Kubernetes operator + its dependency add-ons (S3 Mountpoint CSI driver / FSx CSI driver / cert-manager / metrics-server) + IAM scaffolding + S3/VPC-endpoint prereqs into a native EKS add-on with managed version bumps, rollback, and a one-shot helm_to_addon.sh migration script (auto-discovery of existing Helm config, OVERWRITE flag, backup files in /tmp/hyperpod-migration-backup-<timestamp>/, tags migrated ALBs / ACM certs / S3 objects with CreatedBy: HyperPodInference). Four IAM roles carved on install (Execution Role + JumpStart Gated Model Role + ALB Controller Role + KEDA Operator Role) as the least-privilege default posture the managed installer produces. Two first-class CRDs under inference.sagemaker.aws.amazon.com/v1: InferenceEndpointConfig (bring-your-own-model from S3) and JumpStartModel (managed catalog, modelId + instanceType). Observability via Amazon Managed Grafana — time-to-first-token (TTFT), end-to-end latency, GPU utilization, cache performance, routing efficiency. Quantitative disclosure minimal / un-methodologied: "hours before a single model can serve predictions" → "within minutes of cluster creation" for install; 40% KV-cache latency reduction hedged as "up to" with no baseline. Borderline Tier-1 ingest — product-PR genre, four transferable architectural primitives keep it above the scope filter. Extends concepts/managed-data-plane (Kubernetes-operator-layer instance — sits on same spectrum as App Mesh → Service Connect and EKS Auto Mode), concepts/shared-responsibility-model (responsibility line moves one layer into previously-customer-operated Helm/IAM/dependency scaffolding). → sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod
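The multi-instance-type fallback can be sketched as the affinity-compilation step — the required/preferred split is as the post describes, but the descending weight scheme here is an invented stand-in (the operator's actual weights are not published):

```python
LABEL = "node.kubernetes.io/instance-type"  # well-known Kubernetes node label

def instance_type_affinity(instance_types: list) -> dict:
    """Compile a priority-ordered instance-type list into nodeAffinity:
    the required term restricts placement to the set; preferred terms weight
    earlier entries higher so the scheduler silently falls back on capacity miss."""
    return {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{
                "matchExpressions": [{"key": LABEL, "operator": "In",
                                      "values": list(instance_types)}],
            }],
        },
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {"weight": 100 - 10 * i,   # invented: 100, 90, 80, ... (valid range 1-100)
             "preference": {"matchExpressions": [{"key": LABEL, "operator": "In",
                                                  "values": [t]}]}}
            for i, t in enumerate(instance_types)
        ],
    }
```

The required term guarantees a pod never lands off-list; the preferred terms are what make the fallback "silent" — a p4d capacity miss just scores the g5 nodes next.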
- 2026-04-01 — AWS Architecture Blog, Automate safety monitoring with computer vision and generative AI. Serverless + event-driven CV + GenAI safety-monitoring solution at scale. Target fleet 10,000+ cameras; end-to-end image-capture-to-notification up to 37 s. Core architectural content: (1) serverless driver-worker pattern with one-worker-per-use-case for independent scaling + fault isolation (each use case has its own SNS/SQS/SageMaker endpoint chain + DLQ); (2) real SageMaker Serverless → Serverful inference pivot at production scale — Serverless inference's no-GPU-support + 6 GB memory ceiling caused OOMs at hundreds of sites → migration to ml.g6 Serverful endpoints + auto scaling + raised Lambda concurrent-execution limits + SQS batch-size tuning; (3) multi-account AWS isolation — training, ingest, web-app, analytics environments each in distinct accounts; raw PII-bearing images purged within days after Rekognition-based anonymisation; (4) four-stage intelligent alarm detection: object detection → zone overlap against "digital tape" (configurable 50% threshold) → loiter-time algorithm tracking same-object persistence via mask similarity across consecutive minute intervals (per-zone acceptable-loiter-time tuning) → multilayered validation (confidence thresholds + RLE mask comparison for cross-interval consistency); (5) per-camera-per-use-case risk aggregation to avoid alert fatigue — append new occurrences to open records; scheduled auto-close on resolution; SLA escalation through per-zone preferred channels (Slack / email / ticket); (6) [[patterns/data-driven-annotation-curation|data-driven ground-truth curation]] — Athena aggregates false-positive rates across camera types + deployment conditions for prioritised retraining + surfaces below-confidence-threshold inferences + Claude multimodal on Bedrock detects underrepresented object classes (replaces blanket per-site daily annotation jobs that became untenable); (7) GLIGEN-based synthetic-data generation on SageMaker Batch Transform producing a 75,000-image PPE dataset (3 classes) and a 75,000-image Housekeeping dataset (7 classes) at 512×512 with ground-truth bounding boxes auto-embedded; YOLOv8 trained on PyTorch 2.1 + cosine LR + AdamW reached 99.5% mAP@50 / 100% precision / 100% recall for PPE + 94.3% mAP@50 for Housekeeping without a single manually-annotated real image; (8) training-pipeline / model-promotion decoupling — GT job creation via Step Functions triggered by EventBridge cadence + 7-step SageMaker AI Pipelines (load checkpoint → prep+split → train → drift baseline → evaluate → package → register); model-approval EventBridge event triggers a Lambda to open a code review updating the endpoint's S3 URI — so scientists approve on metrics, engineers merge + deploy via CI/CD; (9) tape-labeling synthetic-composite preparation — hourly Step Functions workflow stitches clear portions of multiple time-shifted camera frames using a voting mechanism (pixel regions with no detected objects) into a composite image where floor tapes are fully visible, solving the occluded-tape annotation problem on newly-onboarded cameras.
Introduces systems/gligen, systems/yolo, systems/aws-sagemaker-ground-truth, systems/aws-sagemaker-pipelines, systems/aws-sagemaker-endpoint, systems/aws-sagemaker-batch-transform, systems/amazon-rekognition, systems/amazon-aurora, systems/amazon-cloudfront, systems/aws-appsync, systems/aws-waf, systems/amazon-quicksight, systems/amazon-redshift-spectrum; concepts/alert-fatigue; patterns/serverless-driver-worker, patterns/multilayered-alarm-validation, patterns/alarm-aggregation-per-entity, patterns/data-driven-annotation-curation, patterns/synthetic-data-generation, patterns/multi-account-isolation; extends systems/aws-sagemaker-ai, systems/aws-lambda, systems/aws-sns, systems/aws-sqs, systems/aws-s3, systems/aws-step-functions, systems/amazon-eventbridge, systems/dynamodb, systems/amazon-bedrock, systems/amazon-athena, systems/amazon-route53, concepts/event-driven-architecture, concepts/serverless-compute, concepts/tenant-isolation, concepts/blast-radius. → sources/2026-04-01-aws-automate-safety-monitoring-with-computer-vision-and-generative-ai
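The zone-overlap stage of the four-stage alarm pipeline reduces to an intersection-over-object-area test against the configurable 50% threshold; a minimal sketch with axis-aligned pixel boxes:

```python
def overlap_fraction(box: tuple, zone: tuple) -> float:
    """Fraction of the detected object's box that lies inside the zone.
    Boxes/zones are (x1, y1, x2, y2) in pixels."""
    ix = max(0, min(box[2], zone[2]) - max(box[0], zone[0]))  # intersection width
    iy = max(0, min(box[3], zone[3]) - max(box[1], zone[1]))  # intersection height
    area = (box[2] - box[0]) * (box[3] - box[1])
    return (ix * iy) / area if area else 0.0

def in_violation(box: tuple, zone: tuple, threshold: float = 0.5) -> bool:
    """Stage-two check: the object counts as inside the digital tape once
    its overlap meets the (configurable, default 50%) threshold."""
    return overlap_fraction(box, zone) >= threshold
```

The later loiter-time and mask-consistency stages then decide whether a momentary zone entry becomes an actual alarm, which is what keeps this cheap geometric test from generating alert fatigue on its own.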
-
2026-03-31 — AWS Architecture Blog, Streamlining access to powerful disaster recovery capabilities of AWS (co-authored with Arpio, AWS Resilience Competency Partner). Survey-style layered DR architecture: data protection via AWS Backup → compute DR via AWS DRS → whole-workload recovery via partner orchestration. Canonical wiki reference for: (1) the resilience dimension of Shared Responsibility — AWS provides primitives, customer owns orchestration / testing / config translation; (2) cross-Region vs cross-account as orthogonal isolation axes (cross-Region = fault isolation; cross-account = clean-room recovery for ransomware); (3) AWS Backup as unified backup control plane — vaults, policies, schedules; closed native-service gaps for EFS, FSx, and cross-Region backup for DynamoDB; (4) AWS DRS quantified numbers: "crash-consistent RPO of seconds, RTO typically 5–20 min" via continuous block-level replication (concepts/crash-consistent-replication); (5) the DR configuration translation problem — restored resources get new endpoints; canonical mechanism is a Route 53 private hosted zone CNAME mapping old-endpoint → new-endpoint in the recovered VPC so applications don't need config rewrite + redeploy on failover; (6) least-privilege cross-account DR agent pattern (Arpio's IAM role explicitly denied from mutating source OR reading/exfiltrating data). Partner-post classification — ~40% body is Arpio product positioning; architectural signal concentrated in the AWS Backup / AWS DRS / Route 53 CNAME / shared-responsibility sections. Only quantified numbers: seconds RPO, 5–20 min RTO (DRS), >140 AWS resource types (Arpio coverage). No multi- Region active-active discussion, no cost numbers, no DR testing/drill cadence, no cross-partition axis (covered separately by Sovereign Failover). Companion to the Generali EKS Auto Mode customer-case study in the partner/customer marketing-leaning subset of AWS Architecture Blog. →
sources/2026-03-31-aws-streamlining-access-to-dr-capabilities
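The config-translation mechanism in (5) above reduces to one UPSERT in the recovered VPC's private hosted zone. A minimal sketch of the ChangeBatch payload — endpoint names are hypothetical, and in practice the dict is passed to Route 53's `change_resource_record_sets`:

```python
def failover_cname_change(old_endpoint: str, new_endpoint: str, ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch that points the pre-disaster endpoint name
    at the newly restored resource, so applications keep their old config."""
    return {
        "Comment": "DR failover: map old endpoint to recovered resource",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": old_endpoint,      # the name apps already resolve
                "Type": "CNAME",
                "TTL": ttl,                # short TTL keeps failback fast
                "ResourceRecords": [{"Value": new_endpoint}],
            },
        }],
    }

batch = failover_cname_change(
    "db.prod.internal",
    "restored-db.abc123.us-east-1.rds.amazonaws.com",
)
```

With boto3 this would be submitted via `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)` against the private hosted zone associated with the recovery VPC.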
- 2026-03-23 — AWS Architecture Blog, How Generali Malaysia optimizes operations with Amazon EKS. Customer-case-study for Generali Malaysia's adoption of EKS Auto Mode (AWS-managed K8s data plane on Bottlerocket nodes with weekly AMI replacement, default-add-on upgrades, cluster-version upgrades). Canonical wiki reference for the peer-AWS-service integration surface of EKS — six managed services wired into one cluster: GuardDuty (EKS protection + runtime monitoring, MITRE ATT&CK-annotated findings), Inspector (ECR-image-to-running-container vulnerability prioritisation), Network Firewall (SNI-based egress allow-list with the private → firewall-public → NAT-protected topology), Secrets Manager + External Secrets Operator (env-var-only secret injection, no volume mounts — aligns with stateless-only discipline), Amazon Managed Grafana (per-namespace dashboards with CloudWatch data source), AWS Billing split cost allocation data for EKS (cluster / namespace / deployment / node native cost-allocation tags). Compound K8s operating discipline stated as platform-wide rules: stateless-only micro-services + immutable pods + Helm charts as standardised deployment mechanism + HPA traffic-driven auto-scaling. Customer-retained safety contract under Auto Mode's platform-driven node churn: Pod Disruption Budgets + Node Disruption Budgets + off-peak maintenance windows. AWS Well-Architected Framework is the explicit organising structure of the post; shared-responsibility model is extended into the K8s data plane. No quantified outcomes published (cluster sizes / cost deltas / MTTR numbers all absent); value is in the integration topology.
Introduces systems/eks-auto-mode, systems/amazon-guardduty, systems/amazon-inspector, systems/aws-network-firewall, systems/external-secrets-operator, systems/amazon-managed-grafana, systems/bottlerocket, systems/generali-malaysia-eks; concepts/well-architected-framework, concepts/shared-responsibility-model, concepts/pod-disruption-budget, concepts/egress-sni-filtering; patterns/runtime-vulnerability-prioritization, patterns/eks-cost-allocation-tags, patterns/disruption-budget-guarded-upgrades; extends systems/aws-eks (Auto Mode role), systems/kubernetes (compound stateless/immutable/Helm/HPA discipline), systems/aws-secrets-manager (EKS-native source- of-record role), systems/helm (platform-wide packaging standard), concepts/managed-data-plane (Kubernetes-layer instance), concepts/observability (namespace-per-tenant CloudWatch→AMG shape), concepts/stateless-compute (enterprise- wide platform discipline), companies/aws (Tier 1, 2026-03-23).
- 2026-03-18 — AWS Architecture Blog, AI-powered event response for Amazon EKS. Product-launch post for AWS DevOps Agent (systems/aws-devops-agent), a Bedrock-hosted autonomous AI agent for EKS incident investigation — AWS-managed-service peer to Datadog's Bits AI SRE. Canonical wiki reference for telemetry-based resource discovery: two-path K8s discovery (Kubernetes API for static resource state + OpenTelemetry for runtime relationships — service-mesh traffic, distributed traces, metric attribution) fused into a unified dependency graph the agent reasons over. Four data sources per Agent Space: Managed Prometheus (metrics), Amazon CloudWatch Logs (logs), X-Ray (traces), EKS topology (K8s API). Investigation workflow: scenario-template trigger → data collection → ML/statistical pattern analysis against a learned baseline → confidence-scored root-cause ranking → mitigation recommendations. Separate Prevention surface runs weekly (~15h compute budget) over past investigations for code / observability / infrastructure / governance recommendations. Tutorial-heavy post — kubectl port-forwards + Python traffic-generator + Agent-Space-creation UI walkthrough; 1,806 discovered resources in one demo topology view is the only quantitative number; no eval / correctness / latency / cost numbers disclosed. Preview-product status (demo screenshot shows empty Prevention recommendations).
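The two-path discovery shape above (Kubernetes API for static resource state, OpenTelemetry for runtime call edges) can be sketched as a simple graph fusion — resource names and edges here are hypothetical stand-ins:

```python
def build_dependency_graph(k8s_resources, otel_edges):
    """Fuse static K8s inventory (nodes) with runtime call relationships
    observed in traces (edges) into one adjacency map an agent can walk."""
    graph = {name: set() for name in k8s_resources}   # static inventory from the K8s API
    for caller, callee in otel_edges:                 # runtime relationships from OTel
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())               # callee may be missing from static inventory
    return graph

g = build_dependency_graph(
    ["frontend", "cart", "payments"],                 # hypothetical services
    [("frontend", "cart"), ("cart", "payments"), ("frontend", "payments")],
)
```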
- 2026-02-26 — AWS Architecture Blog, Digital transformation at Santander: How platform engineering is revolutionizing cloud infrastructure (cowritten with Julio Bando, Santander F1RST). Canonical production reference for platform-engineering at large-enterprise regulated-industry scale: Santander is a global bank (>10 countries, 160M+ customers, 200+ critical systems, billions of daily transactions). Pre-platform, provisioning new infrastructure took up to 90 days and routinely deviated from architecture standards. Solution: Catalyst (systems/santander-catalyst), an in-house internal developer platform co-built with AWS Professional Services through the Platform Strategy Program (PSP). Two load-bearing layers: an in-house developer portal (patterns/developer-portal-as-interface) as the unified self-service surface, and a single EKS control plane cluster hosting three sub-components — data-plane claims managed by ArgoCD for GitOps continuous sync of application stacks; policies catalog using Open Policy Agent (Gatekeeper) as a central repository of compliance + security policies (patterns/policy-gate-on-provisioning, the regulated-industry K8s-admission-time counterpart to SCPs in ProGlove's AWS-Organizations-native shape); and stacks catalog of Crossplane [Composite Resource Definitions
+
Compositions](<../patterns/crossplane-composition.md>) used "as a universal resource provisioner" (concepts/universal-resource-provisioning) "to manage resources across multiple cloud providers consistently and declaratively." This is the first wiki instance of concepts/control-plane-data-plane-separation applied at infrastructure-provisioning tier: the EKS cluster decides, the provisioned AWS (and multi-cloud) resources are the data plane. Reported outcomes: full provisioning cycle 90 days → hours (best case: minutes); standard provisioning 30 days → 2 days; proof-of-concept preparation 90 days → 1 hour; 100+ pipelines consolidated into one control plane; generative AI agent stack implementation 105 days → 24 hours, eliminating "dozens of provisioning tickets per environment". Three workloads cited as evidence Catalyst is a universal platform: (1) generative AI agents stack, the first success case; (2) modern data platform with built-in Databricks integration + data lakes + automated ETL + centralized data catalog + segregated experimentation environments — ~3,000 monthly data-experimentation provisioning tickets eliminated; (3) cloud process orchestration migrating legacy workflows to AWS Step Functions + retry patterns + error handling + centralized process monitoring. Catalyst + ProGlove Insight pin the patterns/platform-engineering-investment spectrum at both ends (regulated bank with K8s-native substrate vs SaaS multi-tenant with AWS-native substrate; same pattern, different substrate). Cultural outcome framed as co-equal with the technical outcome: "Catalyst also catalyzed a cultural change within Santander, promoting an automation and self-service mindset among development teams." 
Marketing-leaning AWS-Architecture-Blog format — architectural shape stated in full (EKS + Crossplane + ArgoCD + OPA + portal + XRDs) with strong quantified outcomes but no p50/p99 distribution shape, no EKS cluster sizing, no Crossplane Composition examples, no OPA policy examples, no incident retrospective, no post-PSP in-house team size. Introduces systems/santander-catalyst, systems/crossplane, systems/argocd, systems/open-policy-agent, systems/databricks; concepts/universal-resource-provisioning, concepts/gitops; patterns/developer-portal-as-interface, patterns/crossplane-composition, patterns/policy-gate-on-provisioning; extends systems/aws-eks (new role as infrastructure control plane beyond app compute), systems/aws-step-functions (legacy-workflow modernization target), patterns/platform-engineering-investment (second canonical production instance), concepts/policy-as-data (OPA/Rego as the third wiki realization alongside Cedar/AVP and AWS SCPs), concepts/control-plane-data-plane-separation (infrastructure-provisioning tier as new layer), patterns/golden-path-with-escapes (Catalyst's stacks catalog as the multi-cloud-infrastructure-level sibling to Figma's K8s-service-def instance). → sources/2026-02-26-aws-santander-catalyst-platform-engineering
- 2026-02-25 — AWS Architecture Blog, 6,000 AWS accounts, three people, one platform: Lessons learned (cowritten with Julius Blank, ProGlove). Canonical production reference for account-per-tenant isolation on AWS at SaaS-tenant scale. ProGlove's Insight platform (systems/proglove-insight): ~6,000 production AWS accounts, 3-person platform team, >120,000 deployed service instances, ~1,000,000 Lambda functions in production. The AWS account boundary is the sole structural isolation mechanism — no shared compute / storage / network / IAM across tenants — which delivers blast-radius containment, a simplified developer mental model ("a deployed service instance always belongs to exactly one tenant"), per-tenant customization, and transparent cost attribution via Cost Explorer on linked accounts. Well-Architected review simplification — many Operational Excellence / Security pillar items "didn't even apply" because isolation is structural rather than implemented in code. Cost framing: services that scale linearly with account count (smallest EC2 ≈ $3/mo → $3,000/mo at 1,000 accounts) must be avoided; Lambda + DynamoDB scale-to-zero is what makes the architecture economically viable. Deployment: single monorepo → single CodePipeline execution → single StackSet update op → parallel fan-out to all tenant accounts from a central Infrastructure account (patterns/fan-out-stackset-deployment). Named failure modes: partial rollouts (retry/rollback must be defined and tested), pipeline duration (large-scale updates take significant time to propagate), tooling maturity (StackSets "powerful but still evolving"). Account lifecycle asymmetry: creation is fully automated via Step Functions, retirement is manual scripts run regularly — architectural signal that the criterion for automation is overhead introduced, not dogma (patterns/automate-account-lifecycle). Baseline guardrails = SCPs + AWS Organizations + strict IAM management.
Observability: third-party aggregation application forwarded from all tenant accounts, with multi-alerts defined once and applied across tenant accounts individually (patterns/central-telemetry-aggregation); engineers see a single pane of glass while telemetry originates per-account. Key discipline: don't replicate per-account alarms blindly (use streaming/aggregation), tag everything including source account ID (consider Organizations tag policies), CloudWatch OAM called out as the AWS-native primitive that has since shipped. Per-account quotas become a distributed-monitoring problem — Lambda concurrent-execution quota named canonical case (concepts/per-account-quotas); a "single pane of glass quota tracker" is essential. The three-person team only works because of deliberate platform-engineering investment — complexity shifted from application code to platform tooling; "the team size stays constant, and efficiency grows with every account added." Candid on gaps: "multi-account strategies are common at the enterprise level, adopting them at the SaaS tenant level is less common. Patterns, tooling, and reference architectures are still evolving, which means building custom solutions becomes necessary." No latency / throughput numbers, no incident retrospective; positioned as prescriptive-retrospective co-authored post.
Introduces concepts/account-per-tenant-isolation, concepts/blast-radius, concepts/per-account-quotas, concepts/service-control-policy; patterns/fan-out-stackset-deployment, patterns/automate-account-lifecycle, patterns/central-telemetry-aggregation, patterns/platform-engineering-investment; systems/aws-stacksets, systems/aws-codepipeline, systems/aws-step-functions, systems/aws-cost-explorer, systems/aws-observability-access-manager, systems/aws-cloudformation, systems/proglove-insight; extends systems/aws-organizations, systems/aws-lambda, systems/dynamodb, systems/aws-iam, concepts/tenant-isolation (the architectural opposite of Convera's in-account multi-layer shape). → sources/2026-02-25-aws-6000-accounts-three-people-one-platform
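The cost constraint above is worth making explicit as arithmetic — any per-account cost floor multiplies by the fleet size (the $3/mo smallest-EC2 figure is the post's own example):

```python
def fleet_monthly_cost(per_account_floor: float, accounts: int) -> float:
    """Per-account fixed costs scale linearly with the account fleet — the
    reason scale-to-zero services (Lambda, DynamoDB) are what makes
    account-per-tenant economically viable."""
    return per_account_floor * accounts

assert fleet_monthly_cost(3.0, 1000) == 3000.0   # the post's smallest-EC2 example
assert fleet_monthly_cost(0.0, 6000) == 0.0      # scale-to-zero: idle tenants cost ~nothing
```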
- 2026-02-05 — AWS Architecture Blog, How Convera built fine-grained API authorization with Amazon Verified Permissions. Cross-border-payments platform rolling out Amazon Verified Permissions (managed Cedar engine) across four authorization flows on a single shared Lambda-authorizer + API Gateway shape: (1) customer UI + API, (2) internal customer-service apps federated from Okta through Cognito, (3) service-to-service machine-to-machine (patterns/machine-to-machine-authz via Cognito client-credentials), and (4) multi-tenant SaaS via per-tenant policy stores with DynamoDB
`tenant_id` → `policy-store-id` mapping + backend zero-trust re-verification + RDS-side tenant-context enforcement as the last line of defense. Attribute sourcing via pre-token-generation Lambda hook (concepts/token-enrichment) reads roles from RDS (customer flow) or attributes from DynamoDB (internal / multi-tenant flows) and signs them into the Cognito access token; downstream authorizer evaluates Cedar policies against JWT claims only — no second round-trip. concepts/policy-as-data governance: Cedar policies live in DynamoDB (source of truth) + DynamoDB Streams sync pipeline continuously propagates changes into AVP; authorship gated by a strictly-regulated IAM role owned by infosec. Submillisecond end-to-end latency is a product of a two-level cache — API Gateway (authorizer-decision cache) plus app-level Cognito token cache — not AVP alone (AVP is "millisecond-level"). Reported outcomes: thousands of authorization requests per second, submillisecond latency, ~60% reduction in time spent on access-management tasks. Subtle correctness property called out: the same Cedar policy must be evaluated at both the UI level (to gate affordance visibility) and the API level (to gate enforcement) — skipping the API-side check on the assumption that the UI is the only client is the canonical fine-grained-auth anti-pattern. Marketing-leaning AWS Architecture Blog format — architectural signal is dense (the policy-store-per-tenant tradeoff enumeration, the zero-trust re-verification step, the DynamoDB-Streams policy sync, the access-token-vs-ID-token distinction, the one-shape-four-flows reuse) but with no latency distribution, no cost baseline, no Cedar policy volume, no incident postmortem, and no discussion of policy-store resource-quota limits.
Introduces systems/amazon-verified-permissions, systems/cedar, systems/amazon-cognito, systems/amazon-api-gateway, systems/okta; concepts/fine-grained-authorization, concepts/attribute-based-access-control, concepts/policy-as-data, concepts/tenant-isolation, concepts/zero-trust-authorization, concepts/authorization-decision-caching, concepts/token-enrichment; patterns/lambda-authorizer, patterns/per-tenant-policy-store, patterns/pre-token-generation-hook, patterns/zero-trust-re-verification, patterns/machine-to-machine-authz; extends systems/aws-iam, systems/aws-lambda, systems/dynamodb, systems/aws-rds, systems/aws-eks, systems/aws-policy-interpreter, concepts/least-privileged-access. → sources/2026-02-05-aws-convera-verified-permissions-fine-grained-authorization
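A minimal sketch of the pre-token-generation hook shape described above, assuming the Cognito V1 trigger contract (`claimsOverrideDetails`); the role lookup is a hypothetical stand-in for Convera's RDS/DynamoDB attribute reads:

```python
def lookup_roles(user_sub: str) -> list:
    # Hypothetical stand-in for the RDS (customer) / DynamoDB (internal) lookup.
    return {"user-123": ["payments:viewer", "fx:trader"]}.get(user_sub, [])

def pre_token_generation(event, context=None):
    """Cognito pre-token-generation hook: sign authz attributes into the
    access token so the downstream Lambda authorizer can evaluate Cedar
    policies against JWT claims alone — no second round-trip."""
    sub = event["request"]["userAttributes"]["sub"]
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {"roles": ",".join(lookup_roles(sub))}
    }
    return event

evt = pre_token_generation(
    {"request": {"userAttributes": {"sub": "user-123"}}, "response": {}}
)
```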
- 2026-02-04 — AWS Architecture Blog, Mastering millisecond latency and millions of events: The event-driven architecture behind the Amazon Key Suite. Amazon Key team's retrospective on modernising their access-management platform from a tightly-coupled monolithic design (+ ad-hoc SNS/SQS pairs) to EventBridge-centric event-driven architecture. Core organisational move is patterns/single-bus-multi-account (AWS reference pattern): central DevOps-owned event bus + rules + targets, per-service-team accounts owning application stacks, logical separation via rules. Three custom components built on top of EventBridge to close its gaps: a custom schema repository (JSON-Schema Draft-04; versioned; build-time code bindings; chosen because EventBridge has a schema registry but no native validation — "EventBridge provides developers with tools to implement validation using external solutions or custom application code, it currently does not include native schema validation capabilities"); a client library with client-side validation (evaluated vs centralized validation service + rejected on extra network hop + own-scaling overhead) handling code bindings + pre-publish validation + serde + publish/subscribe abstractions; and a CDK subscriber constructs library (patterns/reusable-subscriber-constructs) provisioning per-subscriber event bus + cross-account IAM + monitoring + alerting from ~5 lines. Named pre-migration failure modes: cross-service cascade deadlocks ("an issue in Service-A triggered a cascade of failures across many upstream services, with increased timeouts leading to retry attempts and ultimately resulting in service deadlocks") + single-device-vendor fleet-wide blast radius + loose schemas blocking safe schema evolution + ad-hoc SNS/SQS pairs without standardisation.
Reported post-migration numbers: 2,000 events/s, 99.99% success rate, 80ms p90 ingestion→target invocation across 14M subscriber calls, integration time for new use cases 5d → 1d (80% improvement), new-event onboarding 48h → 4h, publisher/subscriber integration 40h → 8h, standardized client library addressed 90% of common integration errors, 100% single-control-plane governance of event-bus infrastructure, 100% automated unauthorized-data-exchange detection. Marketing-leaning AWS-Architecture-Blog format — architectural-signal dense (ownership split + schema-repository vs registry distinction + CDK construct shape + the specific failure modes that motivated the move) but no comparison baselines / distribution shapes / cost breakdown / DLQ-and-poison-pill design. Introduces systems/amazon-eventbridge, systems/amazon-key, systems/aws-cdk; concepts/event-driven-architecture, concepts/service-coupling, concepts/schema-registry; patterns/single-bus-multi-account, patterns/client-side-schema-validation, patterns/reusable-subscriber-constructs; extends systems/aws-sns, systems/aws-sqs. → sources/2026-02-04-aws-amazon-key-eventbridge-event-driven-architecture
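The client-library role above (pre-publish validation, since EventBridge's schema registry does no native validation) can be sketched with a hand-rolled check — a real implementation would evaluate the team's versioned JSON-Schema Draft-04 documents:

```python
def validate_event(event: dict, schema: dict) -> list:
    """Minimal pre-publish check: required fields present and type-correct.
    Stands in for full JSON-Schema Draft-04 validation in the client library."""
    errors = []
    for field in schema.get("required", []):
        if field not in event:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in event and not isinstance(event[field], spec["type"]):
            errors.append(f"wrong type for {field}")
    return errors

LOCK_EVENT_SCHEMA = {  # hypothetical event schema, not Amazon Key's
    "required": ["detail_type", "device_id"],
    "properties": {"detail_type": {"type": str}, "device_id": {"type": str}},
}
assert validate_event({"detail_type": "LockEvent", "device_id": "d-1"}, LOCK_EVENT_SCHEMA) == []
```

Rejecting the event client-side before `put_events` avoids the extra network hop that ruled out a centralized validation service.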
- 2026-01-30 — AWS Architecture Blog, Sovereign failover: Design for digital sovereignty using the AWS European Sovereign Cloud. Architectural companion to the skipped 2026-01-16 AWS European Sovereign Cloud GA launch announcement. Codifies cross-partition failover as the response to human-driven disasters (regulatory / geopolitical / sovereignty shifts that regional redundancy inside a single partition cannot address). Names the four AWS partitions (standard
`aws` / GovCloud `aws-us-gov` since 2011 / AWS China `aws-cn` / European Sovereign Cloud `aws-eusc` since 2026) and the three hard-boundary consequences: IAM credentials don't carry, S3 Cross-Region Replication / Transit Gateway inter-region peering / other cross-region primitives don't work across partitions, and service availability differs per partition. Applies the canonical backup / pilot-light / warm-standby / multi-site active-active DR ladder to the partition axis, with pilot-light the recommended cross-partition default ("only built up when needed"). Names exactly three cross-partition network-connectivity options: internet-over-TLS, IPsec Site-to-Site VPN, and Direct Connect gateway / PoP-to-PoP partner connections. Enumerates five cross-partition auth tactics (IAM roles with trust + external IDs, STS regional endpoints, resource-based policies, cross-account roles via Organizations, and federation from a centralized IdP — called out as modern best practice: patterns/centralized-identity-federation); IAM-user fallback uses Secrets Manager + Lambda + backup-user-for-availability. Introduces "double-signed certificates" — per-partition Private CA root CAs cross-sign each other to enable cross-partition authenticated mTLS while preserving partition isolation; operational complexity (cross-signing agreements, trust-store management, validation / revocation, audit trails) is named. Prescribed Organizations topology: completely separate Organization mandatory for European Sovereign Cloud; paired-optional for GovCloud (but separate still recommended if sovereign-standalone is the goal). Per-partition isolated Transit Gateways / separate Route 53 zones / PrivateLink for secure cross-partition communication; per-partition Config aggregators and Security Hub instances; Control Tower manages commercial side but cannot directly manage GovCloud or European Sovereign Cloud accounts.
Vendor-independence against geopolitical risk framed as cheaper via cross-partition (IaC reuse) than cross-cloud. Design-pattern article; no RTO / RPO / cost / latency numbers, no data-synchronization recipe (custom tooling only), no partition-internal architecture detail. Introduces concepts/aws-partition, concepts/digital-sovereignty, concepts/disaster-recovery-tiers, concepts/cross-partition-authentication, concepts/cross-signed-certificate-trust; patterns/cross-partition-failover, patterns/pilot-light-deployment, patterns/warm-standby-deployment, patterns/centralized-identity-federation; systems/aws-european-sovereign-cloud, systems/aws-govcloud, systems/aws-iam, systems/aws-sts, systems/aws-organizations, systems/aws-control-tower, systems/aws-direct-connect, systems/aws-transit-gateway, systems/aws-privatelink, systems/aws-config, systems/aws-security-hub, systems/aws-secrets-manager. → sources/2026-01-30-aws-sovereign-failover-design-digital-sovereignty
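The hard partition boundary shows up concretely in ARNs — IaC reused across partitions must never hardcode `arn:aws:`. A sketch of partition-aware ARN construction (the `eusc-` region prefix is an assumption for illustration, and the prefix table is not an exhaustive region list):

```python
# Region-prefix → partition mapping for the four partitions the post names.
PARTITIONS = {
    "us-gov-": "aws-us-gov",   # GovCloud
    "cn-": "aws-cn",           # AWS China
    "eusc-": "aws-eusc",       # European Sovereign Cloud (assumed prefix)
}

def partition_for(region: str) -> str:
    for prefix, partition in PARTITIONS.items():
        if region.startswith(prefix):
            return partition
    return "aws"  # standard commercial partition

def build_arn(service: str, region: str, account: str, resource: str) -> str:
    """ARNs embed the partition, so cross-partition IaC must derive it
    from the target region rather than hardcode 'arn:aws:'."""
    return f"arn:{partition_for(region)}:{service}:{region}:{account}:{resource}"
```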
- 2026-01-12 — AWS Architecture Blog, How Salesforce migrated from Cluster Autoscaler to Karpenter across their fleet of 1,000 EKS clusters. Customer-case-study of Salesforce's mid-2025 → early-2026 migration of 1,000+ EKS clusters / 1,180+ node pools / thousands of internal tenants from the Kubernetes Cluster Autoscaler + Auto Scaling groups to Karpenter. Canonical wiki reference for Karpenter at extreme scale. Motivations: thousands of rigid node groups, multi-minute scaling latency, poor AZ balance
+
memory-workload performance bottlenecks, inefficient bin-packing with stranded capacity. Tooling: in-house Karpenter transition tool with three first-class design principles — zero-disruption (PDB-respecting drain), rollback-capable (reverse-transition to ASG first-class), CI/CD-integrated; plus a Karpenter patching check tool for AMI validation. Config translation: automated mapping from ASG fields (instance type, root-volume size/IOPS/type/throughput, node labels) to Karpenter's
`NodePool`/`EC2NodeClass` across all 1,180+ pools. Rollout: phased with soak times under risk-based sequencing (low-risk environments first, prod last). Five operational lessons — each a generalisable principle: (1) PDB hygiene as governance — overly restrictive / misconfigured PDBs block node replacement; fix with audit + app-owner partnership + OPA admission-time validation; (2) sequential node cordoning with verification checkpoints beats parallel — parallel destabilised clusters; (3) [[concepts/kubernetes-label-length-limit|63-character label limit]] is a migration-blocker that hides in human-friendly naming conventions (`analytics-bigdata-spark-executor-pool-m6a-32xlarge-az-a-b-c` = 67 chars); (4) singleton protection under bin-packing consolidation via guaranteed-pod-lifetime + workload-aware disruption policies (PDBs structurally can't protect 1-replica pods); (5) 1:1 ephemeral-storage translation — not defaulting — required for I/O-intensive workloads. Outcomes: scaling latency minutes → seconds; 80% manual-ops reduction; 5% FY2026 cost savings with another 5-10% projected FY2027; eliminated thousands of node groups; heterogeneous GPU / ARM / x86 in a single `NodePool`; improved IP efficiency via subnet-decoupled provisioning; true self-service infrastructure via developer-authored `NodePool` CRDs. Industry context: Datadog reports +22% Karpenter-provisioned node share in the last 2 years.
Introduces systems/salesforce, systems/cluster-autoscaler, systems/aws-auto-scaling-groups; concepts/scaling-latency, concepts/singleton-workload, concepts/availability-zone-balance, concepts/ip-address-fragmentation, concepts/kubernetes-label-length-limit, concepts/self-service-infrastructure; patterns/automated-configuration-mapping, patterns/phased-migration-with-soak-times, patterns/rollback-capable-migration-tool, patterns/sequential-node-cordoning, patterns/risk-based-sequencing; extends systems/karpenter (canonical largest-scale production reference), systems/aws-eks (1,000-cluster column in the role-axis table), systems/open-policy-agent (OPA as operational-correctness enforcement, not just security), concepts/bin-packing (solver-side at production scale), concepts/pod-disruption-budget (OPA-enforced admission governance pattern), patterns/disruption-budget-guarded-upgrades (customer-managed-autoscaler variant alongside the Generali managed-data-plane variant). → sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters
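Lesson (3) above is mechanically checkable before migration. A minimal pre-flight validator for the 63-character Kubernetes label limit (the long pool name below is an illustrative variant in the post's naming style, not its exact label):

```python
MAX_LABEL = 63  # Kubernetes limit on label-value length

def migration_blocking_labels(node_labels: dict) -> list:
    """Flag label values over Kubernetes' 63-character limit — the hidden
    migration blocker: human-friendly pool names composed of team /
    workload / instance-type / AZ fragments quietly exceed it."""
    return [key for key, value in node_labels.items() if len(value) > MAX_LABEL]

labels = {  # hypothetical labels in the post's naming style
    "pool": "analytics-bigdata-spark-executor-pool-m6a-32xlarge-az-a-b-c-extra",
    "team": "analytics",
}
```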
- 2025-12-11 — AWS Architecture Blog, Architecting conversational observability for cloud applications. Reference architecture for a generative-AI-powered Kubernetes troubleshooting assistant on EKS, with a companion GitHub sample. Canonical wiki companion to the later 2026-03-18 AWS DevOps Agent product launch — this is the self-build blueprint, that is the AWS-managed-service shape, of the same problem. Core architectural content: (1) two deployment options selected by a single Terraform variable — a RAG-based chatbot (default) and a Strands Agents + MCP variant; (2) telemetry-to-RAG pipeline: Fluent Bit DaemonSet → Kinesis Data Streams buffer → Lambda normalize + batch-embed via Bedrock's
`amazon.titan-embed-text-v2:0` → OpenSearch Serverless k-NN index (hot tier); the Strands variant stores 1024-dim embeddings in S3 Vectors instead (cold tier); (3) an in-cluster troubleshooting assistant pod with a read-only RBAC service account and a static kubectl allowlist — canonical read-only agent action allowlisting; (4) an iterative agentic troubleshooting loop — retrieve → LLM proposes kubectl → assistant runs → output back to LLM → LLM decides continue or conclude — combining historical telemetry + real-time cluster state in one context; (5) the agentic variant uses three specialized agents (Agent Orchestrator + Memory Agent + K8s Specialist — patterns/specialized-agent-decomposition) calling EKS MCP Server over MCP; Slack bot as the UI; Pod Identity for AWS service access; (6) security discipline — sanitize logs before embedding (vectors inherit source data governance), KMS encryption Kinesis-in-transit + OpenSearch-at-rest, private subnets + VPC endpoints + prompt-injection input validation. MTTR framing cites 2024 Observability Pulse Report: 48% of orgs name team-knowledge gaps as their biggest observability challenge, 82% say issue resolution takes >1h. Explicit "Pro tip": Lambda should batch Kinesis consumption + embedding generation +
OpenSearch writes for cost. No evaluation / MTTR-delta / cost / prompt-injection-resistance numbers disclosed — architecture + sample repo only. Post asserts the approach extends to ECS and Lambda but only EKS is demonstrated. Introduces systems/strands-agents-sdk, systems/eks-mcp-server, systems/fluent-bit, systems/amazon-kinesis-data-streams; concepts/agentic-troubleshooting-loop; patterns/allowlisted-read-only-agent-actions, patterns/telemetry-to-rag-pipeline; extends systems/aws-eks (AI-troubleshooting target — self-build variant), systems/amazon-bedrock (embedding model + LLM substrate), systems/amazon-opensearch-service (hot-tier telemetry vector store), systems/s3-vectors (cold-tier telemetry vector store in Strands variant), systems/amazon-titan-embeddings (telemetry domain), systems/aws-lambda (telemetry-to-RAG compute tier), systems/model-context-protocol (cluster-operations tool surface), systems/aws-devops-agent (managed-service sibling), concepts/observability (self-build shape under agent-assisted debugging layer), patterns/specialized-agent-decomposition (three-agent Kubernetes split). → sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications
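The closing "Pro tip" (batch Kinesis consumption + embedding generation + writes inside one Lambda invocation) is essentially a chunking discipline; a minimal sketch with an illustrative batch size:

```python
def batch_records(records: list, batch_size: int = 16):
    """Chunk decoded telemetry records so one Lambda invocation makes a few
    batched embedding / index-write calls instead of one call per record —
    the cost lever the post's 'Pro tip' points at. Batch size is illustrative."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

batches = list(batch_records([f"log-line-{n}" for n in range(40)], batch_size=16))
```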
- 2025-07-16 — AWS News Blog, Introducing Amazon S3 Vectors: First cloud storage with native vector support at scale (preview) (Channy Yun). Launches S3 Vectors as a first-class S3 data primitive for vector similarity indices. Resource model: vector bucket → vector index → vectors + metadata; launch limits 10,000 indexes/bucket × tens-of-millions vectors/index; Cosine or Euclidean distance (per index);
`float32` data; key-value metadata usable as query filters. Separate `s3vectors` API client (`put_vectors`, `query_vectors`, ...). SSE-S3 (default) or SSE-KMS encryption. Claims up to 90% TCO reduction vs DRAM/SSD vector-cluster storage; subsecond query performance at scale. Two integrations at launch: Bedrock Knowledge Bases selects an S3 vector bucket as the vector store for RAG apps (exposed in Bedrock console and SageMaker Unified Studio), and a one-click Advanced search export → Export to OpenSearch flow migrates a vector index to an OpenSearch Serverless k-NN collection — canonical patterns/cold-to-hot-vector-tiering (AWS's stated hot uses: "product recommendations or fraud detection"). Paved embedding path: Amazon Titan Text Embeddings V2 via `bedrock.invoke_model`. Launch regions: IAD, CMH, PDX, FRA, SYD. Internal index architecture (HNSW / IVF / hybrid), concrete latency numbers, filter-with-ANN semantics, and competitive recall comparisons are not disclosed. Marketing-leaning post; primary architectural signal is API shape + capacity ceiling + tiering story. Introduces concepts/vector-embedding, concepts/vector-similarity-search, and concepts/hybrid-vector-tiering to the wiki. → sources/2025-07-16-aws-amazon-s3-vectors-preview-launch
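The per-index distance choice (Cosine or Euclidean) is the one query-semantics knob the post discloses; from-scratch sketches of both metrics over plain float lists:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

assert cosine_distance([1.0, 0.0], [1.0, 0.0]) == 0.0
assert cosine_distance([1.0, 0.0], [0.0, 1.0]) == 1.0
assert euclidean_distance([0.0, 0.0], [3.0, 4.0]) == 5.0
```

Cosine is the usual choice for text embeddings (direction matters, magnitude doesn't); Euclidean when magnitude is meaningful.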
- 2025-05-03 — AWS Database Blog, Understanding transaction visibility in PostgreSQL clusters with read replicas (Sergey Melnik). AWS's response to Jepsen's 2025-04-29 report on Amazon RDS for PostgreSQL Multi-AZ cluster transaction-visibility behavior. Confirms Jepsen's empirical finding but re-situates the anomaly as inherent to community PostgreSQL (discussed on pgsql-hackers since 2013), not RDS-specific. Mechanism: Postgres's commit path writes the WAL commit record (durable) then asynchronously removes the xid from the in-memory
`ProcArray` (visible); two concurrent non-conflicting commits can flip `ProcArray` removal order relative to WAL LSN, admitting the Long Fork anomaly — a violation of concepts/snapshot-isolation's atomic-visibility property (concepts/visibility-order-vs-commit-order). Affects all Postgres isolation levels (Read Committed / Repeatable Read / Serializable) because all take snapshots via `ProcArray`. Absent in Single-AZ Postgres, systems/aurora-limitless, and systems/aurora-dsql (both replace `ProcArray` with time-based MVCC via Postgres-extension surgery — see patterns/postgres-extension-over-fork). Worked Alice-and-Bob illustration: primary says #1, replica says #2, commit log says #2 — both observers correct under SI on their own node, jointly incompatible under formal SI. Proposed upstream fix: Commit Sequence Numbers (CSN) — stamp a monotonic CSN on each commit and snapshot by watermark comparison; multi-patch effort presented at PGConf.EU 2024; AWS PostgreSQL Contributors Team (formed 2022) participating. Practical impact on end-user apps is low (most apps serialize via row conflicts / app-level ordering) but load-bearing against five enterprise capabilities: distributed-SQL consistent snapshots, read-write splitting, snapshot-then-WAL-replay data sync, PITR to LSN, and tuple-xid-to-logical-commit-time replacement. Also names a CPU-cost angle — `ProcArray` scanning is "a measurable fraction of CPU" at thousands of connections on large Postgres servers. AWS-recommended workarounds: never rely on implicit commit ordering, introduce explicit synchronization (shared counters, timestamps, database constraints). Vendor response to a third-party analysis; no throughput/latency/cost numbers beyond the CPU-fraction qualitative claim. → sources/2025-05-03-aws-postgresql-transaction-visibility-read-replicas
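The mechanism above can be miniaturized into a toy model (not Postgres internals): commit order is WAL LSN, visibility order is when the xid leaves a simulated `ProcArray`, and the two can interleave:

```python
# Two non-conflicting commits: T1 is durable first (lower WAL LSN) but
# becomes visible second, because its simulated ProcArray removal is delayed.
commits = [
    {"xid": "T1", "wal_lsn": 1, "visible_at": 20},  # durable first, visible second
    {"xid": "T2", "wal_lsn": 2, "visible_at": 10},  # durable second, visible first
]

def snapshot(at_time):
    """A snapshot sees exactly the transactions already removed from ProcArray."""
    return {c["xid"] for c in commits if c["visible_at"] <= at_time}

commit_log_order = [c["xid"] for c in sorted(commits, key=lambda c: c["wal_lsn"])]

# An observer snapshotting at t=15 sees T2 without T1, even though the WAL
# says T1 committed first — the visibility-order-vs-commit-order anomaly.
assert commit_log_order == ["T1", "T2"]
assert snapshot(15) == {"T2"}
```

The CSN fix amounts to snapshotting by a single monotonic commit counter instead of the visible set, so the two orders can no longer diverge.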
2025-01-18 — AWS Containers Blog, Migrating from AWS App Mesh to Amazon ECS Service Connect (also doubles as the AWS App Mesh discontinuation announcement). Canonical EOL-migration article: App Mesh closed to new customers 2024-09-24, fully discontinued 2026-09-30. Architectural comparison between App Mesh's four-tier Envoy-sidecar abstraction (Mesh / Virtual Service / Virtual Router / Virtual Node + self-managed systems/envoy sidecar per ECS Task) and Service Connect's flat Client/Server role model + single Cloud Map namespace + AWS-managed Service Connect Proxy. The load-bearing architectural shift is concepts/managed-data-plane — same Envoy, different operational contract. Explicit 5-feature delta: retry/outlier tuning (full vs timeouts-only), version-weighted routing (yes vs no), observability (DIY vs free CloudWatch), mTLS (yes vs not yet), cross-account mesh sharing (AWS RAM vs single-account only). EKS customers directed to Amazon VPC Lattice separately (not a sidecar mesh). Prescribes patterns/blue-green-service-mesh-migration with Route 53 weighted records / CloudFront continuous deployment / ALB multi-target-group — because an ECS Service cannot be in both meshes simultaneously and the two meshes have no cross-environment networking. No quantitative numbers (architecture + migration-pattern post, not retrospective). → sources/2025-01-18-aws-app-mesh-discontinuation-service-connect-migration
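The Route 53 weighted-record leg of the prescribed blue/green cutover can be sketched as follows (a hypothetical sketch: the record names, targets, and weights are placeholders; the resulting batch would be passed to boto3's Route 53 `change_resource_record_sets` call):

```python
def weighted_cutover_batch(record, old_target, new_target, new_weight):
    """Build a Route 53 ChangeBatch that shifts `new_weight`% of resolutions
    to the Service Connect environment, leaving the rest on App Mesh.
    Pass the result as ChangeBatch= to
    boto3.client("route53").change_resource_record_sets()."""
    def rrset(set_id, target, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record,
                "Type": "CNAME",
                "SetIdentifier": set_id,   # distinguishes the two weighted records
                "Weight": weight,          # relative share of DNS answers
                "TTL": 60,                 # short TTL so shifts take effect quickly
                "ResourceRecords": [{"Value": target}],
            },
        }
    return {
        "Comment": "blue/green App Mesh -> Service Connect shift",
        "Changes": [
            rrset("app-mesh", old_target, 100 - new_weight),
            rrset("service-connect", new_target, new_weight),
        ],
    }

# Start with a 10% canary shift toward the Service Connect environment.
batch = weighted_cutover_batch(
    "svc.example.internal",
    "mesh-alb.example.internal",
    "sc-alb.example.internal",
    new_weight=10,
)
```

Raising `new_weight` in steps (10 → 50 → 100) and then deleting the App Mesh record completes the cutover; this is one of the three traffic-shifting options the post lists, alongside CloudFront continuous deployment and ALB multi-target-group weighting.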
2024-12-04 — AWS News Blog, Prevent factual errors from LLM hallucinations with mathematically sound Automated Reasoning checks (preview) (Antje Barth). Preview launch of systems/bedrock-guardrails-automated-reasoning-checks in US West (Oregon): the first productized AWS neurosymbolic safeguard. First public disclosure of the end-to-end concepts/autoformalization pipeline (document upload → identify concepts → decompose units → translate to formal logic → validate → combine into logical model → review rules + typed variables in UI). Three-verdict validation output (Valid / Invalid / Mixed results) with structured suggestions (unstated-assumptions vs variable-assignments). Canonical regenerate-with-feedback Python snippet — the reasoner's rule descriptions are wrapped in <feedback> tags and re-prompted to the LLM (natural language, never the formal form). Variable-description-as-tuning-knob (is_full_time worked example). Inventories AWS's pre-LLM automated-reasoning portfolio across five service areas: storage, networking, virtualization, identity, cryptography. Positioned as complementary to prompt engineering / RAG / contextual grounding — the only major-cloud safeguard combining safety + privacy + truthfulness. The concrete substrate for the more-abstract 2026-02 Cook thesis piece. → sources/2024-12-04-aws-automated-reasoning-to-remove-llm-hallucinations
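The regenerate-with-feedback flow reads roughly like the sketch below (hedged stand-ins: `invoke_llm` and `check_answer` are hypothetical placeholders for the Bedrock model call and the Automated Reasoning guardrail check; the real request/response shapes are not reproduced here):

```python
def regenerate_with_feedback(prompt, invoke_llm, check_answer, max_rounds=3):
    """Re-prompt the LLM with the reasoner's natural-language rule
    descriptions wrapped in <feedback> tags until the verdict is Valid."""
    answer = invoke_llm(prompt)
    for _ in range(max_rounds):
        verdict, rules = check_answer(answer)  # "Valid" / "Invalid" / "Mixed results"
        if verdict == "Valid":
            return answer
        # Only the natural-language rule descriptions go back to the model,
        # never the formal-logic form.
        feedback = "\n".join(f"<feedback>{r}</feedback>" for r in rules)
        answer = invoke_llm(f"{prompt}\n{feedback}")
    return answer

# Stubbed round-trip: the first draft fails a rule, the re-prompted one passes.
def fake_llm(p):
    return "corrected" if "<feedback>" in p else "draft"

def fake_check(a):
    if a == "corrected":
        return ("Valid", [])
    return ("Invalid", ["employees must work >= 30 hours to be full-time"])

print(regenerate_with_feedback("Is Alice eligible?", fake_llm, fake_check))
# prints "corrected"
```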
2024-07-29 — AWS Open Source Blog, Amazon's Exabyte-Scale Migration from Apache Spark to Ray on Amazon EC2. Amazon Retail BDT's multi-year migration of their copy-on-write compactor off Spark (on EMR) onto a hand-crafted Ray application on EC2. Q1 2024: 1.5 EiB Parquet input, 4 EiB Arrow in-memory, >10k vCPU-years/quarter, 82% better cost efficiency per GiB vs Spark, 100% on-time delivery, Ray still trailing Spark on first-time reliability (99.15% vs 99.91%). Contributed The Flash Compactor to Ray's systems/deltacat. ~$120M/yr typical-EC2-customer-equivalent saving. Open-source extensions target Iceberg/Hudi/Delta. +24% additional win via Daft-on-Ray I/O. → sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2
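As background on what a copy-on-write compactor does (a toy sketch under stated assumptions, not BDT's implementation: `pk` as primary key and a `deleted` tombstone flag are illustrative):

```python
def compact(base_rows, deltas):
    """Copy-on-write compaction: merge ordered deltas into a NEW base,
    keeping the latest version of each primary key. The old base is left
    untouched, so in-flight readers keep a consistent snapshot until the
    new base is committed."""
    merged = {row["pk"]: row for row in base_rows}
    for d in deltas:                      # deltas applied oldest -> newest
        if d.get("deleted"):
            merged.pop(d["pk"], None)     # tombstone: drop the row
        else:
            merged[d["pk"]] = {k: v for k, v in d.items() if k != "deleted"}
    return [merged[pk] for pk in sorted(merged)]

base = [{"pk": 1, "qty": 5}, {"pk": 2, "qty": 7}]
deltas = [{"pk": 2, "qty": 9}, {"pk": 1, "deleted": True}, {"pk": 3, "qty": 1}]
print(compact(base, deltas))  # [{'pk': 2, 'qty': 9}, {'pk': 3, 'qty': 1}]
```

At BDT's scale the same merge-by-key shape runs as a distributed shuffle over exabytes of Parquet, which is why the generalist-vs-specialist framing (Spark vs a hand-crafted Ray application) dominates the retrospective.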