SYSTEM Cited by 9 sources
# AWS EKS (Elastic Kubernetes Service)
Amazon EKS (Elastic Kubernetes Service) is AWS's managed Kubernetes control plane — AWS runs the API server, etcd, and the core controllers; customers run worker nodes (on EC2 or Fargate) and their own workloads.
From a system-design perspective, EKS is the managed-control-plane equivalent of self-hosted Kubernetes. It exposes the same data-plane abstractions (Pods, Services, StatefulSets, Ingress, etc.), the same Kubernetes API / Helm / xDS ecosystem, and the same CNCF toolbox (systems/karpenter, systems/keda, systems/envoy, systems/kyverno, systems/cilium).
Stub page: written as a minimal viable page for the Figma ECS→EKS migration ingest. Expand it as EKS-internals sources are ingested.
## Contrast with ECS
Figma's 2024 migration post enumerates the EKS advantages that drove their ECS→EKS cutover:
- StatefulSets for stateful workloads: the Kubernetes primitive that gives pods stable network identities across restarts. ECS doesn't have this, so Figma had written custom container-startup code to dynamically update etcd cluster membership, which was "fragile and hard to maintain." StatefulSets are the standard way to run etcd on Kubernetes.
- Helm charts ecosystem — easy install / upgrade of OSS software (Figma specifically called out systems/temporal). On ECS, the equivalent required hand-porting each service to systems/terraform.
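- Graceful node cordon-and-drain. Cordoning a bad EC2 node on EKS marks it unschedulable, and draining it evicts its pods through the API server while respecting their shutdown hooks, so their controllers reschedule them onto healthy nodes. ECS on EC2 has no equivalent.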
- CNCF auto-scaling — systems/keda for pod-level (with custom metrics like SQS queue depth), systems/karpenter for node-level. ECS has some auto-scaling but the CNCF offerings are more flexible.
- Service-mesh availability: Istio (Envoy-based) is trivial to adopt on EKS. On ECS, equivalent functionality (custom filters, mTLS) would have to be built in-house.
- Vendor-agnostic ecosystem: Kubernetes's cross-vendor user base drives more external investment than ECS, which is AWS-only.
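The StatefulSet point is easiest to see in the membership string itself. StatefulSet pods get ordinal names (`<name>-0`, `<name>-1`, …), and a headless governing Service gives each pod a stable DNS record of the form `<pod>.<service>.<namespace>.svc.<cluster-domain>`. Because both survive restarts, etcd's static `--initial-cluster` value can be computed up front rather than patched in at container startup. The sketch below is illustrative and assumes a StatefulSet and headless Service both named `etcd` in namespace `default`, etcd's default peer port 2380, and plain `http` for brevity (a production cluster would use TLS peer URLs). It covers static bootstrap only: adding or replacing members in a running cluster still goes through etcd's membership API.

```python
"""Sketch: derive etcd's static --initial-cluster value from StatefulSet identities.

Assumed, illustrative names: StatefulSet "etcd", headless Service "etcd",
namespace "default". Not Figma's configuration.
"""


def statefulset_pod_names(statefulset: str, replicas: int) -> list[str]:
    # StatefulSet pods get stable ordinal names: <name>-0, <name>-1, ...
    return [f"{statefulset}-{i}" for i in range(replicas)]


def pod_dns_name(pod: str, headless_service: str, namespace: str,
                 cluster_domain: str = "cluster.local") -> str:
    # Stable per-pod DNS record provided by the headless governing Service.
    return f"{pod}.{headless_service}.{namespace}.svc.{cluster_domain}"


def etcd_initial_cluster(statefulset: str, replicas: int, headless_service: str,
                         namespace: str, peer_port: int = 2380) -> str:
    # --initial-cluster takes comma-separated name=peerURL pairs. Using the pod
    # name as the etcd member name keeps member identity tied to pod identity.
    members = []
    for pod in statefulset_pod_names(statefulset, replicas):
        host = pod_dns_name(pod, headless_service, namespace)
        members.append(f"{pod}=http://{host}:{peer_port}")
    return ",".join(members)


if __name__ == "__main__":
    print(etcd_initial_cluster("etcd", 3, "etcd", "default"))
```

Since the string depends only on the StatefulSet name, replica count, and namespace, every member computes the same value independently, which is what removes the need for startup-time membership rewriting.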
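The KEDA point (pod-level scaling on custom metrics such as SQS queue depth) comes down to a small calculation. KEDA feeds an external metric to the Horizontal Pod Autoscaler, and with an `AverageValue` target the HPA's rule `ceil(currentReplicas × current / target)` simplifies to `ceil(totalMetric / targetPerReplica)`. The function below is a minimal sketch of that simplified model. Its names and defaults are hypothetical, and it ignores HPA tolerance, stabilisation windows, and KEDA's own activation logic for scaling to and from zero.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 100) -> int:
    """Approximate replica count for queue-depth-driven scaling.

    Simplified HPA AverageValue model: enough replicas that each one handles
    at most `target_per_replica` queued messages, clamped to the given bounds.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, raw))
```

For example, a backlog of 23 messages at a target of 5 per replica yields 5 replicas, and a ceiling of 50 caps a 10,000-message backlog at 50. The ceiling is what keeps a sudden queue spike from requesting more pods than node-level scaling (Karpenter, in Figma's setup) can reasonably provide.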
## Seen in
- sources/2024-08-08-figma-migrated-onto-k8s-in-less-than-12-months — Figma's target platform. Three active EKS clusters per environment receive real traffic for every service — patterns/multi-cluster-active-active-redundancy reduces the blast radius of cluster-scoped incidents (like the CoreDNS destruction they describe) to ~1/3 of traffic.
- sources/2026-02-05-aws-convera-verified-permissions-fine-grained-authorization: EKS as the backend compute tier in Convera's multi-tenant SaaS flow. API Gateway forwards requests to Kubernetes pods with `tenant_id` in a custom header, and each pod re-validates with AVP against the tenant's policy store before building a tenant context and forwarding to RDS. EKS pods are the site of the second authorization check in Convera's zero-trust chain.
- sources/2026-02-26-aws-santander-catalyst-platform-engineering: EKS as the control-plane cluster of Santander Catalyst's internal developer platform, "the brain of the operation, orchestrating all components and workflows". One EKS cluster hosts three load-bearing sub-components: ArgoCD for data-plane claims (GitOps), OPA Gatekeeper for the policies catalog (patterns/policy-gate-on-provisioning), and Crossplane for the stacks catalog (patterns/crossplane-composition). Here EKS is used as an infrastructure control plane rather than an application compute tier, a fundamentally different role from Figma (app compute) or Convera (backend zero-trust tier). It is the canonical wiki instance of EKS as the substrate for a multi-cloud internal developer platform.
- sources/2026-03-18-aws-ai-powered-event-response-for-amazon-eks — EKS as the investigation target of AWS DevOps Agent. One Agent Space per EKS cluster; the agent combines a Kubernetes API resource scan (the graph nodes: Pods / Deployments / Services / ConfigMaps / Ingress / NetworkPolicies with their metadata, resource specs, and health checks) with OpenTelemetry-derived runtime relationships (the graph edges: service-mesh traffic, distributed traces, metric attribution) into a unified dependency graph used for root-cause analysis. See concepts/telemetry-based-resource-discovery for the discovery methodology and systems/aws-devops-agent for the full investigation workflow.
- sources/2026-03-23-aws-generali-malaysia-eks-auto-mode: EKS in its Auto Mode variant at Generali Malaysia, where AWS also operates the K8s data plane (Bottlerocket AMI on a weekly-replacement cadence, default add-ons, cluster-version upgrades). Canonical wiki reference for the peer-AWS-service integration surface of EKS; the case study documents six managed services plugged into one cluster: GuardDuty (threat detection), Inspector (vulnerability scanning with ECR-to-running-container mapping), Network Firewall (SNI egress allow-list), Secrets Manager + External Secrets Operator (env-var secret injection, no volume mounts), Amazon Managed Grafana (per-namespace dashboards), and AWS Billing's split cost allocation data for EKS (patterns/eks-cost-allocation-tags). Compound operating discipline: stateless-only pods + immutable pods + Helm as standard packaging + HPA auto-scaling. Customer-retained safety contract under Auto Mode's platform-driven node churn: Pod Disruption Budgets + Node Disruption Budgets + an off-peak maintenance window. See systems/generali-malaysia-eks for the full platform synthesis.
- sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod: EKS as the Kubernetes control plane under SageMaker HyperPod inference and, more generally, as the packaging substrate for the EKS add-on primitive. The 2026-04-06 repackaging of the HyperPod Inference Operator from a Helm chart to a native EKS add-on is the canonical wiki instance of patterns/eks-add-on-as-lifecycle-packaging: four dependency add-ons bundled (cert-manager, S3 Mountpoint CSI, FSx CSI, metrics-server), four IAM roles scaffolded (execution, JumpStart gated, ALB controller, KEDA), and a migration script (`helm_to_addon.sh`) with auto-discovery, OVERWRITE install, and rollback semantics. It highlights the EKS add-on API (`aws eks create-addon --configuration-values`) as a managed-lifecycle packaging primitive that sits alongside Helm as a distribution path for AWS-authored K8s operators.
- sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters: EKS as the 1,000-cluster / 1,180-node-pool production platform at Salesforce, the largest documented EKS fleet in the wiki. Canonical wiki reference for EKS-at-extreme-scale operations. It covers the Karpenter migration off Cluster Autoscaler + ASGs using an in-house transition tool (zero-disruption, PDB-respecting drain, rollback to ASG, CI/CD-integrated), automated ASG → `NodePool`/`EC2NodeClass` config mapping over 1,180+ node pools, and five generalisable operational lessons: PDB hygiene with OPA-enforced admission; sequential node cordoning with verification checkpoints; the 63-character label limit (concepts/kubernetes-label-length-limit) as a migration blocker; singleton-workload protection under bin-packing consolidation; and 1:1 ephemeral-storage translation. Outcome metrics: scaling latency cut from minutes to seconds; 80% reduction in manual operations; 5% FY2026 cost savings (+5-10% projected for FY2027); thousands of node groups eliminated; heterogeneous GPU / ARM / x86 capacity in single node pools. Rollout: mid-2025 to early 2026, phased with soak times under risk-based sequencing.
- sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications: EKS as the investigation target in a self-built AI troubleshooting blueprint, a companion to the later AWS-managed DevOps Agent post. Same target, different vendor relationship: a customer-built RAG chatbot (Fluent Bit → Kinesis → Lambda + Bedrock embeddings → OpenSearch Serverless) or a Strands-based agent system (with the EKS MCP Server for cluster operations) investigates an EKS cluster through an in-cluster troubleshooting-assistant pod that runs with a read-only RBAC service account and a static `kubectl` allowlist (patterns/allowlisted-read-only-agent-actions). Stored telemetry (patterns/telemetry-to-rag-pipeline) combined with live `kubectl` output drives an iterative LLM ↔ cluster loop until the LLM judges it has enough context for a resolution. The post frames ECS and Lambda as equally valid fabrics for the same approach, though only EKS is demonstrated.
- sources/2026-04-27-aws-deloitte-optimizes-eks-environment-provisioning-with-vcluster: EKS as the shared host cluster for 50+ virtual Kubernetes clusters, a role not previously canonicalised on this page. Deloitte runs one EKS cluster with EKS Auto Mode as the host for vCluster-partitioned QA testing environments; each virtual cluster acts like an independent K8s environment while sharing the host's compute, controllers, and monitoring. Environment provisioning dropped from 45 min to <5 min (89% reduction), >50 vCPU and >200 GB RAM were saved at peak by not duplicating shared controllers, and EC2 Spot via Auto Mode yielded up to 70% additional savings. See patterns/shared-host-cluster-with-virtual-clusters for the topology and patterns/shared-alb-path-based-multi-cluster-routing for the companion ingress design that collapses 50+ ALBs into one. This is the first wiki canonical instance of EKS as a vCluster host; contrast it with Generali's "EKS Auto Mode as stateless app compute" and Salesforce's "1,000+ EKS clusters under Karpenter" to see the three distinct EKS-as-substrate shapes ingested to date.
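The 63-character label limit that blocked parts of Salesforce's migration can be caught with a pre-flight check. The sketch below is hypothetical and is not Salesforce's tool. It assumes each ASG tag value is copied verbatim into a Kubernetes label value on the new node pool and encodes the documented label-value rules: at most 63 characters, empty allowed, and otherwise alphanumeric at both ends with only `-`, `_`, `.`, and alphanumerics in between. Label keys have their own rules and are not checked here.

```python
import re

# Kubernetes label-value syntax (length is checked separately below).
_LABEL_VALUE_RE = re.compile(r"([A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?)?")
MAX_LABEL_VALUE_LEN = 63


def label_value_problems(value: str) -> list[str]:
    """Return the reasons `value` is not a valid label value (empty if valid)."""
    problems = []
    if len(value) > MAX_LABEL_VALUE_LEN:
        problems.append(f"length {len(value)} exceeds {MAX_LABEL_VALUE_LEN}")
    # fullmatch (not match + "$") so a trailing newline cannot slip through.
    if not _LABEL_VALUE_RE.fullmatch(value):
        problems.append("invalid characters or non-alphanumeric start/end")
    return problems


def audit_tags(tags: dict[str, str]) -> dict[str, list[str]]:
    """Hypothetical pre-flight: report tags that cannot become label values as-is."""
    report = {}
    for key, value in tags.items():
        problems = label_value_problems(value)
        if problems:
            report[key] = problems
    return report
```

Run over an ASG's tags before migration, `audit_tags({"cost-center": "payments", "owner": "team/payments"})` flags only `owner`, because `/` is not allowed in a label value. Surfacing these cases up front lets them be renamed or hashed deliberately instead of failing mid-rollout.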
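The static `kubectl` allowlist in the conversational-observability blueprint can be illustrated with a small gate function. This is a minimal sketch that assumes a hypothetical allowlist (`get`, `describe`, `logs`, `top`) and a hypothetical ban on reading Secrets. The blueprint's actual list is not reproduced here. The read-only RBAC service account remains the real enforcement boundary, and a gate like this only adds defense in depth. Approved commands should be executed as an argv list (for example `subprocess.run(argv)`) rather than through a shell, so metacharacters such as `;` are never interpreted.

```python
import shlex

# Hypothetical static allowlist of read-only kubectl subcommands.
READ_ONLY_SUBCOMMANDS = frozenset({"get", "describe", "logs", "top"})
# Hypothetical extra guard: resource types the assistant may not read even
# through a read-only verb (RBAC should deny these as well).
DENIED_RESOURCES = frozenset({"secret", "secrets"})


def is_allowed(command: str) -> bool:
    """Accept only `kubectl <allowlisted-subcommand> ...`; reject everything else.

    Conservative by design: unparseable input, a global flag before the
    subcommand, or any non-allowlisted subcommand is rejected.
    """
    try:
        argv = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    if len(argv) < 2 or argv[0] != "kubectl":
        return False
    if argv[1] not in READ_ONLY_SUBCOMMANDS:
        return False
    # Block denied resource types, including "secret/name" and "a,secret" forms.
    for arg in argv[2:]:
        for item in arg.split(","):
            if item.split("/", 1)[0].lower() in DENIED_RESOURCES:
                return False
    return True
```

The gate deliberately rejects some harmless commands, such as `kubectl -n payments get pods`, where a global flag precedes the subcommand. It trades that convenience for a parser simple enough to audit.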
## EKS's role axis across ingested sources
Same platform, substantially different roles per case study:
| Case study | EKS's role |
|---|---|
| Figma | Application compute tier (multi-cluster active-active) |
| Convera | Backend zero-trust compute tier (per-pod AVP re-validation) |
| Santander | Infrastructure control plane (ArgoCD + OPA + Crossplane) |
| Generali | Multi-tenant app compute tier under Auto Mode |
| SageMaker HyperPod | LLM-inference-platform substrate (EKS add-on packaging) |
| Conversational observability blueprint | AI-troubleshooting target (self-built RAG + MCP variants) |
| AWS DevOps Agent | AI-troubleshooting target (AWS-managed variant) |
| Salesforce | Extreme-scale multi-tenant platform (1,000+ clusters, 1,180+ node pools, Karpenter-driven) |
| Deloitte | Shared host cluster for vCluster virtual clusters (50+ QA environments on 1 EKS cluster) |
This spread is what makes EKS a load-bearing canonical node in the wiki — the same primitive reappears in very different architectures.
## Related
- systems/kubernetes — what EKS runs
- systems/eks-auto-mode — the managed-data-plane variant
- systems/bottlerocket — the AMI under Auto Mode
- systems/amazon-ecs — the AWS orchestrator EKS is compared with
- systems/karpenter, systems/keda — the CNCF auto-scaling projects that motivated Figma's migration
- systems/crossplane, systems/argocd, systems/open-policy-agent — the CNCF trio Catalyst runs on top of EKS to turn it into an internal-developer-platform control plane
- systems/amazon-guardduty, systems/amazon-inspector, systems/aws-network-firewall, systems/external-secrets-operator, systems/amazon-managed-grafana — the peer AWS services documented as integration surface at Generali