AWS EKS (Elastic Kubernetes Service)¶
Amazon EKS (Elastic Kubernetes Service) is AWS's managed Kubernetes control plane — AWS runs the API server, etcd, and the core controllers; customers run worker nodes (on EC2 or Fargate) and their own workloads.
From a system-design standpoint, EKS is the managed-control-plane equivalent of self-hosted Kubernetes, with the same data-plane abstractions (Pods, Services, StatefulSets, Ingress, etc.), the same API / Helm / xDS ecosystem, and the same CNCF toolbox (systems/karpenter, systems/keda, systems/envoy, systems/kyverno, systems/cilium).
Stub page — minimal viable page for the Figma ECS→EKS migration ingest. Expand as future EKS-internals sources are ingested.
Contrast with ECS¶
Figma's 2024 migration post enumerates the EKS advantages that drove their ECS→EKS cutover:
- StatefulSets for stateful workloads — the Kubernetes primitive that gives pods stable network identities across restarts. ECS has no equivalent; Figma had written custom container-startup code to dynamically update etcd cluster membership, which was "fragile and hard to maintain." StatefulSets are the standard way to run etcd on Kubernetes.
- Helm charts ecosystem — easy install / upgrade of OSS software (Figma specifically called out systems/temporal). On ECS, the equivalent required hand-porting each service to systems/terraform.
- Graceful node cordon-and-drain — cordoning marks a bad EC2 node unschedulable, and draining then evicts its pods while honoring PodDisruptionBudgets and graceful-shutdown hooks. ECS on EC2 has no equivalent.
- CNCF auto-scaling — systems/keda for pod-level (with custom metrics like SQS queue depth), systems/karpenter for node-level. ECS has some auto-scaling but the CNCF offerings are more flexible.
- Service-mesh availability — Istio (Envoy-based) is trivial to adopt on EKS; on ECS, equivalent functionality (custom filters, mTLS) would have to be built in-house to match what Istio ships.
- Vendor-agnostic user base — Kubernetes's cross-cloud community attracts more external investment than the AWS-only ECS ecosystem.
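The stable-identity property of StatefulSets can be sketched in a minimal manifest. This is illustrative only — the names, image tag, and replica count are hypothetical, not taken from the Figma post:

```yaml
# Minimal StatefulSet sketch: each replica gets a stable DNS name
# (etcd-0.etcd, etcd-1.etcd, ...) that survives pod restarts, which is
# what makes static etcd cluster membership declarable up front.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd          # headless Service that provides the stable per-pod DNS
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.0
          args:
            - etcd
            - --name=$(POD_NAME)
            - --initial-cluster=etcd-0=http://etcd-0.etcd:2380,etcd-1=http://etcd-1.etcd:2380,etcd-2=http://etcd-2.etcd:2380
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
```

The point of contrast with ECS: the `--initial-cluster` membership list can be written statically because pod identities are stable, which is exactly what Figma's custom container-startup code had to reconstruct dynamically.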
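The cordon-and-drain flow described above is a two-step kubectl operation; a sketch with a hypothetical node name:

```shell
# 1. Cordon: mark the node unschedulable so no new pods land on it.
kubectl cordon ip-10-0-1-23.ec2.internal

# 2. Drain: evict the existing pods, honoring PodDisruptionBudgets and
#    each pod's terminationGracePeriodSeconds / preStop hooks.
kubectl drain ip-10-0-1-23.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=-1   # -1 means use each pod's own grace period
```

Requires a live cluster and appropriate RBAC; shown only to make the "respects shutdown hooks" claim concrete.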
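A KEDA ScaledObject driving pod count from SQS queue depth — the custom-metric case called out above — looks roughly like this (deployment name, queue URL, and thresholds are hypothetical):

```yaml
# KEDA scales the target Deployment on SQS backlog rather than CPU.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker            # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs
        queueLength: "100"  # target messages per replica
        awsRegion: us-east-1
```

Karpenter then handles the node-level half: when KEDA's new replicas don't fit, it provisions EC2 capacity to match the pending pods.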
Seen in¶
- sources/2024-08-08-figma-migrated-onto-k8s-in-less-than-12-months — Figma's target platform. Three active EKS clusters per environment receive real traffic for every service — patterns/multi-cluster-active-active-redundancy reduces the blast radius of cluster-scoped incidents (like the CoreDNS destruction they describe) to ~1/3 of traffic.
- sources/2026-02-05-aws-convera-verified-permissions-fine-grained-authorization — EKS as the backend compute tier in Convera's multi-tenant SaaS flow: API Gateway forwards requests to Kubernetes pods with `tenant_id` in a custom header; each pod re-validates with AVP against the tenant's policy store before building a tenant context and forwarding to RDS. EKS pods are the site of the second authorization check in Convera's zero-trust chain.
- sources/2026-02-26-aws-santander-catalyst-platform-engineering — EKS as the internal-developer-platform control-plane cluster in Santander Catalyst — "the brain of the operation, orchestrating all components and workflows". One EKS cluster hosts three load-bearing sub-components: ArgoCD for data-plane claims (GitOps), OPA Gatekeeper for the policies catalog (patterns/policy-gate-on-provisioning), and Crossplane for the stacks catalog (patterns/crossplane-composition). This is EKS used as an infrastructure control plane, not an application compute tier — a fundamentally different role from Figma (app compute) or Convera (backend zero-trust tier), and the canonical wiki instance of EKS as the substrate for a multi-cloud internal developer platform.
- sources/2026-03-18-aws-ai-powered-event-response-for-amazon-eks — EKS as the investigation target of AWS DevOps Agent. One Agent Space per EKS cluster; the agent combines a Kubernetes API resource scan (the graph nodes: Pods / Deployments / Services / ConfigMaps / Ingress / NetworkPolicies with their metadata, resource specs, and health checks) with OpenTelemetry-derived runtime relationships (the graph edges: service-mesh traffic, distributed traces, metric attribution) into a unified dependency graph used for root-cause analysis. See concepts/telemetry-based-resource-discovery for the discovery methodology and systems/aws-devops-agent for the full investigation workflow.
- sources/2026-03-23-aws-generali-malaysia-eks-auto-mode — EKS in its Auto Mode variant at Generali Malaysia: AWS operates the K8s data plane as well (Bottlerocket AMI on a weekly-replacement cadence, default add-ons, cluster-version upgrades). Canonical wiki reference for the peer-AWS-service integration surface of EKS — the case study documents six managed services plugged into one cluster: GuardDuty (threat detection), Inspector (vuln scanning with ECR-to-running-container mapping), Network Firewall (SNI egress allow-list), Secrets Manager + External Secrets Operator (env-var secret injection, no volume mounts), Amazon Managed Grafana (per-namespace dashboards), and AWS Billing's split cost allocation data for EKS (patterns/eks-cost-allocation-tags). Compound operating discipline: stateless-only pods + immutable pods + Helm-as-standard-packaging + HPA auto-scaling. Customer-retained safety contract under Auto Mode's platform-driven node churn: Pod Disruption Budgets + Node Disruption Budgets + off-peak maintenance window. See systems/generali-malaysia-eks for the full platform synthesis.
- sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod — EKS as the Kubernetes control plane under SageMaker HyperPod inference, and (more generally) as the packaging substrate for the EKS add-on primitive. The 2026-04-06 repackaging of the HyperPod Inference Operator from Helm chart to native EKS add-on is the canonical wiki instance of patterns/eks-add-on-as-lifecycle-packaging — four dependency add-ons bundled (cert-manager, S3 Mountpoint CSI, FSx CSI, metrics-server), four IAM roles scaffolded (execution, JumpStart gated, ALB controller, KEDA), and a migration script (helm_to_addon.sh) with auto-discovery + OVERWRITE install + rollback semantics. Highlights the EKS add-on API (aws eks create-addon --configuration-values) as a managed-lifecycle packaging primitive that sits alongside Helm as a distribution path for AWS-authored K8s operators.
- sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters — EKS as the 1,000-cluster / 1,180-node-pool production platform at Salesforce — the largest documented EKS fleet in the wiki. Canonical wiki reference for EKS-at-extreme-scale operations: Karpenter migration off Cluster Autoscaler + ASGs with an in-house transition tool (zero-disruption + PDB-respecting drain + rollback-to-ASG + CI/CD-integrated); automated ASG→NodePool/EC2NodeClass config mapping over 1,180+ node pools; the five generalisable operational lessons (PDB hygiene with OPA-enforced admission, sequential node cordoning with verification checkpoints, [[concepts/kubernetes-label-length-limit|63-character label limit]] as migration-blocker, singleton-workload protection under bin-packing consolidation, 1:1 ephemeral-storage translation). Outcome metrics: scaling latency minutes → seconds; 80% manual-ops reduction; 5% FY2026 cost savings (+5-10% projected for FY2027); eliminated thousands of node groups; heterogeneous GPU / ARM / x86 in single node pools. Rollout: mid-2025 → early 2026, phased with soak times under risk-based sequencing.
- sources/2025-12-11-aws-architecting-conversational-observability-for-cloud-applications
— EKS as the investigation target in a self-built AI troubleshooting blueprint, companion to the later AWS-managed DevOps Agent post. Same target, different vendor relationship: a customer-built RAG chatbot (Fluent Bit → Kinesis → Lambda + Bedrock embeddings → OpenSearch Serverless) or Strands-based agent system (with EKS MCP Server for cluster operations) investigates an EKS cluster via an in-cluster troubleshooting assistant pod running with a read-only RBAC service account and a static kubectl allowlist (patterns/allowlisted-read-only-agent-actions). Combined stored telemetry (patterns/telemetry-to-rag-pipeline) + live kubectl output drives an iterative LLM ↔ cluster loop until the LLM judges it has enough context for resolution. Framing asserts ECS and Lambda as equally valid fabrics for the same approach, though only EKS is demonstrated.
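The Pod Disruption Budget half of Generali's customer-retained safety contract can be sketched as follows (app name and threshold are hypothetical):

```yaml
# A PDB caps voluntary evictions — e.g. Auto Mode's weekly Bottlerocket
# node replacement — so at least 2 replicas of the app stay up during churn.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```

The same primitive recurs in the Salesforce entry, where PDB hygiene (enforced via OPA admission control) was one of the five operational lessons of the Karpenter migration.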
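The EKS add-on API mentioned in the HyperPod entry is driven through calls like the following sketch. The cluster name, add-on name, and values file are hypothetical placeholders, not taken from the source:

```shell
# Install a native EKS add-on with custom configuration;
# --resolve-conflicts OVERWRITE mirrors the migration script's
# overwrite-install semantics described in the source.
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name example-hyperpod-inference-operator \
  --configuration-values file://values.json \
  --resolve-conflicts OVERWRITE
```

This is the managed-lifecycle alternative to `helm install`: AWS owns versioning and upgrade mechanics, while the customer supplies only configuration values.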
EKS's role axis across ingested sources¶
Same platform, substantially different roles per case study:
| Customer | EKS's role |
|---|---|
| Figma | Application compute tier (multi-cluster active-active) |
| Convera | Backend zero-trust compute tier (per-pod AVP reval) |
| Santander | Infrastructure control plane (ArgoCD + OPA + Crossplane) |
| Generali | Multi-tenant app compute tier under Auto Mode |
| SageMaker HyperPod | LLM-inference-platform substrate (EKS add-on packaging) |
| Conversational observability blueprint | AI-troubleshooting target (self-built RAG + MCP variants) |
| AWS DevOps Agent | AI-troubleshooting target (AWS-managed variant) |
| Salesforce | Extreme-scale multi-tenant platform (1,000+ clusters, 1,180+ node pools, Karpenter-driven) |
This spread is what makes EKS a load-bearing canonical node in the wiki — the same primitive reappears in very different architectures.
Related¶
- systems/kubernetes — what EKS runs
- systems/eks-auto-mode — the managed-data-plane variant
- systems/bottlerocket — the AMI under Auto Mode
- systems/amazon-ecs — the AWS orchestrator EKS is compared with
- systems/karpenter, systems/keda — the CNCF auto-scaling projects that motivated Figma's migration
- systems/crossplane, systems/argocd, systems/open-policy-agent — the CNCF trio Catalyst runs on top of EKS to turn it into an internal-developer-platform control plane
- systems/amazon-guardduty, systems/amazon-inspector, systems/aws-network-firewall, systems/external-secrets-operator, systems/amazon-managed-grafana — the peer AWS services documented as integration surface at Generali