Cluster Autoscaler¶
Cluster Autoscaler (CA) is the CNCF Kubernetes node-level autoscaler that scales cloud-provider node groups up or down based on unschedulable pods and under-utilised nodes. On AWS it scales Auto Scaling groups (ASGs); on other clouds it scales the equivalent primitive — Managed Instance Groups on GCP, Virtual Machine Scale Sets on Azure, etc. CA is the previous generation of node autoscaler, the one Karpenter is displacing on AWS.
How it works¶
- Pod is unschedulable (no node has enough resources, or the affinity can't be satisfied).
- CA simulates scheduling against the existing node groups' templates, picks one whose template fits the pod, and asks the cloud provider to grow the backing group (the ASG on AWS) by N.
- The ASG launches one or more instances of the template type; they register with the control plane; the scheduler re-runs and places the pod.
- Periodically CA scans for under-utilised nodes and asks the ASG to terminate them.
Why it becomes the bottleneck at scale¶
At large fleets, CA's indirection through ASGs produces three structural problems — all called out by Salesforce's 2026-01-12 migration post:
- Scaling latency of minutes, not seconds. CA asks the ASG to grow; ASG calls the EC2 RunInstances API; instance goes through cloud-provider launch (boot, ENI attach, userdata, kubelet register); then the scheduler can place pods. Multi-minute p99 on spikes. By contrast, Karpenter provisions directly against pending pods and skips the ASG round trip.
- Proliferation of node groups. Each distinct workload shape (instance family × size × zone × label set) tends to become its own ASG so CA can pick a homogeneous template. Salesforce's platform had grown to 1,180+ node pools spread across thousands of node groups (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters).
- Poor AZ balance + bin-packing inefficiency. ASGs don't bin-pack across instance sizes; CA's scale-down heuristics are conservative to avoid evicting stateful pods. Result: stranded capacity, degraded customer experience on memory-intensive workloads at large cluster sizes.
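The stranded-capacity effect in the last bullet can be made concrete with a toy bin-pack. All numbers are hypothetical, and first-fit-decreasing stands in for CA's real scheduler simulation:

```python
# Toy bin-pack showing why a single homogeneous template strands capacity.
# Pod requests and node sizes are made-up illustration numbers.

def nodes_needed_homogeneous(pod_mem_requests, node_mem):
    """First-fit-decreasing pack of memory requests (GiB) onto identical
    nodes of node_mem GiB. Returns (node count, total stranded GiB)."""
    free = []                            # remaining GiB on each open node
    for req in sorted(pod_mem_requests, reverse=True):
        for i, f in enumerate(free):
            if f >= req:
                free[i] -= req
                break
        else:
            free.append(node_mem - req)  # no node fits: open a new one
    return len(free), sum(free)

# Three 10 GiB pods on a fleet locked to a 16 GiB template: three nodes,
# with 18 of 48 GiB stranded (~37%). A provisioner free to launch
# right-sized (~12 GiB) nodes would strand a fraction of that.
count, stranded = nodes_needed_homogeneous([10, 10, 10], node_mem=16)
```

This is the gap Karpenter's heterogeneous instance selection closes: it picks an instance type per batch of pending pods instead of packing into a fixed template.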
Contrast with Karpenter¶
| | Cluster Autoscaler | Karpenter |
|---|---|---|
| Capacity primitive | ASG (on AWS) | Direct EC2 RunInstances |
| Scaling latency | Minutes (ASG round trip) | Seconds (pending-pod-driven) |
| Instance diversity | One template per ASG | Heterogeneous types per NodePool |
| AZ balance | ASG-driven (poor) | Scheduler-driven (good) |
| Consolidation | Under-utilised-node heuristic | Continuous bin-packing |
| Config primitive | Node group / ASG | NodePool + EC2NodeClass |
Seen in¶
- sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters — the legacy autoscaler Salesforce retired across 1,000+ EKS clusters and 1,180+ node pools. The post's motivation section is effectively a checklist of CA / ASG limitations at extreme scale: multi-minute scaling latency, thousands of rigid node groups, poor AZ balance, conservative scale-down.
Related¶
- systems/karpenter — the successor system
- systems/aws-auto-scaling-groups — the AWS capacity primitive CA drives
- systems/aws-eks — typical Kubernetes runtime
- systems/kubernetes — the orchestrator whose pending-pod signal CA consumes
- concepts/scaling-latency — the metric CA loses on
- concepts/bin-packing — the primitive CA does poorly