Cluster Autoscaler¶
Cluster Autoscaler (CA) is the CNCF Kubernetes node-level autoscaler that scales cloud-provider node groups up or down based on unschedulable pods and under-utilised nodes. On AWS it scales Auto Scaling groups (ASGs); on other clouds it scales the equivalent primitive — Managed Instance Groups on GCP, Virtual Machine Scale Sets on Azure, etc. CA is the previous generation of node autoscaler, the one Karpenter is displacing on AWS.
How it works¶
- Pod is unschedulable (no node has enough resources, or the affinity can't be satisfied).
- CA simulates scheduling against the existing node groups' templates, picks one whose template fits the pod, and asks the cloud provider to grow the backing group (the ASG on AWS) by N.
- The ASG launches one or more instances of the template type; they register with the control plane; the scheduler re-runs and places the pod.
- Periodically CA scans for under-utilised nodes and asks the ASG to terminate them.
Why it becomes the bottleneck at scale¶
At large fleets, CA's indirection through ASGs produces three structural problems — all called out by Salesforce's 2026-01-12 migration post:
- Scaling latency of minutes, not seconds. CA asks the ASG to grow; ASG calls the EC2 RunInstances API; instance goes through cloud-provider launch (boot, ENI attach, userdata, kubelet register); then the scheduler can place pods. Multi-minute p99 on spikes. By contrast, Karpenter provisions directly against pending pods and skips the ASG round trip.
- Proliferation of node groups. Each distinct workload shape (instance family × size × zone × label set) tends to become its own ASG so CA can pick a homogeneous template. Salesforce's platform had grown to 1,180+ node pools spread across thousands of node groups (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters).
- Poor AZ balance + bin-packing inefficiency. ASGs don't bin-pack across instance sizes; CA's scale-down heuristics are conservative to avoid evicting stateful pods. Result: stranded capacity, degraded customer experience on memory-intensive workloads at large cluster sizes.
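The stranded-capacity effect in the last bullet can be made concrete with a toy bin-pack. All numbers are hypothetical, and first-fit-decreasing stands in for CA's real scheduler simulation:

```python
# Toy bin-pack showing why a single homogeneous template strands capacity.
# Pod requests and node sizes are made-up illustration numbers.

def nodes_needed_homogeneous(pod_mem_requests, node_mem):
    """First-fit-decreasing pack of memory requests (GiB) onto identical
    nodes of node_mem GiB. Returns (node count, total stranded GiB)."""
    free = []                            # remaining GiB on each open node
    for req in sorted(pod_mem_requests, reverse=True):
        for i, f in enumerate(free):
            if f >= req:
                free[i] -= req
                break
        else:
            free.append(node_mem - req)  # no node fits: open a new one
    return len(free), sum(free)

# Three 10 GiB pods on a fleet locked to a 16 GiB template: three nodes,
# with 18 of 48 GiB stranded (~37%). A provisioner free to launch
# right-sized (~12 GiB) nodes would strand a fraction of that.
count, stranded = nodes_needed_homogeneous([10, 10, 10], node_mem=16)
```

This is the gap Karpenter's heterogeneous instance selection closes: it picks an instance type per batch of pending pods instead of packing into a fixed template.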
Contrast with Karpenter¶
| | Cluster Autoscaler | Karpenter |
|---|---|---|
| Capacity primitive | ASG (on AWS) | Direct EC2 RunInstances |
| Scaling latency | Minutes (ASG round trip) | Seconds (pending-pod-driven) |
| Instance diversity | One template per ASG | Heterogeneous types per NodePool |
| AZ balance | ASG-driven (poor) | Scheduler-driven (good) |
| Consolidation | Under-utilised-node heuristic | Continuous bin-packing |
| Config primitive | Node group / ASG | NodePool + EC2NodeClass |
Seen in¶
- sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters — the legacy autoscaler Salesforce retired across 1,000+ EKS clusters and 1,180+ node pools. The post's motivation section is effectively a checklist of CA / ASG limitations at extreme scale: multi-minute scaling latency, thousands of rigid node groups, poor AZ balance, conservative scale-down.
Related¶
- systems/karpenter — the successor system
- systems/aws-auto-scaling-groups — the AWS capacity primitive CA drives
- systems/aws-eks — typical Kubernetes runtime
- systems/kubernetes — the orchestrator whose pending-pod signal CA consumes
- concepts/scaling-latency — the metric CA loses on
- concepts/bin-packing — the primitive CA does poorly