Cluster Autoscaler

Cluster Autoscaler (CA) is the CNCF Kubernetes node-level autoscaler that scales cloud-provider node groups up or down based on unschedulable pods and under-utilised nodes. On AWS it scales Auto Scaling groups (ASGs); on other clouds it scales the analogous primitive: Managed Instance Groups on GCP, Virtual Machine Scale Sets (VMSS) on Azure, etc. CA is the previous-generation node autoscaler that Karpenter is displacing on AWS.

How it works

  1. A pod is unschedulable: no existing node has enough free resources, or its scheduling constraints (affinity, taints, topology) can't be satisfied.
  2. CA evaluates the existing node groups, picks one whose template fits the pod, and asks the cloud provider to grow the backing ASG by N.
  3. The ASG launches one or more instances of the template type; they register with the control plane; the scheduler re-runs and places the pod.
  4. Periodically, CA scans for under-utilised nodes, drains them, and asks the ASG to terminate them.

Why it becomes the bottleneck at scale

At large fleets, CA's indirection through ASGs produces three structural problems — all called out by Salesforce's 2026-01-12 migration post:

  • Scaling latency of minutes, not seconds. CA asks the ASG to grow; ASG calls the EC2 RunInstances API; instance goes through cloud-provider launch (boot, ENI attach, userdata, kubelet register); then the scheduler can place pods. Multi-minute p99 on spikes. By contrast, Karpenter provisions directly against pending pods and skips the ASG round trip.
  • Proliferation of node groups. Each distinct workload shape (instance family × size × zone × label set) tends to become its own ASG so CA can pick a homogeneous template. Salesforce's platform had grown to thousands of node groups across 1,180+ node pools (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters).
  • Poor AZ balance + bin-packing inefficiency. ASGs don't bin-pack across instance sizes; CA's scale-down heuristics are conservative to avoid evicting stateful pods. Result: stranded capacity, degraded customer experience on memory-intensive workloads at large cluster sizes.
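The node-group proliferation above is combinatorial: because one ASG holds one template, every axis of workload shape multiplies the ASG count. A back-of-the-envelope sketch with illustrative numbers (not Salesforce's actual inventory):

```python
# Each axis multiplies the number of homogeneous ASGs a cluster needs,
# because a single ASG can only launch one instance template.
instance_families = 6   # e.g. m5, c5, r5, m6i, c6i, r6i (illustrative)
sizes_per_family  = 4   # e.g. large through 4xlarge
zones             = 3   # one ASG per AZ is common for zonal rebalancing
label_sets        = 5   # distinct workload label/taint combinations

asgs_needed = instance_families * sizes_per_family * zones * label_sets
print(asgs_needed)  # → 360 node groups for a single modest cluster
```

A Karpenter NodePool collapses the family/size/zone axes into one resource by allowing heterogeneous instance types, which is why the migration could replace thousands of rigid groups.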

Contrast with Karpenter

|                    | Cluster Autoscaler            | Karpenter                        |
|--------------------|-------------------------------|----------------------------------|
| Capacity primitive | ASG (on AWS)                  | Direct EC2 RunInstances          |
| Scaling latency    | Minutes (ASG round trip)      | Seconds (pending-pod-driven)     |
| Instance diversity | One template per ASG          | Heterogeneous types per NodePool |
| AZ balance         | ASG-driven (poor)             | Scheduler-driven (good)          |
| Consolidation      | Under-utilised-node heuristic | Continuous bin-packing           |
| Config primitive   | Node group / ASG              | NodePool + EC2NodeClass          |

Seen in

  • sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters — the legacy autoscaler Salesforce retired across 1,000+ EKS clusters and 1,180+ node pools. The post's motivation section is effectively a checklist of CA / ASG limitations at extreme scale: multi-minute scaling latency, thousands of rigid node groups, poor AZ balance, conservative scale-down.