AWS 2026-01-12

How Salesforce migrated from Cluster Autoscaler to Karpenter across their fleet of 1,000 EKS clusters

Summary

AWS Architecture Blog case study (2026-01-12) documenting Salesforce's mid-2025 → early-2026 migration of its Kubernetes platform — 1,000+ Amazon EKS clusters, 1,180+ node pools, thousands of internal tenants — from the Kubernetes Cluster Autoscaler (running over AWS Auto Scaling groups) to Karpenter, AWS's open-source pod-driven node autoscaler.

The motivation was structural: the Auto Scaling group / Cluster Autoscaler combination produced multi-minute scaling latency during demand spikes, thousands of rigid node groups that slowed innovation, poor bin-packing and conservative scale-down that stranded resources, and poor AZ balance plus large-cluster performance bottlenecks for memory-intensive workloads.

Salesforce built two in-house tools — a Karpenter transition tool (cordon + PDB-respecting drain + rollback) and a Karpenter patching check tool (AMI validation) — embedded them in the provisioning CI/CD pipeline, automated the Auto-Scaling-group-config → NodePool / EC2NodeClass mapping, and rolled out with soak times, starting from the least-critical environments.

Outcomes: an 80% reduction in manual operational overhead, scaling latency cut from minutes to seconds, 5% FY2026 cost savings with a further 5-10% projected for FY2027, elimination of thousands of node groups, plus developer self-service and heterogeneous instance types (GPU / ARM / x86) in a single node pool. Five named operational lessons — PDB hygiene (OPA-enforced), sequential not parallel cordoning, the 63-character label-length limit, singleton-pod protection under bin-packing consolidation, and 1:1 ephemeral-storage mapping — are the reusable substance of the post.

Key takeaways

  1. Auto Scaling groups become the scaling bottleneck at scale. Salesforce's pre-migration architecture of one node group per workload-shape × AZ had grown to thousands of ASGs. Cluster Autoscaler scales by asking an ASG to grow/shrink, which adds minutes of scaling latency during demand spikes; Karpenter by contrast looks at pending pods and provisions instances directly, collapsing latency to seconds. (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)

  2. At 1,000+ clusters and 1,180+ node pools, manual migration is infeasible — the migration itself becomes a product. Salesforce built the Karpenter transition tool (orchestrates cordon-and-drain with PDB respect, rollback-to-ASG, CI/CD-integrated) and the Karpenter patching check tool (AMI validation). Embedding them in the core infrastructure provisioning pipeline is what made the rollout repeatable across thousands of clusters and made rollback a first-class operation — not a scramble. (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)

  3. Automate the configuration mapping; don't hand-translate. Legacy ASG fields map cleanly onto Karpenter concepts: ASG instance types → EC2NodeClass instance types, root-volume sizes/IOPS/type/throughput → EC2NodeClass storage parameters, node labels → NodePool + EC2NodeClass labels. With 1,180 node pools of highly diverse config, automated mapping was essential to minimise human error. (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)
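
The field-level mapping can be sketched as a small translation function. This is an illustrative sketch, not Salesforce's actual tooling: the input field names are those quoted in the post's example config, and the output fragments follow the shape of Karpenter's EC2NodeClass block-device mapping and NodePool requirements.

```python
# Hypothetical sketch of the automated ASG -> Karpenter config mapping.
# Input field names come from the legacy config example quoted in the post;
# output shapes mirror EC2NodeClass/NodePool fragments. Not Salesforce's tool.

def translate_asg(asg: dict) -> dict:
    """Map legacy ASG launch-template fields onto Karpenter resource fragments."""
    return {
        # root-volume size/IOPS/type/throughput -> EC2NodeClass storage parameters
        "ec2nodeclass_storage": {
            "deviceName": "/dev/xvda",  # assumed root device name
            "ebs": {
                "volumeSize": f"{asg['k8s_root_volume_size']}Gi",
                "volumeType": asg["k8s_root_volume_type"],
                "iops": asg["k8s_root_volume_iops"],
                "throughput": asg["k8s_root_volume_throughput"],
            },
        },
        # ASG instance type -> NodePool instance-type requirement
        "nodepool_requirements": [
            {
                "key": "node.kubernetes.io/instance-type",
                "operator": "In",
                "values": [asg["k8s_instance_type"]],
            }
        ],
    }


legacy = {  # subset of the example config surfaced in the post
    "k8s_instance_type": "m6i.8xlarge",
    "k8s_root_volume_size": 100,
    "k8s_root_volume_iops": 3000,
    "k8s_root_volume_type": "gp3",
    "k8s_root_volume_throughput": 125,
}
fragments = translate_asg(legacy)
```

Running the translation once per node pool, inside the provisioning pipeline, is what removes the hand-translation step the takeaway warns against.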

  4. Pod Disruption Budgets are where Karpenter's consolidation meets application reality — treat them as a governance primitive, not a knob. Several Salesforce services had overly restrictive or misconfigured PDBs that blocked node replacements entirely. Remediation had three parts: audit the existing configurations, partner with application owners to fix them, and install OPA policies for proactive PDB validation at admission. (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)
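
A sketch of the kind of PDB that coexists with Karpenter consolidation — illustrative only, since the post shares neither Salesforce's PDB configurations nor the OPA rules that validate them. With two or more replicas, maxUnavailable: 1 lets a drain proceed one pod at a time instead of blocking node replacement outright.

```yaml
# Illustrative PDB (hypothetical names); assumes the workload runs >= 2 replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-service-pdb
spec:
  maxUnavailable: 1          # permits one voluntary eviction at a time
  selector:
    matchLabels:
      app: example-service
```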

  5. Sequential, checkpointed node cordoning beats parallel cordoning. The team's initial approach — cordoning Karpenter nodes in parallel — caused unexpected cluster-health issues. They reworked it to sequential node cordoning with manual verification checkpoints (with rollback) and enhanced monitoring for early instability detection. Modern tooling doesn't remove the need for careful orchestration of node maintenance; it shifts what needs orchestrating. See patterns/sequential-node-cordoning. (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)
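
The control flow of the reworked approach can be sketched as a small loop with a per-node checkpoint and full rollback. This is an illustrative sketch, not the Karpenter transition tool: cordon, uncordon, and healthy are injected callables that in practice would wrap kubectl or the Kubernetes API and human verification.

```python
# Sequential, checkpointed cordoning with rollback (illustrative sketch).
# cordon/uncordon/healthy are injected so the orchestration logic stays
# testable; in production they would call the Kubernetes API.

def cordon_sequentially(nodes, cordon, uncordon, healthy):
    """Cordon nodes one at a time; on any failed health checkpoint,
    uncordon everything done so far and report failure."""
    done = []
    for node in nodes:
        cordon(node)
        done.append(node)
        if not healthy():              # checkpoint after every single node
            for n in reversed(done):   # rollback to the pre-cordon state
                uncordon(n)
            return False
    return True
```

The contrast with the failed first attempt is the one-node-at-a-time loop plus the checkpoint inside it: parallel cordoning removes both, which is what caused the cluster-health issues.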

  6. Kubernetes's 63-character label-length limit (see concepts/kubernetes-label-length-limit) is a migration blocker hiding in plain sight. Salesforce's human-friendly legacy naming (example quoted: analytics-bigdata-spark-executor-pool-m6a-32xlarge-az-a-b-c — 67 chars) produced metadata.labels: Invalid value: must be no more than 63 characters errors from Karpenter's label-dependent operations. The fix was refactoring naming conventions cluster-wide before the switch. "Seemingly minor technical constraints can become significant blockers in automated infrastructure management if not properly addressed early." (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)
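
The pre-migration audit this lesson implies is a few lines of code; a sketch (hypothetical function name, not from the post):

```python
# Audit pool names against the Kubernetes label-value length limit before
# migrating; names over the limit will fail Karpenter's label-dependent
# operations. Illustrative sketch.

MAX_LABEL_LEN = 63  # Kubernetes limit for label values

def oversized_pool_names(pool_names):
    """Return the names that would need refactoring before the switch."""
    return [n for n in pool_names if len(n) > MAX_LABEL_LEN]
```

Running such a check over all 1,180+ node-pool names would surface every blocker of this class up front rather than mid-rollout.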

  7. Karpenter's bin-packing + consolidation can terminate single-replica pods without warning. A well-known gotcha of efficient bin-packing schedulers: consolidation (re-packing workloads onto fewer, larger nodes) can evict a singleton before a replacement has had time to start, causing service disruption. Salesforce's response was to roll out guaranteed pod lifetime features and workload-aware disruption policies to safeguard singletons. Reinforces the principle that "effective auto scaling solutions must balance infrastructure efficiency with application availability requirements, particularly for mission-critical services." (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)
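
One mechanism Karpenter itself provides for this is a per-pod opt-out from voluntary disruption; whether this is exactly what Salesforce rolled out is not specified in the post. Illustrative sketch with hypothetical names:

```yaml
# Pod-level opt-out from Karpenter consolidation (names/image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: singleton-worker
  annotations:
    karpenter.sh/do-not-disrupt: "true"   # Karpenter will not voluntarily evict this pod
spec:
  containers:
    - name: worker
      image: example/worker:latest
```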

  8. Ephemeral-storage settings must be 1:1 translated, not defaulted. Some workloads failed to schedule after migration because ephemeral-storage configuration was incomplete on the Karpenter side. Fix: implement precise 1:1 mappings between ASG-defined volume settings and EC2NodeClass parameters. I/O-intensive applications are the canonical casualty. (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)
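
The failure mode is easy to reproduce: a pod that requests ephemeral storage (illustrative values below) stays Pending if the EC2NodeClass was left with a smaller default root volume than the old ASG nodes had, because the node's allocatable ephemeral-storage no longer covers the request.

```yaml
# Illustrative pod fragment: this request is only schedulable if the
# EC2NodeClass root-volume settings were translated 1:1 from the ASG.
apiVersion: v1
kind: Pod
metadata:
  name: io-intensive-app        # hypothetical
spec:
  containers:
    - name: app
      image: example/app:latest # placeholder image
      resources:
        requests:
          ephemeral-storage: 50Gi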

  9. Heterogeneous instance types inside one node pool is a true capability unlock, not a nice-to-have. Post-Karpenter, a single Salesforce node pool can host GPU / ARM / x86 instances together — the scheduler picks whichever type best fits the pending pods right now. For platform teams this collapses pool-count × node-shape-count from an O(N×M) grid into a much smaller O(N) set. It also improves IP efficiency by decoupling node provisioning from specific subnets. (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)
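
A sketch of what such a mixed-architecture NodePool looks like in Karpenter's v1 API — illustrative values only; GPU workloads additionally need accelerator requirements, taints, and device plugins, omitted here:

```yaml
# One NodePool spanning x86 and ARM (hypothetical names, illustrative values).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]   # scheduler picks per pending pods
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```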

  10. Phased, risk-graded rollout is how you migrate 1,000 production clusters without a regression. Salesforce's sequencing: mid-2025 through early 2026, soak times between stages, least-critical environments first to validate tooling and ops, high-stakes production last. This is the phased-migration-with-soak-times pattern at its canonical scale. (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)

Operational numbers

  • Fleet scale: 1,000+ EKS clusters ("one of the world's most complex Kubernetes platforms"); 1,180+ node pools.
  • Tenants: thousands of internal tenants across Salesforce.
  • Scaling latency: minutes (Cluster Autoscaler / ASG-driven) → seconds (Karpenter pending-pod-driven).
  • Operational overhead: 80% reduction in manual ops attributable to automation + self-service.
  • Cost savings: 5% FY2026 (rollout in progress); projected additional 5-10% FY2027.
  • Rollout window: mid-2025 → early 2026 (multi-stage with soak times).
  • Industry context: Datadog reports a 22% increase in Karpenter-provisioned node share over the last two years across surveyed Kubernetes fleets.
  • Label-length limit: Kubernetes metadata labels = max 63 characters (the breaking constant).
  • Example config surfaced in the post (legacy ASG mapping input): k8s_instance_type: m6i.8xlarge, k8s_root_volume_size: 100, k8s_root_volume_iops: 3000, k8s_root_volume_type: gp3, k8s_root_volume_throughput: 125, k8s_min_node_number: 300, k8s_max_node_number: 2500, multi_az_provisioned_workers: false, asg_launch_type: launch_template, gpu_enabled: false.

Caveats

  • Not all numbers are disclosed. The post names a 5% FY2026 cost savings and a 5-10% FY2027 projection but does not enumerate compute cost base, workload mix, or FinOps methodology.
  • "80% reduction in manual operational overhead" is headline framing from the blog — no baseline measurement methodology disclosed.
  • No breakdown per workload class. The narrative names GPU / ARM / x86 + memory-intensive workloads but does not break down outcomes by workload type.
  • OPA PDB-validation policies are named but not shared. The specific rules enforced (minAvailable >= 2, required maxUnavailable bounds, etc.) are left as implementation detail.
  • The Karpenter transition tool / patching-check tool are not open-sourced as of publication. They are described but not linked.
  • Publication context. AWS Architecture Blog — the author list was truncated in the raw capture but the post sits in AWS's customer-architecture-case-study genre. Treat the success-metrics framing as AWS's voice, the operational lessons as Salesforce's substantive content.
  • Timeline is in-progress at publication. "With the Karpenter rollout still in progress…" — the FY2027 projection is a projection, not a measurement.
