
CONCEPT Cited by 2 sources

Scaling latency

Scaling latency is the time between "we need more capacity" and "that capacity is actually serving traffic." For a cluster scaler, it decomposes roughly as:

scaling_latency =
    detection_time       # how long to notice we're under-resourced
  + decision_time        # pick an instance type / count / zone
  + provisioning_time    # cloud provider actually launches compute
  + bootstrap_time       # boot + userdata + kubelet register
  + pod_scheduling_time  # scheduler places pods on the new node
  + pod_ready_time       # container pull + startup + healthcheck

Different autoscalers fail on different parts of this sum.
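The decomposition can be sketched as a simple sum over per-stage budgets. The numbers below are illustrative assumptions, not measurements; the point is that the two stacks spend their time in different terms:

```python
# Hypothetical per-stage budgets in seconds -- values are illustrative
# assumptions, not benchmarks from the cited sources.
CA_ASG = {
    "detection": 30,      # CA scan interval + unschedulable-pod signal
    "decision": 10,       # match pending pods to a node group
    "provisioning": 90,   # CA -> ASG -> EC2 RunInstances round trip
    "bootstrap": 120,     # boot + userdata + kubelet register
    "pod_scheduling": 5,  # scheduler places pods on the new node
    "pod_ready": 45,      # image pull + startup + healthcheck
}

# Karpenter keeps the node-side costs but shrinks detection, decision,
# and provisioning by calling EC2 directly.
KARPENTER = {**CA_ASG, "detection": 2, "decision": 3, "provisioning": 40}

def scaling_latency(stages: dict[str, int]) -> int:
    """Total scaling latency is just the sum of the stage budgets."""
    return sum(stages.values())

print(scaling_latency(CA_ASG), scaling_latency(KARPENTER))
```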

Why it matters

  • User-visible p99 latency during traffic spikes. If scaling_latency > spike_duration, your spike passes through the system as errors / queue backpressure / timeouts before capacity catches up.
  • Over-provision as compensation. Teams with slow autoscalers over-provision baseline capacity to hide scaling latency — which converts an autoscaler problem into a cost problem.
  • Service-level disruption. Salesforce's post names "multi-minute delays during demand spikes and degraded user experience" as one of the central ASG / Cluster Autoscaler limitations that drove the migration to Karpenter (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters).
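The first bullet's condition (scaling_latency > spike_duration) can be made concrete with a toy shed-load model. Everything here is an illustrative assumption: the function name, the numbers, and the simplification that all overload during the unprotected window becomes errors:

```python
# Toy model: if new capacity arrives after the spike has passed, every
# second of overload is shed as errors / timeouts. Illustrative only.
def errors_during_spike(spike_rps: int, base_capacity_rps: int,
                        spike_duration_s: int, scaling_latency_s: int) -> int:
    overload_rps = max(spike_rps - base_capacity_rps, 0)
    # New capacity only protects the part of the spike it overlaps with.
    unprotected_s = min(spike_duration_s, scaling_latency_s)
    return overload_rps * unprotected_s

# A 60 s spike against a 300 s scaling latency: the whole spike is unprotected.
print(errors_during_spike(500, 200, 60, 300))   # 300 rps * 60 s = 18000
# Cut scaling latency to 30 s and only the first 30 s overloads.
print(errors_during_spike(500, 200, 60, 30))    # 300 rps * 30 s = 9000
```

This is also why over-provisioning "works" as a workaround: raising base_capacity_rps drives overload_rps to zero at a cost proportional to the headroom held 24/7.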

The Cluster Autoscaler / ASG tax

Kubernetes's Cluster Autoscaler on AWS inflates scaling latency through a chain of indirections:

  1. Cluster Autoscaler sees an unschedulable pod.
  2. Asks the matching Auto Scaling group to grow.
  3. ASG calls EC2 RunInstances.
  4. Instance launches, boots, ENI attaches, userdata runs, kubelet registers.
  5. Now the scheduler can place pods.

Each hop has its own latency budget; the sum at scale is minutes.

Karpenter's collapse to seconds

Karpenter eliminates the ASG hop (steps 2-3 of the chain above) by calling EC2 itself:

  1. Karpenter sees pending pods.
  2. Picks instance types + AZs by solving a bin-packing problem over pending pods vs. the EC2NodeClass's allowed instance types.
  3. Calls EC2 RunInstances directly.
  4. Node boots and registers.
  5. Scheduler places pods.

Salesforce reports minutes → seconds on this metric post-migration (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters). The win isn't a faster step — it's eliminating the ASG round trip entirely.
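Step 2's instance selection can be sketched as first-fit-decreasing bin packing over pod CPU requests. This is a minimal illustration of the idea only — the names, the single-resource model, and the heuristic are assumptions; Karpenter's real solver also weighs memory, price, AZ, and consolidation:

```python
from dataclasses import dataclass

@dataclass
class InstanceType:
    name: str
    cpu: float  # allocatable vCPU (single-resource simplification)

def pick_nodes(pod_cpu: list[float],
               allowed: list[InstanceType]) -> list[dict]:
    # First-fit-decreasing: pack the biggest pods first; when no open
    # node fits, "launch" the smallest allowed type that holds the pod.
    by_size = sorted(allowed, key=lambda t: t.cpu)
    nodes: list[dict] = []  # each: {"type", "free", "pods"}
    for req in sorted(pod_cpu, reverse=True):
        target = next((n for n in nodes if n["free"] >= req), None)
        if target is None:
            itype = next(t for t in by_size if t.cpu >= req)
            target = {"type": itype, "free": itype.cpu, "pods": []}
            nodes.append(target)
        target["pods"].append(req)
        target["free"] -= req
    return nodes

allowed = [InstanceType("m5.large", 2.0), InstanceType("m5.xlarge", 4.0)]
for n in pick_nodes([1.5, 1.0, 0.5, 3.0], allowed):
    print(n["type"].name, n["pods"])
# m5.xlarge [3.0, 1.0]
# m5.large [1.5, 0.5]
```

Because the decision is made per batch of pending pods rather than per pre-declared node group, the instance mix can follow the workload instead of the other way around.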

Seen in

  • sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters — scaling latency is named as one of four structural motivations for the 1,000-cluster migration; CA/ASG minutes → Karpenter seconds is reported as a headline outcome.
  • sources/2026-04-07-mongodb-predictive-auto-scaling-an-experimentmanaged-database-tier — instance of the same latency force: MongoDB Atlas's reactive auto-scaler pays "a few minutes of overload" detection + "several minutes" scaling operation on top of one-tier-at-a-time motion. Predictive auto-scaling hides the full scaling-latency sum from observed p99 by starting the op before load arrives. Sibling to the cluster-scheduler framing above: same force (detection + decision + provisioning + warmup), different layer, different remediation (predictive forecasting vs. per-pod node provisioning).