PATTERN Cited by 2 sources

Asymmetric autoscaling — aggressive up, conservative down

Pattern

When implementing reactive autoscaling for latency-sensitive, spiky workloads, deliberately make the scale-up policy more aggressive than the scale-down policy. Specifically:

  • Scale-up — short sustained-up window, fast provisioning, willing to over-provision briefly for tail-latency safety.
  • Scale-down — long sustained-down window, slow rate of removal, willing to keep extra capacity for stability.

The asymmetry breaks the symmetric-threshold flapping that causes latency spikes. The cost is a small amount of over-provisioning during ramps and decay periods, which for latency-sensitive workloads is usually cheaper than the latency violation an under-provisioned ramp would cause.

(Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)

The forcing condition

Symmetric autoscaling — same threshold and dwell-time on up and down — flaps under common workload patterns:

  1. Diurnal traffic with rapid ramps: per-pod load crosses the threshold on the way up; the autoscaler adds capacity; per-pod load briefly drops below the threshold (because the added capacity spreads the same traffic over more pods); the autoscaler removes capacity; per-pod load rises above the threshold again; the cycle repeats. Each cycle adds latency variance.
  2. Bursty / spiky traffic: short bursts trigger scale-up; capacity arrives just as the burst subsides; scale-down starts immediately; the next burst hits a fleet that has already been shrunk.

Both patterns map to flapping (see concepts/anti-flapping). Flapping causes latency spikes because each scale-down event removes warm capacity that the next ramp has to re-warm.

The asymmetric fix breaks the loop: scale-up is fast enough to absorb the next ramp; scale-down is slow enough that a brief quiet period doesn't trigger removal.
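The loop described above can be reproduced in a few lines. The sketch below is a deliberately simplified model, not the Databricks autoscaler: the load series, the per-pod target of 10, and the function names are made up for illustration. It compares a policy that scales down as soon as fewer replicas would suffice (down_window=1) with one that waits for five consecutive quiet ticks.

```python
import math

def count_direction_changes(replica_series):
    """Count how often scaling reverses direction (up then down, or down then up)."""
    changes, last_dir = 0, 0
    for prev, cur in zip(replica_series, replica_series[1:]):
        direction = (cur > prev) - (cur < prev)
        if direction != 0:
            if last_dir != 0 and direction != last_dir:
                changes += 1
            last_dir = direction
    return changes

def simulate(load, per_pod_target, down_window):
    """Scale up immediately; scale down only after `down_window` consecutive
    ticks in which fewer replicas would suffice. down_window=1 is symmetric."""
    replicas = 1
    below_ticks = 0
    series = []
    for total_concurrency in load:
        desired = max(1, math.ceil(total_concurrency / per_pod_target))
        if desired > replicas:
            replicas = desired          # aggressive: jump straight to desired
            below_ticks = 0
        elif desired < replicas:
            below_ticks += 1            # conservative: wait out the window
            if below_ticks >= down_window:
                replicas = desired
                below_ticks = 0
        else:
            below_ticks = 0
        series.append(replicas)
    return series

# Hypothetical bursty load: total in-flight requests per tick.
load = [40, 120, 40, 120, 40, 120, 40, 120, 40, 40, 40, 40, 40, 40]
symmetric = simulate(load, per_pod_target=10, down_window=1)
asymmetric = simulate(load, per_pod_target=10, down_window=5)
symmetric_reversals = count_direction_changes(symmetric)
asymmetric_reversals = count_direction_changes(asymmetric)
```

In this toy run the symmetric policy reverses direction on every burst, while the asymmetric policy holds the burst-sized fleet until the quiet period has lasted the full window and then scales down once.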

Canonical wiki disclosure

The 2026-05-08 Databricks Model Serving / Superhuman post is the wiki's canonical disclosure of asymmetric autoscaling for production GPU inference at 200K+ QPS:

"To keep the platform cost-optimal for variable traffic patterns, the system autoscales dynamically with customer demand. The autoscaler tracks request_concurrency averaged across pods, with per-pod concurrency targets derived from benchmarking maximum sustainable RPS per replica. The scaling strategy is intentionally asymmetric: scale-up is aggressive and responsive, while scale-down is conservative, to prevent the flapping that causes latency spikes."

"Through joint shadow testing between Superhuman and Databricks, we caught edge cases and fixed issues when tuning parameters on autoscaler, including when to scale aggressively, when to hold steady, and how conservative to be on scale-down."

The traffic shape disclosed is strong diurnal patterns with rapid ramps in certain periods, often exceeding 200k QPS — the exact regime the pattern is designed for.
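A minimal sketch of how a per-pod concurrency target and a replica count could follow from the description quoted above. The post does not say how the target is derived from the benchmark; using Little's law with a headroom factor is an assumption here, as are the function names and all numbers.

```python
import math

def per_pod_concurrency_target(max_sustainable_rps, mean_latency_s, headroom=0.8):
    """Little's law: in-flight requests = arrival rate * time in system.
    `headroom` (< 1) is an assumed safety margin below the benchmarked maximum."""
    return max_sustainable_rps * mean_latency_s * headroom

def desired_replicas(avg_concurrency_per_pod, current_replicas, target_per_pod):
    """Replica count that would bring average per-pod concurrency back to target."""
    total_in_flight = avg_concurrency_per_pod * current_replicas
    return max(1, math.ceil(total_in_flight / target_per_pod))

# Hypothetical benchmark result: 500 RPS per replica at 40 ms mean latency.
target = per_pod_concurrency_target(max_sustainable_rps=500, mean_latency_s=0.04)

# Hypothetical observation: 100 pods, each averaging 24 in-flight requests.
replicas_needed = desired_replicas(24, 100, target)
```

Averaging the metric across pods means a single hot pod does not trigger scaling on its own; only a fleet-wide rise in concurrency does.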

Required substrate

Aggressive scale-up is only credible if the time to add a pod is small. The Superhuman/Databricks deployment couples asymmetric autoscaling with lazy-loading container images that cut pod start from minutes to seconds. Without the substrate, "aggressive" is wishful — the container runtime is the rate-limiter, not the autoscaler.

The general dependency: aggressive scale-up requires a low-cost fast-start primitive at every layer of the pod-bringup stack.
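The dependency can be made concrete with rough arithmetic. The sketch below assumes a simple additive model (detection window, plus one decision interval, plus pod start); the model, the function name, and all numbers are illustrative and not taken from the post.

```python
def seconds_until_capacity(sustained_up_window_s, decision_interval_s, pod_start_s):
    """Rough upper bound on the delay between a ramp starting and new pods serving."""
    return sustained_up_window_s + decision_interval_s + pod_start_s

# Hypothetical numbers: identical autoscaler settings, two pod-start times.
slow_start = seconds_until_capacity(20, 5, pod_start_s=180)  # e.g. full image pull
fast_start = seconds_until_capacity(20, 5, pod_start_s=10)   # e.g. lazy-loaded image

# Share of the total delay that the autoscaler cannot influence at all.
pod_start_share_slow = 180 / slow_start
pod_start_share_fast = 10 / fast_start
```

With the slow start, most of the delay sits in pod bring-up, so shortening the autoscaler's windows further would change little.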

Operational shape

   request_concurrency (avg across pods)

                 ▲  scale-up zone (above target):
                 │  short window, add fast
   ──────────────┴──── target ────────────────────────────────
                    hysteresis band: hold steady
   ──────────────┬──── target_with_scale_down_hysteresis ─────
                 │  scale-down zone (below lower threshold):
                 ▼  long window, remove slowly, cap on rate

The gap between the two thresholds (hysteresis_band) is what prevents flapping. Width is tuned to the workload's noise floor.
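The three zones can be expressed as a small classifier. This is an illustrative sketch: the function name, the target of 16, and the band of 4 are assumptions, not disclosed values.

```python
def classify(avg_concurrency, target, hysteresis_band):
    """Map the averaged metric to a zone. Inside the band, nothing happens."""
    lower_threshold = target - hysteresis_band
    if avg_concurrency > target:
        return "scale_up"
    if avg_concurrency < lower_threshold:
        return "scale_down"
    return "hold"

# A noisy signal hovering just under the target stays in the hold zone,
# so it cannot trigger a scale-down that the next ramp would have to undo.
noisy_signal = [15, 13, 16, 14, 12.5, 15.5]
zones = [classify(v, target=16, hysteresis_band=4) for v in noisy_signal]
```

A band narrower than the signal's normal jitter would let that jitter cross the lower threshold, which is why the width is tuned to the noise floor.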

Tunable parameters

The parameters tuned in the Superhuman shadow-testing iteration:

  • Sustained-up window (seconds). How long average concurrency must exceed the target before scale-up triggers.
  • Sustained-down window (minutes). How long average concurrency must sit below the lower threshold before scale-down triggers.
  • Scale-up rate — pods added per minute. Aggressive: add many at once; conservative: add one at a time.
  • Scale-down rate — pods removed per minute. Always more conservative than scale-up rate.
  • Hysteresis band — gap between the up and down thresholds.
  • Floor / ceiling — minimum and maximum replica counts.

The Superhuman post does not disclose specific numerical values; the post emphasises that shadow-testing between the two teams was the mechanism for finding them.
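The parameters above can be combined into one small controller. This is a sketch under stated assumptions, not the production autoscaler: the class and field names are hypothetical, every numeric value is a placeholder, and the rate caps are applied per decision call rather than per minute to keep the example short.

```python
import math
from dataclasses import dataclass

@dataclass
class AsymmetricPolicy:
    target: float            # per-pod concurrency target (upper threshold)
    hysteresis_band: float   # gap between the up and down thresholds
    sustained_up_s: float    # short window before scale-up
    sustained_down_s: float  # long window before scale-down
    max_up_step: int         # most pods added in one decision
    max_down_step: int       # most pods removed in one decision
    min_replicas: int        # floor
    max_replicas: int        # ceiling

class Autoscaler:
    def __init__(self, policy):
        self.p = policy
        self.above_since = None  # when the metric first exceeded the target
        self.below_since = None  # when it first dropped under the lower threshold

    def step(self, now_s, avg_concurrency, replicas):
        """Return the replica count to run after this decision."""
        p = self.p
        ideal = math.ceil(avg_concurrency * replicas / p.target)
        if avg_concurrency > p.target:
            self.below_since = None
            if self.above_since is None:
                self.above_since = now_s
            if now_s - self.above_since >= p.sustained_up_s:
                replicas = min(ideal, replicas + p.max_up_step)
        elif avg_concurrency < p.target - p.hysteresis_band:
            self.above_since = None
            if self.below_since is None:
                self.below_since = now_s
            if now_s - self.below_since >= p.sustained_down_s:
                replicas = max(ideal, replicas - p.max_down_step)
        else:
            # Inside the hysteresis band: hold steady and reset both timers.
            self.above_since = self.below_since = None
        return max(p.min_replicas, min(p.max_replicas, replicas))

# Placeholder values only; the post discloses none of these.
policy = AsymmetricPolicy(target=16.0, hysteresis_band=4.0,
                          sustained_up_s=10, sustained_down_s=300,
                          max_up_step=20, max_down_step=2,
                          min_replicas=2, max_replicas=200)
scaler = Autoscaler(policy)
r0 = scaler.step(0, 24, 10)      # above target, window not yet satisfied
r1 = scaler.step(10, 24, r0)     # sustained for 10 s: scale up
r2 = scaler.step(15, 14, r1)     # inside the band: hold
r3 = scaler.step(20, 8, r2)      # below lower threshold, timer starts
r4 = scaler.step(200, 8, r3)     # 180 s below: still holding
r5 = scaler.step(320, 8, r4)     # 300 s below: remove at most 2 pods
```

Scale-up jumps straight to the computed replica count (bounded by the step cap), while scale-down removes only a couple of pods per decision even when the computed count is far lower.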

When to use

  • Latency-sensitive workloads with strict SLOs (sub-second p99 in this case) where flapping-induced spikes matter.
  • Spiky / diurnal traffic where ramp shape is unpredictable.
  • Inference and other GPU-bound workloads where pod start cost is high enough that flapping wastes warm capacity.
  • Cases where slight over-provisioning is cheap relative to the cost of an SLO breach.

When not to use

  • Latency-tolerant batch workloads where flapping does not cause user-visible harm.
  • Capacity-cost-dominated workloads where over-provisioning is more expensive than latency variance — symmetric scaling may be cheaper.
  • Predictable traffic patterns that admit predictive scaling — forecast-driven schedules can dispense with reactive thresholds altogether.

Failure modes

  • Scale-up too aggressive → over-provisioning during transient spikes. Mitigation: cap scale-up rate; combine with predictive-scale-up for known patterns.
  • Scale-down too conservative → persistent over-provisioning during quiet periods. Mitigation: time-of-day floor adjustments; cap how long surplus capacity may be held before scale-down is forced.
  • Dependent fast-start substrate breaks (slow image pull, GPU-init regression) → "aggressive" scale-up becomes meaningless. Mitigation: monitor pod-start time as a first-class SLI alongside request latency.
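The last mitigation, treating pod-start time as an SLI, could look roughly like the check below. The nearest-rank percentile, the 15-second budget, the sample data, and the function names are all assumptions made for illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]

def pod_start_sli_breached(pod_start_seconds, budget_s, pct=95):
    """True when the chosen percentile of pod-start time exceeds the budget."""
    return percentile(pod_start_seconds, pct) > budget_s

# Hypothetical samples: nine healthy starts and one slow image pull.
samples = [6, 7, 8, 7, 9, 6, 8, 7, 45, 7]
median_start = percentile(samples, 50)
p95_start = percentile(samples, 95)
breached = pod_start_sli_breached(samples, budget_s=15)
```

The median hides the regression here; a tail percentile surfaces it, which matters because the slow start is the one a ramp will hit.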

Seen in

  • sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — canonical wiki instance at 200K+ QPS GPU inference altitude on Databricks Model Serving. Tracks request_concurrency averaged across pods with per-pod target derived from benchmarking; "intentionally asymmetric: scale-up is aggressive and responsive, while scale-down is conservative, to prevent the flapping that causes latency spikes." Tuned via joint shadow testing with the customer.
  • sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model — second canonical instance at the Custom Model Serving / heterogeneous-model altitude. Discloses specific timing parameters: horizontal scale-up decides every 5 seconds based on 20 seconds of traffic; scale-down considers ~5 minutes. Vertical concurrency is also asymmetric: quick to reduce concurrency under stress, slow to increase. "The cost of premature scale-down (a cold start at the worst possible moment) outweighs the cost of keeping a few idle replicas temporarily." Can go 10 → 10K QPS in <60 seconds; customers reported up to 5× reduction in queueing and 429s with this policy.
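The timings disclosed in the second source (a decision every 5 seconds, 20 seconds of traffic for scale-up, roughly 5 minutes for scale-down) suggest two lookback windows over one metric stream. The sketch below is one possible reading: using the mean for the short window and the peak for the long window is an assumption, as are the class name and the target of 16.

```python
from collections import deque

UP_LOOKBACK_S = 20        # from the source: scale-up looks at 20 s of traffic
DOWN_LOOKBACK_S = 300     # from the source: scale-down considers ~5 minutes
DECISION_INTERVAL_S = 5   # from the source: a decision every 5 seconds

class DualWindowSignal:
    def __init__(self):
        self.samples = deque()  # (timestamp_s, avg_concurrency)

    def record(self, now_s, value):
        self.samples.append((now_s, value))
        # Drop samples older than the long window.
        while self.samples and self.samples[0][0] <= now_s - DOWN_LOOKBACK_S:
            self.samples.popleft()

    def decide(self, now_s, target):
        recent = [v for t, v in self.samples if t > now_s - UP_LOOKBACK_S]
        if not recent:
            return "hold"
        if sum(recent) / len(recent) > target:     # short window, mean (assumed)
            return "up"
        if max(v for _, v in self.samples) < target:  # long window, peak (assumed)
            return "down"
        return "hold"

# Hypothetical trace: quiet traffic with one 20-second burst starting at t=100.
signal = DualWindowSignal()
decisions = {}
for t in range(0, 425, DECISION_INTERVAL_S):
    value = 30 if 100 <= t <= 115 else 8
    signal.record(t, value)
    decisions[t] = signal.decide(t, target=16)
```

In this trace the burst triggers "up" within the 20-second window, and "down" is withheld until the burst has aged out of the 5-minute window.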

Caveats

  • The Superhuman post does not disclose specific numeric parameters (window lengths, scale rates, hysteresis band).
  • Aggressive scale-up consumes cluster capacity quickly — the cluster autoscaler under it must also be aggressive, or pod-pending queues form.
  • The pattern's payoff depends on how quickly added capacity becomes useful; for workloads where weight loading dominates pod-bringup (large foundation models), additional substrate beyond image lazy-loading is needed.