
PATTERN Cited by 1 source

Scheduled cron-based scaling

Intent

Scale a workload's capacity up and down on a clock, not on a signal, using Kubernetes CronJobs (or external schedulers) to mutate the workload's replica count on a fixed daily / weekly / hourly schedule. The pattern replaces or sits beside reactive autoscalers (HPA, KEDA, custom autoscalers) for workloads whose load shape is predictable in time.

When it fits

  • Traffic has a strong daily / weekly periodicity. E-commerce morning peaks, news sites' weekday 9am spikes, batch-processing overnight windows, enterprise weekday-only usage.
  • Provisioning latency is significant relative to traffic ramp — scaling out a stateful workload (Elasticsearch, Cassandra, Kafka) can take minutes or tens of minutes; reactive autoscaling lags load. Pre-scaling eliminates the lag.
  • Cost pressure is on the off-peak side — paying for peak capacity 24/7 is wasteful; scaling down during off-hours returns real dollars.
  • The workload has a clear low-point schedule — nightly, weekends — when reduced capacity is actually acceptable for latency/availability SLOs.

Shape

The moving parts:

  • CronJob per direction per workload. Typically at minimum two per workload: scale-up at the start of peak hours, scale-down at the end. Real deployments accumulate more (mid-morning scale-out for a deeper peak, experimental-workload scale-down for off-hours).
  • CronJob action: mutate the declarative spec. kubectl patch deployment ... -p '{"spec":{"replicas":N}}' or kubectl patch <crd> ... -p '{"spec":{"replicas":N}}'. The CronJob does not directly modify pods — it modifies the shape the controller is reconciling against.
  • Operator / controller reconciles. For stateless workloads, the Deployment controller handles the diff. For stateful workloads (e.g. es-operator's ElasticsearchDataSet (EDS), or other custom CRDs) a custom operator handles it.
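
Wired together, one direction of the pattern looks roughly like the CronJob below. This is a sketch, not a recipe from the source: the names, schedule, image, and replica count are hypothetical, and RBAC for the scaler service account is assumed to exist.

```shell
# Hypothetical scale-up CronJob for a stateless Deployment.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: myapp-scale-up            # hypothetical name
spec:
  schedule: "0 7 * * 1-5"         # 07:00 on weekdays, ahead of the morning peak
  concurrencyPolicy: Forbid       # never run two copies of this job at once
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # assumed to have patch rights on deployments
          restartPolicy: Never
          containers:
          - name: scale
            image: bitnami/kubectl
            command:
            - kubectl
            - patch
            - deployment/myapp
            - -p
            - '{"spec":{"replicas":12}}'
EOF
```

The mirror-image scale-down job differs only in schedule and target replica count. Note that concurrencyPolicy only prevents one CronJob from overlapping itself; it does nothing about two different CronJobs racing, which is exactly failure mode 2 below.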

Canonical wiki instance

Zalando Lounge's Elasticsearch cluster, from the 2024-06-20 post:

"our business model is such that we receive about three times the normal load during the busy hour in the morning and therefore we use schedules to automatically scale in and out applications to handle that peak [...] For us, the schedule based scaling is implemented by a fairly complex set of cronjobs that change the number of nodes by manipulating the EDS for our cluster. There's separate cronjobs for scaling up at various times of day and scaling down at other times of day."

The workload: Lounge runs 6 Elasticsearch pods at night, 7+ in the morning; the morning peak is 3× baseline.

Failure modes uncovered at Zalando Lounge

The 2024-06-20 post is valuable precisely because it catalogues four failure modes of the pattern on a stateful zone-aware workload. All four were exposed by a K8s 1.28 upgrade that perturbed pod-to-zone placement:

  1. Per-zone floor not encoded. The nightly scale-in floor (6 pods) was sized globally, not per-zone. Under some pod-to-zone distributions, the highest-ordinal pod (next to drain) was alone in its zone, triggering concepts/zone-aware-shard-allocation-stuck-drain. Proper floor: max(global_floor, per_zone_floor * num_zones).
  2. Conflicting CronJobs race. The stuck nightly scale-in was still retrying when the morning scale-out CronJob fired, producing two in-flight EDS updates. This exposed a context-cancellation bug in es-operator.
  3. Organizational drift. The post-incident "quick fix" touched the main scale-down cronjob but missed a separate experimental cronjob with its own schedule, producing a third morning alert. "The quick fix we did the day before only touched the major nightly scale down job, but ignored another one related to a recent experimental project. It was a trivial mistake, but enough to cause a bit of organisational hassle." (Source: same.)
  4. Drift sensitivity. Ephemeral storage + scheduler re-spread meant that the pod-to-zone distribution assumed by the scaling plan could be invalidated by routine K8s infrastructure work. See concepts/ephemeral-storage-cross-zone-drift.
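
The floor rule from failure mode 1 is simple enough to sketch directly; the numbers below are illustrative, not Lounge's actual configuration.

```shell
# Failure mode 1 fix: the scale-in floor must respect zones, not just
# global capacity. Values here are made up for illustration.
global_floor=6        # minimum cluster-wide pod count for capacity
per_zone_floor=3      # minimum pods per zone so no zone is ever emptied
num_zones=3

zone_total=$(( per_zone_floor * num_zones ))
# floor = max(global_floor, per_zone_floor * num_zones)
floor=$(( zone_total > global_floor ? zone_total : global_floor ))
echo "scale-in must stop at $floor pods"   # prints: scale-in must stop at 9 pods
```

With a purely global floor of 6, an unlucky pod-to-zone distribution can leave the highest-ordinal pod alone in its zone, which is the stuck-drain trigger.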

Implementation constraints

  • Enumerate all schedules centrally. A registry of "what CronJobs mutate this CRD" prevents the "we forgot the experimental one" failure mode. At minimum a comment in the CRD's owner README; ideally a tool that lists them.
  • Encode invariants at the planning layer. For zone-aware workloads: per-zone floors, not only global floors. For leader-follower workloads: pin-the-leader floors. For shared-state workloads: quorum-preserving floors.
  • Alert on scale-stuck, not only on "too few pods." The symptom alert in the Lounge incident ("too few running Elasticsearch nodes") fired late, after the drain had been spinning for hours. A duration-based "drain in progress > T minutes" alert catches the root cause sooner.
  • CronJobs must be idempotent. Running the same scale-to-N job twice must leave the same state as running it once. Declarative patches give this for free; imperative app-level scripts ("add two nodes") are easy to get wrong.
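
The idempotency constraint can be made concrete with a toy sketch (function names are hypothetical): a declarative "scale to N" converges under retries, while an imperative delta compounds — and retries are exactly what the stuck-drain-plus-morning-job race in failure mode 2 produces.

```shell
# Toy model of the two scaling styles under a retried job.
replicas=6

scale_to() { replicas=$1; }                      # declarative: set the target
scale_by() { replicas=$(( replicas + $1 )); }    # imperative: apply a delta

scale_to 12; scale_to 12    # a retried declarative job is harmless
echo "$replicas"            # prints 12

scale_by 6; scale_by 6      # a retried imperative job doubles the change
echo "$replicas"            # prints 24
```

kubectl patch against a Deployment or CRD spec is the declarative style, which is why the pattern's canonical action is "patch the spec," not "exec into the workload and add nodes."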

Contrast with reactive autoscaling

  • HPA / KEDA react to a metric signal (CPU, queue depth, custom metric). They are the right tool when load is unpredictable, but they lag scheduled scaling on sharp periodic ramps. Many production setups combine both: the schedule sets the HPA minimum (pre-scaled baseline), and the HPA handles unplanned excursions above it.
  • Predictive autoscaling (AWS, Google) uses historical metrics to learn the schedule rather than requiring a human to encode it. It works when traffic is genuinely periodic and well represented in the history; it fails on novel events.
  • Pure cron-based scaling (this page) is the simplest shape when the periodicity is already known. Its weakness is that the human-written schedule is the ground truth: only a human will notice drift between real traffic and the scheduled capacity.
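
The hybrid mentioned in the first bullet can be sketched as two scheduled patches against the HPA rather than the Deployment (resource name and times are hypothetical): the cron schedule moves the floor, and the HPA still absorbs unplanned load above it.

```shell
# Run from the scale-up CronJob, e.g. 07:00 weekdays, ahead of the peak:
kubectl patch hpa myapp -p '{"spec":{"minReplicas":12}}'

# Run from the scale-down CronJob, e.g. 20:00 daily, entering off-peak:
kubectl patch hpa myapp -p '{"spec":{"minReplicas":4}}'
```

This keeps the schedule's job small (move a floor) and leaves the reactive controller in charge of the actual replica count.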

Seen in

  • sources/2024-06-20-zalando-failing-to-auto-scale-elasticsearch-in-kubernetes — canonical wiki instance. Zalando Lounge runs cron-based scaling against an EDS for a zone-aware Elasticsearch cluster; four failure modes catalogued across three consecutive morning incidents; closing lesson "Read the code" emphasises that the pattern's operator substrate (es-operator's drain code) must hold up under the concurrent-spec-change edge cases the pattern generates.
  • patterns/scheduled-cron-triggered-load-test — sibling pattern: Kubernetes CronJob triggers recurring load tests rather than capacity changes. Same substrate primitive (CronJob), different consumer.