

Online scheduling

Online scheduling is the problem class where "jobs arrive dynamically and the scheduler must make immediate, irrevocable decisions without knowing what jobs will arrive next" (Source: sources/2026-02-11-google-scheduling-in-a-changing-world-time-varying-capacity).

The "online" qualifier is the load-bearing distinction:

  • Offline — the full job list is known in advance; scheduling is a pure combinatorial-optimisation problem.
  • Online — jobs arrive one by one; the scheduler commits on each arrival without lookahead. Quality is measured via competitive ratio against the offline optimum.

This is the regime every production cluster scheduler operates in. Borg, Kubernetes, Mesos, Nomad, YARN, Slurm — each makes placement decisions as jobs arrive, without full knowledge of future submissions.

Three preemption regimes

The 2026-02-11 Google Research post structures online scheduling around three preemption policies whose competitive-ratio landscapes are sharply separated (Source: sources/2026-02-11-google-scheduling-in-a-changing-world-time-varying-capacity):

  • Non-preemptive — once a job starts, it runs to completion. Competitive ratio approaches zero against an adversary that feeds a long job first then a burst of shorts. "A single bad decision of scheduling a long job can ruin the possibility of scheduling many future smaller jobs."
  • Interrupt with restart — a running job can be preempted; partial work is lost but the job stays in the queue. "Only jobs restarted and later completed non-preemptively count as successful." Highly beneficial — the [[patterns/earliest-finish-job-greedy|earliest-finish greedy]] closes the gap to the offline ½-competitive bound.
  • Interrupt without restart — a running job can be preempted; partial work and the job itself are discarded. Competitive ratio approaches zero again in general; becomes constant-competitive under common deadlines via a tentative-schedule revision algorithm.
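The restart-model greedy can be sketched in a few lines. This is a toy single-machine illustration under assumptions not in the post (one job runs at a time, preempted jobs do not re-enter the queue), not the paper's exact algorithm:

```python
def earliest_finish_greedy(arrivals, deadline):
    """Toy interrupt-with-restart scheduler on one machine.

    arrivals: list of (arrival_time, duration), sorted by arrival time.
    On each arrival, preempt the running job whenever the newcomer
    would finish earlier; the preempted job's partial work is lost.
    Returns the number of jobs completed by the common deadline.
    """
    running = None      # (finish_time, duration) of the job in progress
    completed = 0
    for t, d in arrivals:
        # Retire the running job if it has already finished.
        if running and running[0] <= t:
            completed += 1
            running = None
        if t + d > deadline:
            continue    # cannot finish by the common deadline
        if running is None or t + d < running[0]:
            running = (t + d, d)   # start, or preempt the slower job
    if running and running[0] <= deadline:
        completed += 1
    return completed
```

Note how a long job that arrives first is immediately preempted once a shorter job appears — exactly the recovery that the non-preemptive model forbids.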

The preemption-policy choice is the first-order architectural decision for an online scheduler — not a secondary optimisation. The competitive-ratio floor moves from zero to a constant (and back) based purely on which policy is in use.

Time-varying capacity

The Google 2026-02-11 post's full problem statement includes "time-varying capacity" — the number of jobs the scheduler can run concurrently varies over wall-clock time. This models real cluster behaviour: diurnal interactive load, spot-instance availability, hardware failures, and planned maintenance all continuously change the batch scheduler's usable capacity profile. The theoretical analysis works over this capacity profile rather than assuming static concurrency.
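As a rough illustration of what scheduling against a capacity profile means operationally, here is a toy feasibility check. The discretised integer time steps and the `capacity_profile` callable are assumptions made for this sketch, not constructs from the post:

```python
def fits(intervals, capacity_profile, start, duration):
    """Check whether one more job of `duration` starting at `start`
    fits under a time-varying capacity profile.

    intervals: list of (s, e) spans of already-scheduled jobs.
    capacity_profile: function t -> int, usable concurrency at time t.
    Checks at integer time steps for simplicity (a toy discretisation).
    """
    for t in range(start, start + duration):
        load = sum(1 for s, e in intervals if s <= t < e)
        if load + 1 > capacity_profile(t):
            return False
    return True
```

A step-function profile (say, capacity 2 during the interactive lull, 1 afterwards) is enough to make a job placement feasible at one start time and infeasible a few ticks later.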

The raw capture of the post focuses on the online-setting preemption analysis and doesn't develop the time-varying capacity dimension quantitatively; this page treats time-varying capacity as the context for the captured results rather than as an independently analysed dimension.

Relationship to online bin-packing

Online scheduling-over-time and online bin-packing-in-space are sibling sub-genres within the broader "commit under uncertainty about future arrivals" problem family:

  • Bin-packing — items arrive with sizes, must be placed into fixed-capacity bins; no time dimension. Google 2025-10-17 [[systems/lava-vm-scheduler|LAVA / NILAS / LARS]] is the canonical wiki instance of online bin-packing with learned-lifetime predictions.
  • Online scheduling (this page) — jobs arrive with durations, must be placed into a time schedule; explicit time dimension. Google 2026-02-11 is the canonical wiki instance of online throughput-maximising scheduling with competitive-ratio analysis.
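For contrast with the scheduling sketches above, the classic online bin-packing heuristic is first-fit — shown here as a baseline illustration, not LAVA's algorithm (which layers learned lifetime predictions on top):

```python
def first_fit(items, bin_capacity):
    """Classic online first-fit bin packing: place each arriving item
    into the first open bin with room, else open a new bin.
    No time dimension -- the key contrast with online scheduling.
    Returns the number of bins used.
    """
    bins = []  # remaining free capacity per open bin
    for size in items:
        for i, free in enumerate(bins):
            if size <= free:
                bins[i] = free - size
                break
        else:
            bins.append(bin_capacity - size)
    return len(bins)
```

The irrevocable-commitment structure is identical to online scheduling; only the committed resource differs (space in a bin vs. a slot in a time schedule).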

Together the two Google Research posts frame scheduling-under-future-uncertainty as a recurring problem shape with different primary primitives (ML-based lifetime prediction + continuous reprediction in LAVA; tentative-schedule revision + competitive-ratio analysis in 2026-02-11).

Production framing

The 2026-02-11 post motivates the common-deadline variant via "all data processing must finish by the nightly batch run" — the exact shape of a nightly ETL / batch-scheduling environment where every job must land before a shared wall-clock cutoff. The production implication: for batch clusters operating against a nightly deadline, interrupt-without-restart scheduling with a tentative-schedule revision algorithm is a theoretically principled design with a constant-competitive guarantee, unlike naive online scheduling variants whose worst case is unbounded.
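A toy sketch of the revision idea under a common deadline — emphatically not the post's constant-competitive algorithm, just the shape: keep a tentative job set, and discard the longest tentative job when a shorter newcomer displaces it. Elapsed time and partial execution are ignored for brevity:

```python
import heapq

def revise_schedule(arrivals, deadline):
    """Toy common-deadline revision heuristic (illustrating the
    revision *idea* only, not the paper's algorithm).

    Keep a tentative set of accepted jobs in a max-heap by duration.
    Accept a newcomer outright if total work still fits before the
    deadline; otherwise swap out the longest tentative job when the
    shorter newcomer frees enough room. Discarded jobs are gone for
    good, matching the interrupt-without-restart model.
    """
    accepted = []   # max-heap of durations (stored negated)
    used = 0
    for _t, d in arrivals:
        if used + d <= deadline:
            heapq.heappush(accepted, -d)
            used += d
        elif accepted and -accepted[0] > d:
            used += d - (-heapq.heappop(accepted))  # swap longest out
            heapq.heappush(accepted, -d)
    return len(accepted)
```

The swap is always safe on space: replacing a longer tentative job with a shorter one can only shrink the total work committed before the cutoff.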
