PATTERN Cited by 1 source
Interrupt and restart
Interrupt-and-restart is a preemption policy for scheduling: an active job can be interrupted at any point; the partial work done on the interrupted job is lost, but the job itself remains in the queue and can be retried later. "In this model, an online algorithm is allowed to interrupt a currently executing job. While the partial work already performed on the interrupted job is lost, the job itself remains in the system and can be retried" (Source: sources/2026-02-11-google-scheduling-in-a-changing-world-time-varying-capacity).
The pattern contrasts with three sibling preemption policies:
- Non-preemptive — once started, a job runs to completion. No interrupts. Simple but adversarially unwinnable for online throughput-maximisation (competitive ratio approaches zero).
- Interrupt-without-restart — an interrupted job is discarded entirely, not just its partial work. Strictly worse for online-throughput than interrupt-and-restart; in general, competitive ratio approaches zero, though it becomes constant-competitive under common deadlines.
- Checkpoint-and-resume — partial work is preserved; interrupting and resuming later continues where the job left off. Strictly more powerful than interrupt-and-restart but requires the runtime substrate to actually checkpoint application state, which is often not feasible.
Interrupt-and-restart is the middle preemption policy: weaker than checkpoint-and-resume (partial work is lost), stronger than interrupt-without-restart (the job survives). The load-bearing property is that the scheduler can walk back a bad commitment without losing the job permanently.
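The four policies differ only in what happens to a job at the moment it is interrupted. The sketch below is one way to model that distinction; the `Policy`, `Job`, and `interrupt` names are hypothetical and do not come from the cited source or from any scheduler's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Policy(Enum):
    NON_PREEMPTIVE = "non-preemptive"
    INTERRUPT_AND_RESTART = "interrupt-and-restart"
    INTERRUPT_WITHOUT_RESTART = "interrupt-without-restart"
    CHECKPOINT_AND_RESUME = "checkpoint-and-resume"


@dataclass
class Job:
    name: str
    length: int    # total work units required
    done: int = 0  # work units completed so far


def interrupt(job: Job, policy: Policy) -> Optional[Job]:
    """Return the job as it re-enters the queue, or None if it is discarded."""
    if policy is Policy.NON_PREEMPTIVE:
        raise ValueError("a non-preemptive job cannot be interrupted")
    if policy is Policy.INTERRUPT_WITHOUT_RESTART:
        return None  # the job itself is lost
    if policy is Policy.INTERRUPT_AND_RESTART:
        return Job(job.name, job.length, done=0)  # job survives, progress does not
    return Job(job.name, job.length, done=job.done)  # checkpoint: progress kept


half_done = Job("etl", length=10, done=5)
requeued = interrupt(half_done, Policy.INTERRUPT_AND_RESTART)
```

Here `requeued` is still the `etl` job with all 10 units outstanding, which is the "job survives, work is lost" property in a single value.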
Why it's "highly beneficial"
The 2026-02-11 Google Research paper's result on interrupt-and-restart: "We found that the flexibility provided by allowing job restarts is highly beneficial. A variant of Greedy that iteratively schedules the job that finishes earliest continues to achieve a 1/2-competitive ratio, matching the result in the offline setting" (Source: sources/2026-02-11-google-scheduling-in-a-changing-world-time-varying-capacity).
The mechanism: under non-preemptive scheduling, an adversary can submit a long job first; the scheduler commits, and a subsequent burst of short jobs can't be scheduled before the long job completes. Under interrupt-and-restart, the scheduler can interrupt the long job when the shorts arrive, process the shorts, and then return to the long job from scratch. The cost is that the long job's prior work is wasted; the gain is that arbitrarily many short jobs become schedulable instead of being starved.
The specific algorithm that achieves the ½-competitive bound under this pattern is the earliest-finish-job greedy: always run whichever queued job will finish earliest. If a newly arrived job would finish before the current job does, interrupt and switch. Because restarts are allowed, the switched-out job is not lost; it rejoins the queue and is re-evaluated when the new job completes.
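The long-job-then-shorts scenario and the earliest-finish rule can be sketched as a small simulation. This is a toy model under stated assumptions, not the paper's algorithm: it uses a single machine with constant capacity, integer time steps, and per-job deadlines, and the `Job` and `simulate` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    release: int   # first time step at which the job may run
    length: int    # processing time on the single machine
    deadline: int  # the job must finish by this time step


def simulate(jobs, allow_restart, horizon=40):
    """Return (completed job names, time units of discarded work)."""
    pending, completed = [], []
    running, started = None, 0
    wasted = 0
    for t in range(horizon):
        if running is not None and t == started + running.length:
            completed.append(running.name)
            running = None
        pending.extend(j for j in jobs if j.release == t)
        # Drop queued jobs that can no longer meet their deadline.
        pending = [j for j in pending if t + j.length <= j.deadline]
        if not pending:
            continue
        # Every candidate would start now, so earliest finish = smallest t + length.
        best = min(pending, key=lambda j: (t + j.length, j.name))
        if running is None:
            pending.remove(best)
            running, started = best, t
        elif allow_restart and t + best.length < started + running.length:
            wasted += t - started    # partial work is lost
            pending.append(running)  # but the job stays in the system
            pending.remove(best)
            running, started = best, t
    return completed, wasted


# Assumed adversarial instance: one long job, then five tight-deadline shorts.
jobs = [Job("long", 0, 10, 20)] + [Job(f"short{i}", i, 1, i + 1) for i in range(1, 6)]
print(simulate(jobs, allow_restart=False))
print(simulate(jobs, allow_restart=True))
```

On this instance the non-preemptive run completes only `long`, while the restart-enabled run completes all five shorts and then `long`, at the cost of one discarded time unit. Adding more short jobs widens the gap, which is the intuition behind the non-preemptive ratio approaching zero.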
Contexts where the pattern applies
- Batch cluster schedulers. Borg, YARN, Slurm, Nomad — all support job preemption where the partial work is typically discarded and the job is either re-queued (for BestEffort / Batch priority) or failed (for Guaranteed). The re-queue path is this pattern.
- Kubernetes preemption. Pod preemption (preemptionPolicy: PreemptLowerPriority) has interrupt-and-restart semantics: the evicted pod generally restarts from scratch. StatefulSet pods with persistent-volume attachments may partially preserve state (checkpoint-like), but the default behaviour for stateless workloads is interrupt-and-restart.
- Spot/preemptible VM workloads. Spot instances are interrupted on short notice; workloads designed for spot must be restartable from scratch (interrupt-and-restart) or checkpointed (checkpoint-and-resume). The interrupt-and-restart variant is simpler operationally and dominates in practice for stateless data-processing jobs.
- MapReduce / Spark task retries. A task that fails (for any reason, including preemption) is re-executed from scratch by the framework. This is interrupt-and-restart at the task level; framework-level job state is preserved separately.
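What these contexts share is a retry loop around a task whose intermediate state lives only inside the attempt. The sketch below illustrates that shape; `Preempted`, `run_with_restart`, and `word_count` are hypothetical names, and the code does not reproduce the retry API of Kubernetes, Spark, or any other system named above.

```python
class Preempted(Exception):
    """Hypothetical signal that the runtime evicted the task mid-run."""


def run_with_restart(task, max_attempts=3):
    """Re-run task from scratch after each preemption; return (result, attempts)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt
        except Preempted:
            continue  # partial work lived only inside task(); nothing to restore
    raise RuntimeError(f"task preempted {max_attempts} times; giving up")


def word_count(attempt, lines=("a b", "c d e", "f"), evict_on=(1, 2)):
    total = 0  # local state: rebuilt from zero on every attempt
    for i, line in enumerate(lines):
        if attempt in evict_on and i == 1:
            raise Preempted()  # simulated eviction partway through the input
        total += len(line.split())
    return total


result, attempts = run_with_restart(word_count)
```

Because `total` is local to each attempt, an eviction cannot leave behind a half-updated count. That is what makes restart-from-scratch safe here without any checkpointing support from the runtime.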
Pattern trade-off summary
| Preemption policy | Work preservation | Online-scheduling competitive ratio | Typical production use |
|---|---|---|---|
| Non-preemptive | Full | Approaches 0 (adversarial) | Hard-real-time; small jobs only |
| Interrupt-and-restart | None (work lost) | ½ (matches offline bound) | Batch clusters; spot instances; stateless workloads |
| Interrupt-without-restart | None (and job lost) | Approaches 0 in general; constant under common deadlines | Strict deadline-bound workloads |
| Checkpoint-and-resume | Full | Matches or exceeds offline | Stateful pipelines; long-running ML training |
Seen in
- sources/2026-02-11-google-scheduling-in-a-changing-world-time-varying-capacity: interrupt-and-restart preemption achieves the ½-competitive online-scheduling bound via the [[patterns/earliest-finish-job-greedy|earliest-finish-job greedy]], matching the offline optimum up to a factor of 2.