

Borg

Borg is Google's large-scale cluster-management system — the platform that packs jobs onto machines across Google's fleet, enforcing isolation, reclaiming resources, restarting failed tasks, and exposing a uniform scheduling surface to Gmail, YouTube, Maps, Search, and every other Google service. The foundational 2015 EuroSys paper ("Large-scale cluster management at Google with Borg") is the canonical public reference; Kubernetes is its open-source descendant.

What shows up on this wiki

Borg appears in the wiki so far as the production target of Google Research's Regression Language Model work (2025-07-29), where text-to-text regression is used to predict Borg's own scheduler-efficiency metric without running the expensive combinatorial solver.

MIPS per GCU — the efficiency metric

The 2025-07-29 post frames MIPS per GCU (Millions of Instructions Per Second per Google Compute Unit) as the "key efficiency metric" Borg uses to judge whether a proposed allocation is a good one. GCU is Google's internal fleet-normalised unit of compute; MIPS-per-GCU is effectively useful-work-produced per unit-of-compute-spent.
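As a ratio, the metric is simple to state. A minimal sketch, with hypothetical numbers (GCU's actual normalisation is internal to Google):

```python
# Illustrative only: MIPS-per-GCU as a work-per-compute ratio.
# The inputs here are made-up numbers, not real fleet figures.

def mips_per_gcu(total_mips: float, total_gcus: float) -> float:
    """Useful work produced per unit of fleet-normalised compute spent."""
    if total_gcus <= 0:
        raise ValueError("GCU total must be positive")
    return total_mips / total_gcus

# A proposed allocation that produces more instructions on the same
# compute footprint scores higher:
baseline = mips_per_gcu(total_mips=9_000.0, total_gcus=100.0)   # 90.0
candidate = mips_per_gcu(total_mips=9_600.0, total_gcus=100.0)  # 96.0
assert candidate > baseline
```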

Accurate MIPS-per-GCU forecasting matters because it lets Borg evaluate candidate allocations without running the expensive combinatorial bin-packing solver for each one.

The Borg digital twin

Google operates a digital twin of Borg — a backtesting framework that replicates the state of real-world clusters for counterfactual evaluation. The 2025-07-29 post names this digital twin as:

  • The training-data source for the RLM: synthesised (x = cluster-state-as-string, y = MIPS-per-GCU-from-bin-packer) pairs come out of running the bin-packing algorithm inside the twin.
  • The ground-truth fallback implied by the fast-path/slow-path deployment: when the RLM reports high uncertainty, the slow bin-packing simulation is the authoritative answer.

The digital twin itself is not described architecturally in the 2025-07-29 post — only named and used.
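The fast-path/slow-path deployment the post implies can be sketched as a dispatch on predictor uncertainty. This is a hedged sketch, not Google's implementation: `rlm_predict` and `run_bin_packing_simulation` are hypothetical stand-ins, and the threshold is invented.

```python
# Fast-path / slow-path pattern: query the cheap RLM first, and fall back
# to the authoritative bin-packing simulation when the model reports high
# uncertainty. Both callables are hypothetical stand-ins.
from typing import Callable, Tuple

def predict_mips_per_gcu(
    cluster_state: str,
    rlm_predict: Callable[[str], Tuple[float, float]],   # -> (estimate, uncertainty)
    run_bin_packing_simulation: Callable[[str], float],  # slow, authoritative
    uncertainty_threshold: float = 0.1,                  # assumed value
) -> float:
    estimate, uncertainty = rlm_predict(cluster_state)   # fast path
    if uncertainty > uncertainty_threshold:
        return run_bin_packing_simulation(cluster_state) # slow path: ground truth
    return estimate

# Usage with toy stand-ins:
confident = lambda s: (42.0, 0.02)
unsure = lambda s: (42.0, 0.5)
simulator = lambda s: 40.0
assert predict_mips_per_gcu("state", confident, simulator) == 42.0
assert predict_mips_per_gcu("state", unsure, simulator) == 40.0
```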

Scheduler = bin-packing

Borg's core scheduling decision is bin packing: given a job with resource requests (CPU, RAM, disk, GPU/TPU, network, etc.) and a fleet of machines with remaining capacity, pick a machine (or a set of machines) to run the job on. The 2025-07-29 post names the specific target of the RLM as "the numeric result of a specialized bin-packing algorithm used to efficiently allocate tasks to resources" — i.e. the scheduler's objective function, not raw CPU counters or memory utilisation.
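The shape of that decision can be sketched as best-fit placement over remaining capacity. Borg's real scorer is far richer (many resource dimensions, priorities, preemption); this CPU/RAM toy only shows the structure, and all names are illustrative.

```python
# Toy placement decision as bin packing: for one incoming job, pick the
# feasible machine whose remaining CPU capacity fits the request tightest.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Machine:
    name: str
    free_cpu: float
    free_ram_gb: float

def place(job_cpu: float, job_ram_gb: float,
          machines: List[Machine]) -> Optional[Machine]:
    feasible = [m for m in machines
                if m.free_cpu >= job_cpu and m.free_ram_gb >= job_ram_gb]
    if not feasible:
        return None  # no machine can host the job
    # Best fit: leave the smallest CPU slack, reducing stranded capacity.
    best = min(feasible, key=lambda m: m.free_cpu - job_cpu)
    best.free_cpu -= job_cpu
    best.free_ram_gb -= job_ram_gb
    return best

fleet = [Machine("m1", 8.0, 32.0), Machine("m2", 4.0, 16.0)]
assert place(3.0, 8.0, fleet).name == "m2"  # the tighter fit wins
```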

This is why Google frames the RLM as a simulator of Borg rather than a monitor: it predicts what Borg's scheduler would have decided, not what the hardware is currently doing.

What the wiki doesn't yet have

  • Borg's architecture itself (BorgMaster, Borglets, scheduler, Paxos-replicated state) — not introduced in the 2025-07-29 post, pending an ingested source that covers the 2015 paper.
  • Google Compute Unit (GCU) definition — referenced by performance prediction sources but not separately documented.
  • The digital twin's implementation — the 2025-07-29 post only uses it as a black box.

VM allocation as lifetime-aware bin-packing (2025-10-17)

The 2025-10-17 Google Research LAVA post re-opens Borg-adjacent scheduling as a second ML-for-systems angle on the same substrate — at a different layer from the 2025-07-29 RLM work. Where the RLM predicts the bin-packer's output (MIPS per GCU) so the scheduler can short-circuit the slow solver, the LAVA family augments the bin-packer's policy with learned VM lifetime predictions so placement itself becomes lifetime-aware (Source: sources/2025-10-17-google-solving-virtual-machine-puzzles-lava).

  • Problem framing. VM allocation is online bin-packing with pieces that "appear and disappear" at unknown times. Naive packing produces two named failure modes — resource stranding and empty-host loss — that the LAVA family explicitly targets.
  • Load-bearing primitive: continuous reprediction of the remaining-lifetime distribution. Replaces the naive single-prediction-at-creation approach, whose structural hazard is that "a single misprediction can tie up an entire host for an extended period, degrading efficiency".
  • Three insertion points: NILAS scoring, LAVA allocation, LARS rescheduling — see systems/lava-vm-scheduler for the full trio.
  • Production-deployment status on Borg: not disclosed in the raw capture; the arXiv paper is the authoritative source.
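A lifetime-aware score in the spirit the post describes might prefer hosts whose resident VMs are predicted to exit around the same time as the incoming VM, so hosts can drain fully instead of stranding capacity. This is a hedged sketch under assumed semantics: the alignment function and host-choice rule here are invented, and the real NILAS/LAVA scoring is specified in the paper.

```python
# Hypothetical lifetime-aware host choice: minimise mismatch between the
# incoming VM's predicted remaining lifetime and those of resident VMs.
# Lifetimes would come from a learned, continuously repredicted model;
# here they are plain numbers (e.g. hours).
from typing import Dict, List

def exit_alignment_score(incoming_lifetime: float,
                         resident_lifetimes: List[float]) -> float:
    """Lower is better: total predicted-exit-time mismatch on a host."""
    if not resident_lifetimes:
        return 0.0  # an empty host can always drain cleanly
    return sum(abs(incoming_lifetime - t) for t in resident_lifetimes)

def choose_host(incoming_lifetime: float,
                hosts: Dict[str, List[float]]) -> str:
    return min(hosts,
               key=lambda h: exit_alignment_score(incoming_lifetime, hosts[h]))

hosts = {
    "short-lived": [1.0, 2.0],     # predicted remaining lifetimes
    "long-lived": [100.0, 90.0],
}
assert choose_host(1.5, hosts) == "short-lived"
```

Continuous reprediction would update the `resident_lifetimes` values as VMs age, which is what distinguishes this regime from the single-prediction-at-creation approach the post criticises.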

Together the 2025-07-29 + 2025-10-17 pair pins two Google Research ML-for-systems proof points on Borg-adjacent infrastructure at different layers — output-prediction (RLM) and policy-intervention (LAVA / NILAS / LARS).

Online throughput scheduling theory (2026-02-11)

Google Research's 2026-02-11 "Scheduling in a changing world: Maximizing throughput with time-varying capacity" post introduces a third Borg-adjacent scheduling proof point — this time from the algorithmic-theory side rather than the ML-for-systems side. The production motivating example is named directly as "all data processing must finish by the nightly batch run", the exact shape of a Borg batch-job schedule (Source: sources/2026-02-11-google-scheduling-in-a-changing-world-time-varying-capacity).

  • Problem class. Online throughput-maximising scheduling under a time-varying capacity profile — the number of jobs the scheduler can run concurrently varies over wall-clock time (diurnal load, spot preemption, hardware failures).
  • Competitive-ratio landscape across preemption regimes. Non-preemptive online scheduling has competitive ratio approaching zero (one long-job commitment can starve arbitrarily many short jobs). Interrupt-and-restart preemption recovers the offline ½-competitive bound via the earliest-finish-job greedy. Interrupt-without-restart is adversarially unwinnable in general but becomes constant-competitive under common deadlines.
  • Algorithmic primitive: tentative schedule revised by a fixed four-action rule on each job arrival (unit-capacity common-deadline variant). The full four-action specification is in the paper but not in the raw capture.
  • Production-deployment status on Borg: not disclosed in the raw capture; the paper is research-side algorithmic theory with production-shape motivation.
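The earliest-finish-job greedy under interrupt-and-restart can be illustrated with a single-machine toy. This sketches only the per-arrival decision rule, not the competitive analysis; the data shapes are invented.

```python
# Toy earliest-finish greedy with interrupt-and-restart preemption:
# on each arrival, keep whichever job would finish earliest. A preempted
# job loses all progress and must restart from scratch if rerun.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Job:
    name: str
    length: float  # processing time; a restart pays this again in full

def on_arrival(now: float,
               running: Optional[Tuple[Job, float]],  # (job, finish time)
               new: Job) -> Tuple[Job, float]:
    """Decide which job occupies the machine after an arrival at `now`."""
    new_finish = now + new.length
    if running is None or new_finish < running[1]:
        return (new, new_finish)  # preempt: the new job finishes earlier
    return running                # keep the current job

# A long job is preempted by a short one that would finish first:
state = on_arrival(0.0, None, Job("long", 10.0))
state = on_arrival(1.0, state, Job("short", 2.0))
assert state[0].name == "short" and state[1] == 3.0
```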

With the 2026-02-11 entry the wiki now has three Google Research proof points on Borg-adjacent scheduling at three different intervention layers:

  • Bin-packer output prediction: ML approximator for MIPS-per-GCU (2025-07-29, systems/regression-language-model, RLM)
  • VM-allocation policy: learned-lifetime distribution + continuous reprediction (2025-10-17, systems/lava-vm-scheduler, LAVA / NILAS / LARS)
  • Online-throughput scheduling theory: competitive-ratio analysis + tentative-schedule revision (2026-02-11, sources/2026-02-11-google-scheduling-in-a-changing-world-time-varying-capacity)

Each post targets a different layer of Borg's scheduling stack with a different primary primitive — the recurring shape is "Borg scheduling is rich enough to support orthogonal interventions at the prediction, policy, and theory layers simultaneously".
