SYSTEM Cited by 1 source

OpsPlanner

What it is

OpsPlanner is Meta's unified disruptive-work orchestrator for the production fleet — the single choke point through which every maintenance operation (firmware update, OS patch, driver upgrade, reboot, hardware verification) is serialised, scheduled, and gated before it runs.

OpsPlanner was introduced publicly in the 2024-06 "Maintaining large-scale AI capacity at Meta" post as the coordination substrate that keeps Meta's maintenance trains safe across >30 operation types and >50 component classes simultaneously.

Capabilities named in the source

From sources/2024-06-16-meta-maintaining-large-scale-ai-capacity-at-meta:

  1. Overlap serialisation. OpsPlanner "can work on overlapping scopes of operations and correctly serialize them." Different maintenance operations may target intersecting host sets; OpsPlanner orders them so they don't clobber each other.
  2. Safe drain / return to production. It "takes them safely out and into production" — owning the transition of a host from serving state → maintenance state → verified → back in service.
  3. Built-in handover flow. "It has a built-in handover flow that ensures correct escalation behavior and avoids overlaps and deadlocks." Handover is the load-bearing primitive: a host finishing one operation is handed cleanly to the next operation's workers, without its capacity being unlocked prematurely.
  4. Pre-return verification. "OpsPlanner can also ensure upgrades are applied to hosts before they are returned to production." A host with a half-finished upgrade cannot re-enter service.
  5. Owns buffers. "OpsPlanner owns planned maintenance and failure buffers and safeguards them." This is the load-bearing property that lets Meta run at capacity while still having headroom to absorb failures during maintenance: the buffer isn't advisory, it's enforced centrally.
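
The first four capabilities can be illustrated with a toy model. This is a minimal sketch, not Meta's implementation, which the source does not describe. Every name in it (ToyPlanner, Operation, HostState) is hypothetical. It assumes each operation declares the set of hosts it touches, serialises operations with intersecting scopes through a per-host FIFO queue, and refuses to return a host to production while work is still pending on it.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class HostState(Enum):
    SERVING = "serving"
    MAINTENANCE = "maintenance"


@dataclass(frozen=True)
class Operation:
    name: str
    hosts: frozenset  # scope: the hosts this operation must touch


class ToyPlanner:
    """Hypothetical per-host FIFO serialiser; not OpsPlanner's real design."""

    def __init__(self, hosts):
        self.state = {h: HostState.SERVING for h in hosts}
        self.queue = {h: deque() for h in hosts}  # pending op names per host
        self.log = []  # (host, op) pairs in execution order

    def submit(self, op):
        # Overlapping scopes are serialised: each host has its own queue,
        # so two operations on the same host run in submission order.
        for h in sorted(op.hosts):
            self.queue[h].append(op.name)

    def run_host(self, host):
        # Drain once, run every pending operation back to back (handover),
        # and only then return the host to production.
        if not self.queue[host]:
            return
        self.state[host] = HostState.MAINTENANCE
        while self.queue[host]:
            self.log.append((host, self.queue[host].popleft()))
        self.return_to_production(host)

    def return_to_production(self, host):
        # Pre-return verification: a host with pending work stays out.
        if self.queue[host]:
            raise RuntimeError(f"{host} still has pending operations")
        self.state[host] = HostState.SERVING


planner = ToyPlanner(["h1", "h2", "h3"])
planner.submit(Operation("firmware", frozenset({"h1", "h2"})))
planner.submit(Operation("driver", frozenset({"h2", "h3"})))
for host in ["h1", "h2", "h3"]:
    planner.run_host(host)
print(planner.log)
```

In this toy run, h2 sits in both scopes, so it receives the firmware operation and then the driver operation during a single maintenance window before going back into service. A real system would also need escalation and deadlock handling, which this sketch leaves out.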

Scale disclosure

  • ~1,000,000 operations per day — Meta's only explicit throughput datum for OpsPlanner (source text).
  • Fleet size implied: "dozens of AI clusters" + Meta's non-AI compute + storage — OpsPlanner handles maintenance across all, not only AI. "Meta maintains its fleet of clusters using a technique called maintenance trains. This is used for all capacity, including compute and storage capacity."
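
For a sense of scale, the daily figure can be converted into an average rate. This is plain arithmetic on the one number the source gives; it assumes operations are spread evenly over the day, which the source does not claim, so real peaks are likely higher.

```python
OPS_PER_DAY = 1_000_000          # approximate figure from the source
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

ops_per_second = OPS_PER_DAY / SECONDS_PER_DAY
print(f"{ops_per_second:.1f} operations per second on average")
```

That works out to roughly 11.6 operations per second, sustained around the clock.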

Safety features adjacent to OpsPlanner

The post names these as fleet-level safety features in the same section; they are most naturally implemented as OpsPlanner policies, though the post does not confirm attribution:

  • Autostop of maintenance trains if buffers are exhausted — trains halt when they'd take capacity below the planned / failure buffer floor.
  • Automatic offboarding of failing upgrades — a patch class observed to fail is removed from active rollout without human intervention.
  • Rollout phases: "only well-tested changes reach global systems"; OpsPlanner most likely owns the phase-to-phase graduation gates.
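
The autostop and offboarding behaviours can be sketched as a small policy object. This is an illustrative sketch under stated assumptions, not a description of how Meta implements either feature. The class TrainGuard, its method names, and the thresholds (a 20% failure rate over at least 5 samples) are all hypothetical; the post gives no numbers.

```python
from dataclasses import dataclass, field


@dataclass
class TrainGuard:
    """Hypothetical safety policy; names and thresholds are illustrative."""
    in_service: int                 # hosts currently serving
    min_in_service: int             # floor implied by the buffers
    max_failure_rate: float = 0.2   # assumed offboarding threshold
    min_samples: int = 5            # assumed minimum evidence before offboarding
    results: dict = field(default_factory=dict)   # upgrade -> (passed, failed)
    offboarded: set = field(default_factory=set)

    def try_drain(self, upgrade: str) -> bool:
        if upgrade in self.offboarded:
            return False  # failing upgrade was removed from rollout
        if self.in_service - 1 < self.min_in_service:
            return False  # autostop: draining would breach the floor
        self.in_service -= 1
        return True

    def report(self, upgrade: str, ok: bool) -> None:
        passed, failed = self.results.get(upgrade, (0, 0))
        if ok:
            passed += 1
            self.in_service += 1  # verified host returns to production
        else:
            failed += 1           # failed host stays out, consuming buffer
        self.results[upgrade] = (passed, failed)
        total = passed + failed
        if total >= self.min_samples and failed / total > self.max_failure_rate:
            self.offboarded.add(upgrade)


# Autostop: only two hosts may leave service before the floor is hit.
guard = TrainGuard(in_service=10, min_in_service=8)
granted = [guard.try_drain("bios-update") for _ in range(3)]
print(granted)

# Offboarding: three failures out of four attempts exceeds the threshold.
flaky = TrainGuard(in_service=100, min_in_service=90, min_samples=4)
for ok in (True, False, False, False):
    flaky.try_drain("nic-driver")
    flaky.report("nic-driver", ok)
print(flaky.offboarded, flaky.try_drain("nic-driver"))
```

The point of keeping both checks in one place is the one the page makes about buffers: the floor is enforced at the moment a drain is requested, so no individual train can decide to ignore it.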

Why a unified orchestrator

The architectural motivation for OpsPlanner (inferred from the source) is that each of Meta's >50 component-upgrade domains historically had its own partial scheduler; at scale, these schedulers interfered with each other — one team's firmware upgrade stepping on another team's driver upgrade on the same host. Unification is the response: one queue, one policy engine, one buffer owner.

This is structurally parallel to the "central proxy choke point" pattern used elsewhere in the wiki for API / request traffic — here applied to fleet-mutating operations rather than serving traffic.

What's not disclosed

  • Implementation: queue technology, scheduler algorithm, consensus substrate, state store are not described.
  • Policy model: how priorities between operations are expressed, how escalation paths are defined.
  • Failure modes: what happens if OpsPlanner itself is unavailable is not stated. It could fail open (operations proceed uncoordinated) or fail closed (maintenance stalls and fleet drift accumulates).
  • Integration with job schedulers: how OpsPlanner interacts with Meta's training job scheduler to honor the "contract with services that allows them to avoid maintenance-train interruptions" is not described.

Seen in
