
Fleet patching

Definition

Fleet patching is the operational capability of a managed-service vendor to deploy a software change (security patch, bug fix, config update) across the whole fleet of customer instances within a bounded time window, with known velocity, known failure semantics, and a known customer-courtesy contract for scheduling.

Distinct from in-place upgrade (one cluster, customer-driven), staged rollout (vendor-side but not scoped to a single change), and disaster recovery (fleet-wide but event-driven, not software-driven). Fleet patching is the specific capability of "here is a binary / config / rule; deploy it everywhere; keep customers running; tell me when it's done."

Why it matters

Fleet-patching velocity is the primary defence for most users of any managed service when a vulnerability ships in the vendor-operated substrate. The position of the shared-responsibility line determines who owns patch velocity — the vendor for managed-tier users, the customer for self-hosted-tier users — and the vendor's fleet-patching capability sets the upper bound on how quickly the managed tier can be made safe.

Also the enabling capability for vendor-first-patch coordinated disclosure: if the vendor can patch the fleet faster than attackers can develop exploits after CVE publication, the disclosure clock can be collapsed inside the vendor boundary.

The velocity-vs-safety trade-off

Fleet patching has four axes the vendor tunes:

  1. Velocity — how fast the change reaches the last instance. Bounded by the number of parallel patch workers, rollback capability, and customer-courtesy contracts (maintenance windows).
  2. Safety — how much a bad patch can damage. Mitigated by canary-first rollout, region-by-region phasing, staged rollout, and fast-rollback tooling.
  3. Customer courtesy — how much scheduling control customers keep. Bounded by the maintenance-window contract: customer controls routine cadence, vendor reserves an emergency channel.
  4. Observability — how the vendor knows the patch landed cleanly. Needs per-instance state ("patched" / "failed" / "deferred"), aggregate SLO tracking, and incident-response triggers if the patch itself causes errors.
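
The interaction of these axes can be sketched in code. This is a minimal illustration, not any vendor's actual tooling; all names (`roll_out`, `PatchState`, the error-budget threshold) are hypothetical. It shows canary-first region phasing (safety), deferral for maintenance-window customers (courtesy), per-instance state (observability), and an abort condition that trades velocity for safety when the canary trips an error budget:

```python
from dataclasses import dataclass
from enum import Enum

class PatchState(Enum):
    PENDING = "pending"
    PATCHED = "patched"
    FAILED = "failed"
    DEFERRED = "deferred"   # maintenance-window customer, patched later

@dataclass
class Instance:
    instance_id: str
    region: str
    in_maintenance_window: bool = False
    state: PatchState = PatchState.PENDING

def roll_out(instances, apply_patch, canary_region, error_budget=0.01):
    """Canary-first, region-by-region phased rollout (illustrative).

    Patches the canary region first and halts before touching any
    further region if the failure rate so far exceeds the error
    budget. Maintenance-window instances are deferred, not forced.
    Returns (completed, halted_region).
    """
    regions = [canary_region] + sorted(
        {i.region for i in instances} - {canary_region})
    for region in regions:
        cohort = [i for i in instances if i.region == region]
        for inst in cohort:
            if inst.in_maintenance_window:
                inst.state = PatchState.DEFERRED
                continue
            ok = apply_patch(inst)   # vendor-side patch worker
            inst.state = PatchState.PATCHED if ok else PatchState.FAILED
        attempted = [i for i in cohort
                     if i.state in (PatchState.PATCHED, PatchState.FAILED)]
        failed = sum(1 for i in attempted if i.state is PatchState.FAILED)
        if attempted and failed / len(attempted) > error_budget:
            return False, region   # phase tripped the budget: stop here
    return True, None
```

Pushing velocity in this sketch means more parallel workers per cohort; pushing safety means smaller canary cohorts and a tighter `error_budget`. The four axes are visible as knobs, which is the point of the list above.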

Pushing velocity without the other three produces self-inflicted outages of the shape documented in the Cloudflare 2025-12-05 outage — a fast, fleet-wide, insufficiently staged rollout.

Canonical datapoint

MongoDB CVE-2025-14847 / "Mongobleed" (2025-12-12 → 2025-12-18):

  • Scale: "tens of thousands of Atlas customers and hundreds of thousands of Atlas instances" (order-of-magnitude, undecomposed).
  • Velocity: ~4.7 days detection → majority of fleet; ~6 days detection → full fleet including maintenance-window customers.
  • Courtesy contract preserved: maintenance-window customers got ~15 hours of pre-notification (2025-12-17 21:00 → 2025-12-18) before their forced patch landed. "Maintenance window" as a concept is honoured; emergency override is pre-announced rather than silent.
  • Tiered rollout: Atlas (managed, vendor-driven) vs Enterprise Advanced (customer-driven, patched-version distribution) vs Community Edition (customer-driven, community-forum notification). Three customer populations, three rollout velocities, three shared-responsibility line positions.

Not reproducible as a benchmark — no per-region / per-tier / per-cluster-size breakdown, no success-rate or rollback statistics — but it sets an industry reference point for managed-database fleet-patching velocity.
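
A back-of-envelope reading of those figures: since the scale is stated only as an order of magnitude, the fleet size below is an assumption (300k as a stand-in for "hundreds of thousands"), and the result is illustrative, not a measured rate:

```python
# Implied sustained patch rate from the public figures.
# fleet_size and majority_fraction are assumptions, not disclosed values.
fleet_size = 300_000        # "hundreds of thousands of Atlas instances"
majority_fraction = 0.5     # "majority of fleet"
days_to_majority = 4.7      # detection -> majority

instances_patched = fleet_size * majority_fraction
seconds = days_to_majority * 24 * 3600
rate_per_second = instances_patched / seconds
print(f"~{rate_per_second:.2f} instances/second sustained")
```

Even under these loose assumptions the implied rate is a fraction of an instance per second sustained for days — a useful intuition for why parallel patch workers and per-instance state tracking, not raw patch speed on one cluster, dominate fleet-patching velocity.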

Structural requirements for fleet patching

From the MongoDB post and the broader wiki:

  • Vendor owns the deployment substrate. Atlas's managed provisioning + scaling + backup is the precondition; vendors who can't touch customer runtime can't fleet-patch.
  • Per-instance state tracking. Which instance is on which binary version; which patches landed; which are deferred.
  • Rollback path. If the patch breaks, one-click (or one-script) reversion to the previous known-good version. Cloudflare's incident retrospectives repeatedly emphasise rollback as the load-bearing failure-response primitive.
  • Customer-courtesy contract — documented maintenance windows with an explicit emergency-override policy. Courtesy is honoured by pre-notification, not silence.
  • Observability across the rollout. Aggregate status board, per-region velocity metrics, error-rate deltas correlated with patch progress.
  • Communication plan for customer-side coordination: status-page updates, email to affected customers, post-hoc retrospective publication.
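
The per-instance state tracking and observability requirements above reduce to a small aggregation step. A minimal sketch, with a hypothetical `status_board` helper consuming the "patched" / "failed" / "deferred" state vocabulary used in this article (plus "pending"):

```python
from collections import Counter

def status_board(states):
    """Aggregate rollout status from per-instance state strings.

    Returns fleet coverage and failure rate plus raw counts — the
    shape of signal an SLO tracker or an incident-response trigger
    (halt the rollout if failure_rate spikes) would consume.
    """
    counts = Counter(states)
    total = sum(counts.values())
    coverage = counts["patched"] / total if total else 0.0
    failure_rate = counts["failed"] / total if total else 0.0
    return {"total": total, "coverage": coverage,
            "failure_rate": failure_rate, **counts}
```

Sliced per region, the same aggregation yields the per-region velocity metrics; correlated with error-rate deltas, it yields the incident-response trigger named in the observability requirement.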

Relationship to other fleet-wide operational capabilities

  • Fleet-patching ≠ fleet-rollout. Rollout (feature release) has different risk characteristics — generally longer time windows, heavier staging, often opt-in. Patching is often mandatory, fast, opt-out-restricted.
  • Fleet-patching ≠ configuration push. Config push is the highest-velocity variant (seconds to minutes fleet-wide); binary patching is typically the slower variant (minutes to hours to days). See killswitch + global configuration system pairing in Cloudflare's architecture.
  • Fleet-patching does not substitute for defense-in-depth. Even fast fleet-patching leaves a detection-to-fix window open; the vulnerability may have been exploitable before detection. Defense-in-depth layers below the patched defect — auth / network / encryption / audit — catch what the patch didn't.
