PATTERN

Rolling instance upgrade

Problem

Upgrading a stateful database fleet (new version, new instance class, new kernel) without downtime + without the 2× cost penalty of blue/green.

Solution

Replace fleet units one at a time under a proxy tier that routes around the in-flight replacement. Each unit is drained, replaced with the new version, brought back into service, then the next unit begins. At any moment only a small fraction of the fleet is mid-upgrade, so availability is preserved without fleet duplication.

Three ingredients compose the pattern at the database tier:

  1. Unit-of-replacement granularity — a tablet (Kubernetes pod running mysqld + vttablet sidecar) is small enough that replacing one at a time is acceptable but large enough that the fleet-wide upgrade completes in reasonable time.

  2. Proxy tier routing around unavailable units — vtgate routes traffic away from tablets that are draining / replacing / warming, so clients never see the individual replacements.

  3. Automatic tablet replacement on failure — "If a tablet goes down for any reason, our systems automatically reroute traffic to a functional tablet and allocate another tablet to replace the downed instance" (Morrison II, 2024). The same substrate that handles unplanned failures handles planned upgrades.
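The loop implied by the three ingredients can be sketched as follows. This is a minimal illustration, not PlanetScale's implementation; `drain`, `replace`, and `is_serving` are hypothetical stand-ins for the orchestrator's real primitives (in Vitess terms: drain the tablet, replace the pod, wait for vttablet to report healthy and rejoin the serving set).

```python
import time

def rolling_upgrade(tablets, drain, replace, is_serving, poll_secs=5):
    """Replace fleet units one at a time; at most one is ever mid-upgrade."""
    for tablet in tablets:
        drain(tablet)    # proxy tier stops routing new queries to this unit
        replace(tablet)  # swap in the new version (new pod / instance class)
        while not is_serving(tablet):  # warm up before touching the next unit
            time.sleep(poll_secs)
    # fleet is now fully on the new version, with no second fleet provisioned
```

The strictly sequential `for` loop is the availability guarantee: only one unit is unavailable at any moment, so the proxy tier always has the rest of the fleet to route to.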

Canonical implementation

PlanetScale on [[systems/vitess|Vitess]] on Kubernetes. Use cases:

  • Instance-class resizing — customer picks new instance type, backend rolls through fleet. "This allows your applications to continue to operate without being taken offline."
  • MySQL version upgrades — validated centrally, rolled through fleet; no maintenance window.
  • Kernel / OS upgrades — tablets are Kubernetes pods; pod replacement is the upgrade primitive.

Trade-offs

Accepts:

  • Mixed-version fleet state during upgrade — requires upgrades be backward-compatible across all tablets in flight simultaneously. This constrains what version-upgrade shapes are safe under rolling.
  • Per-unit connection drain — clients on the replaced tablet see their connection closed; must reconnect (proxy handles routing).
  • Fleet-wide completion time is longer than a blue/green cutover — rolling through 32 tablets one at a time takes longer than one coordinated switchover.

Gains:

  • ~1× fleet cost (vs 2× for blue/green).
  • No coordinated fleet-wide connection drop — only the one draining tablet's connections.
  • No maintenance window — upgrades run continuously under normal traffic.
  • Unplanned-failure + planned-upgrade path unified — same substrate handles both.

Contrast: blue/green alternative

patterns/blue-green-database-deployment is the coordinated-switchover alternative: 2× fleet cost during upgrade, single coordinated cutover with all-connections drop, but no mixed-version state during upgrade. Blue/green wins on isolation; rolling wins on cost + cutover smoothness.
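The cost/time trade-off can be made concrete with back-of-envelope arithmetic. The durations below are illustrative values, not figures from the source; the rolling peak-fleet factor assumes each replacement pod briefly runs alongside the unit it replaces.

```python
def rolling(n_tablets, per_tablet_mins):
    # One unit mid-upgrade at a time: long wall clock, ~1x peak fleet size.
    return {
        "wall_clock_mins": n_tablets * per_tablet_mins,
        "peak_fleet_x": 1 + 1 / n_tablets,  # old fleet + one replacement pod
    }

def blue_green(provision_mins, cutover_mins):
    # Whole second fleet provisioned: short cutover, 2x peak fleet size.
    return {
        "wall_clock_mins": provision_mins + cutover_mins,
        "peak_fleet_x": 2,
    }

print(rolling(32, 10))       # 32 tablets x 10 min each -> 320 min, ~1.03x fleet
print(blue_green(60, 5))     # -> 65 min, 2x fleet
```

Same conclusion as the contrast above: blue/green finishes in one short window but pays for two fleets; rolling stays near 1× cost but the fleet-wide completion time scales linearly with fleet size.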

Seen in
