PATTERN

Rolling instance upgrade

Problem

Upgrading a stateful database fleet (new version, new instance class, new kernel) without downtime + without the 2× cost penalty of blue/green.

Solution

Replace fleet units one at a time under a proxy tier that routes around the in-flight replacement. Each unit is drained, replaced with the new version, brought back into service, then the next unit begins. At any moment only a small fraction of the fleet is mid-upgrade, so availability is preserved without fleet duplication.

Three ingredients compose the pattern at the database tier:

  1. Unit-of-replacement granularity — a tablet (Kubernetes pod running mysqld + vttablet sidecar) is small enough that replacing one at a time is acceptable but large enough that the fleet-wide upgrade completes in reasonable time.

  2. Proxy tier routing around unavailable units — vtgate routes traffic away from tablets that are draining / replacing / warming, so clients never see the individual replacements.

  3. Automatic tablet replacement on failure — "If a tablet goes down for any reason, our systems automatically reroute traffic to a functional tablet and allocate another tablet to replace the downed instance" (Morrison II, 2024). The same substrate that handles unplanned failures handles planned upgrades.
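The loop implied by the three ingredients can be sketched as follows. This is a minimal illustration, not PlanetScale's implementation; `drain`, `replace`, and `is_serving` are hypothetical stand-ins for the orchestrator's real primitives (in Vitess terms: drain the tablet, replace the pod, wait for vttablet to report healthy and rejoin the serving set).

```python
import time

def rolling_upgrade(tablets, drain, replace, is_serving, poll_secs=5):
    """Replace fleet units one at a time; at most one is ever mid-upgrade."""
    for tablet in tablets:
        drain(tablet)    # proxy tier stops routing new queries to this unit
        replace(tablet)  # swap in the new version (new pod / instance class)
        while not is_serving(tablet):  # warm up before touching the next unit
            time.sleep(poll_secs)
    # fleet is now fully on the new version, with no second fleet provisioned
```

The strictly sequential `for` loop is the availability guarantee: only one unit is unavailable at any moment, so the proxy tier always has the rest of the fleet to route to.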

Canonical implementation

PlanetScale on [[systems/vitess|Vitess]] on Kubernetes. Use cases:

  • Instance-class resizing — customer picks new instance type, backend rolls through fleet. "This allows your applications to continue to operate without being taken offline."
  • MySQL version upgrades — validated centrally, rolled through fleet; no maintenance window.
  • Kernel / OS upgrades — tablets are Kubernetes pods; pod replacement is the upgrade primitive.

Trade-offs

Accepts:

  • Mixed-version fleet state during upgrade — requires upgrades be backward-compatible across all tablets in flight simultaneously. This constrains what version-upgrade shapes are safe under rolling.
  • Per-unit connection drain — clients on the replaced tablet see their connection closed; must reconnect (proxy handles routing).
  • Fleet-wide completion time is longer than a blue/green cutover — rolling through 32 tablets one at a time takes longer than one coordinated switchover.

Gains:

  • ~1× fleet cost (vs 2× for blue/green).
  • No coordinated fleet-wide connection drop — only the one draining tablet's connections.
  • No maintenance window — upgrades run continuously under normal traffic.
  • Unplanned-failure + planned-upgrade path unified — same substrate handles both.

Contrast: blue/green alternative

patterns/blue-green-database-deployment is the coordinated-switchover alternative: 2× fleet cost during upgrade, single coordinated cutover with all-connections drop, but no mixed-version state during upgrade. Blue/green wins on isolation; rolling wins on cost + cutover smoothness.
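The cost/time trade-off can be made concrete with back-of-envelope arithmetic. The durations below are illustrative values, not figures from the source; the rolling peak-fleet factor assumes each replacement pod briefly runs alongside the unit it replaces.

```python
def rolling(n_tablets, per_tablet_mins):
    # One unit mid-upgrade at a time: long wall clock, ~1x peak fleet size.
    return {
        "wall_clock_mins": n_tablets * per_tablet_mins,
        "peak_fleet_x": 1 + 1 / n_tablets,  # old fleet + one replacement pod
    }

def blue_green(provision_mins, cutover_mins):
    # Whole second fleet provisioned: short cutover, 2x peak fleet size.
    return {
        "wall_clock_mins": provision_mins + cutover_mins,
        "peak_fleet_x": 2,
    }

print(rolling(32, 10))       # 32 tablets x 10 min each -> 320 min, ~1.03x fleet
print(blue_green(60, 5))     # -> 65 min, 2x fleet
```

Same conclusion as the contrast above: blue/green finishes in one short window but pays for two fleets; rolling stays near 1× cost but the fleet-wide completion time scales linearly with fleet size.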

Seen in
