
PATTERN

Hot-swap retrofit (fleet upgrade in flight)

Intent

Upgrade the live production fleet in place, one server at a time, without customer-visible disruption — even when the upgrade involves physical hardware (not just config or software). The canonical line: "converting a propeller aircraft to a jet while it was in flight."

Context

You have a large installed base of servers running customer workloads under an SLA. You discover a higher-performing hardware component (faster media, new NIC, offload card). Two naive options:

  • Full fleet replacement. Field-replace every server — prohibitive cost, long timeline, and you carry two fleets until done.
  • New-only rollout. Only new hardware gets the upgrade; existing workloads stay slow forever — or you force-migrate them, which is itself disruptive.

Hot-swap retrofit is the third path: put the new hardware into the existing chassis and let software start using it.

Mechanism (EBS 2013 SSD retrofit, as narrated by Marc Olson)

  1. Identify chassis headroom. In EBS's HDD storage servers, the only slot that didn't disrupt cooling airflow was between the motherboard and the fans.
  2. Secure the new hardware mechanically. SSDs are light, but they can't be left loose in the chassis. EBS used industrial-strength, heat-resistant hook-and-loop fastening tape (working with material scientists to find the right formulation).
  3. Design software to use the new component as a staging tier, not a replacement. Writes land on the SSD first; ack-on-SSD is returned to the application; async flush to HDD happens in the background. This preserves the HDD's durability/capacity role while the SSD absorbs write latency.
  4. Do it server-by-server, over months, with zero customer-visible disruption.
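Step 3's staging tier can be sketched in a few lines. This is a minimal illustration, not EBS's implementation: in-memory dicts stand in for the SSD and HDD, and the class and method names are hypothetical. The essential property is that `write()` acknowledges as soon as data lands on the fast tier, while a background worker flushes to the capacity tier.

```python
import queue
import threading

class WriteStagingTier:
    """Illustrative write-staging tier: ack on the fast device (SSD),
    flush asynchronously to the durable/capacity device (HDD).
    Dicts stand in for the physical media."""

    def __init__(self):
        self.ssd = {}   # fast staging tier
        self.hdd = {}   # durable capacity tier
        self._flush_q = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def write(self, block_id, data):
        # Write lands on the SSD; the ack is returned immediately,
        # so customer-visible latency is the SSD write alone.
        self.ssd[block_id] = data
        self._flush_q.put(block_id)
        return "ack"

    def read(self, block_id):
        # Serve from the staging tier if the block hasn't flushed yet.
        if block_id in self.ssd:
            return self.ssd[block_id]
        return self.hdd[block_id]

    def _flush_loop(self):
        while True:
            block_id = self._flush_q.get()
            # Async background flush: copy to HDD, then drop the staged copy,
            # so a concurrent read always finds the block somewhere.
            data = self.ssd[block_id]
            self.hdd[block_id] = data
            del self.ssd[block_id]
            self._flush_q.task_done()

    def drain(self):
        # Wait for all staged writes to reach the HDD
        # (e.g. before a maintenance event on this server).
        self._flush_q.join()
```

Note the ordering in `_flush_loop`: the block is written to the HDD before it is removed from the SSD, so readers never observe a window where the data is in neither tier.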

Prerequisites

  • patterns/nondisruptive-migration — the system must already support migrating live tenants off a server before you touch it.
  • Software architecture that can treat the new component as an optional accelerant, not as a schema change. (A write-staging SSD fits. A fundamentally different media shape would not.)
  • Serviceability built in from day one. Olson is explicit: "we designed our system from the start with non-disruptive maintenance events in mind."
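The first prerequisite, combined with the server-by-server rollout, amounts to a simple loop: cordon a server, live-migrate its tenants to peers, do the physical work on an empty box, rejoin the fleet. A rough sketch under assumed names (`Server`, `retrofit_fleet`, and the migration-by-list-append are all illustrative stand-ins for real placement and live-migration machinery):

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    volumes: list = field(default_factory=list)
    accepting: bool = True   # eligible for new volume placement
    has_ssd: bool = False

def retrofit_fleet(servers):
    """One server at a time: drain live tenants, do the physical
    retrofit, rejoin the fleet. No tenant is ever offline."""
    for server in servers:
        server.accepting = False                  # cordon: no new placements here
        peers = [s for s in servers if s is not server and s.accepting]
        for vol in list(server.volumes):
            # Stand-in for nondisruptive live migration to the
            # least-loaded peer; the real system moves the volume
            # while the tenant keeps doing I/O.
            target = min(peers, key=lambda s: len(s.volumes))
            target.volumes.append(vol)
            server.volumes.remove(vol)
        server.has_ssd = True                      # physical work on an empty server
        server.accepting = True                    # uncordon: rejoin with the new tier
```

The loop only works because migration is nondisruptive and because each retrofitted server immediately rejoins the donor pool, so the fleet never loses more than one server of capacity at a time.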

Compounding benefit

The same mechanism (retarget volumes; rebuild empty servers; upgrade software) paid off across every subsequent EBS hardware transition: new storage-server types, systems/nitro offload, systems/aws-nitro-ssd. Some EBS volumes from the first few months of 2008 are still live after crossing hundreds of underlying servers and multiple hardware generations.

When to use

  • Large fleet, expensive to replace wholesale.
  • New component is a net-additive accelerant compatible with existing data flow.
  • You can afford an in-person, at-scale, physical operation (staff time, logistics).

When not to use

  • Upgrade changes fundamental correctness or consistency guarantees.
  • New component requires a new data layout / migration the system can't hide from the tenant.
