PATTERN
Hot-swap retrofit (fleet upgrade in flight)¶
Intent¶
Upgrade the live production fleet in place, one server at a time, without customer-visible disruption — even when the upgrade involves physical hardware (not just config or software). The canonical line: "converting a propeller aircraft to a jet while it was in flight."
Context¶
You have a large installed base of servers running customer workloads under an SLA. You discover a higher-performing hardware component (faster media, new NIC, offload card). Two naive options:
- Full fleet replacement. Field-replace every server — prohibitive cost, long timeline, and you carry two fleets until done.
- New-only rollout. Only new hardware gets the upgrade; existing workloads stay slow forever — or you force-migrate them, which is itself disruptive.
Hot-swap retrofit is the third path: put the new hardware into the existing chassis and let software start using it.
Mechanism (EBS 2013 SSD retrofit, as narrated by Marc Olson)¶
- Identify chassis headroom. In EBS's HDD storage servers, the only slot that didn't disrupt cooling airflow was between the motherboard and the fans.
- Find a mechanical way to secure the new hardware. SSDs are light but can't be left loose in the chassis. EBS used industrial-strength, heat-resistant hook-and-loop fastening tape (and worked with material scientists to find the right formulation).
- Design software to use the new component as a staging tier, not a replacement. Writes land on the SSD first and are acknowledged to the application as soon as they are durable there; an asynchronous flush to HDD happens in the background. This preserves the HDD's durability/capacity role while the SSD absorbs write latency.
- Do it server-by-server, over months, with zero customer-visible disruption.
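The staging-tier behavior above can be sketched as a small write-back tier. This is a minimal illustration of the shape, not EBS's actual code; all class and method names are hypothetical:

```python
import collections
import threading

class WriteStagingStore:
    """Write-back staging tier: fast media absorbs write latency,
    slow media remains the durable/capacity layer."""

    def __init__(self, ssd, hdd):
        self.ssd = ssd                     # fast tier: dict-like, low write latency
        self.hdd = hdd                     # slow tier: dict-like, high capacity
        self.dirty = collections.deque()   # keys staged on SSD, not yet on HDD
        self.lock = threading.Lock()

    def write(self, key, data):
        # Land the write on the SSD and ack immediately; the HDD copy
        # happens later, off the latency-sensitive path.
        with self.lock:
            self.ssd[key] = data
            self.dirty.append(key)
        return "ack"

    def read(self, key):
        # Staged data may exist only on the SSD, so check it first.
        with self.lock:
            if key in self.ssd:
                return self.ssd[key]
        return self.hdd[key]

    def flush_one(self):
        # Background flusher: move one staged write down to the HDD.
        with self.lock:
            if not self.dirty:
                return False
            key = self.dirty.popleft()
            if key in self.ssd:            # may already have been flushed
                self.hdd[key] = self.ssd.pop(key)
        return True
```

The design point is that the SSD never takes over the HDD's role: it only holds writes that have not yet been flushed, so a fundamentally different media shape is never exposed to the rest of the system.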
Prerequisites¶
- patterns/nondisruptive-migration — the system must already support migrating live tenants off a server before you touch it.
- Software architecture that can treat the new component as an optional accelerant, not as a schema change. (A write-staging SSD fits; a fundamentally different media shape would not.)
- Serviceability built in from day one. Olson is explicit: "we designed our system from the start with non-disruptive maintenance events in mind."
Compounding benefit¶
The same mechanism (retarget volumes; rebuild empty servers; upgrade software) paid off across every subsequent EBS hardware transition: new storage-server types, systems/nitro offload, systems/aws-nitro-ssd. Some EBS volumes from the first few months of 2008 are still live after crossing hundreds of underlying servers and multiple hardware generations.
When to use¶
- Large fleet, expensive to replace wholesale.
- New component is a net-additive accelerant compatible with existing data flow.
- You can afford an in-person, at-scale, physical operation (staff time, logistics).
When not to use¶
- Upgrade changes fundamental correctness or consistency guarantees.
- New component requires a new data layout / migration the system can't hide from the tenant.
Seen in¶
- sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws — the 2013 SSD-into-every-HDD-server retrofit at EBS; photo of the tape job included in the post.