SRD (Scalable Reliable Datagram)
Definition
SRD (Scalable Reliable Datagram) is AWS's data-center transport protocol, designed to replace TCP on internal paths where TCP's general-internet assumptions aren't earning their cost. First published as "A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC" (2020). Built to be easily offloaded into hardware (Nitro) and to exploit AWS-owned data-center topology.
Design choices (from the EBS post)
The EBS team laid the SRD foundation in 2014 with two key observations:
- "We didn't need to design for the general internet." We own the DC network; tune for it.
- Storage IOs in flight can be reordered. There is no requirement that the network preserve transmit order; barriers can be handled at the client before sending.
The signature SRD behaviors follow from these two observations:
- Multiple network paths. A single logical flow is fanned across many physical paths, smoothing load and reducing overflow at intermediate switches.
- No strict in-order delivery. Requests execute on arrival; clients sequence when needed.
- Failure-aware recovery. Rapid reroute around lossy or failing paths.
- Hardware-offload-friendly. Runs on Nitro cards, not on hypervisor/guest CPUs.
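The first two behaviors, together with client-side barriers, can be illustrated with a toy simulation. This is a minimal sketch and not SRD's actual algorithm: the round-robin path choice, the uniform per-packet delay, and the names `deliver`, `run_with_barriers`, and the `"BARRIER"` marker are all assumptions made for illustration.

```python
import random


def deliver(requests, num_paths, rng):
    """Spray one logical flow across num_paths paths; return arrival order.

    Path choice is round-robin (a hypothetical policy, not SRD's). Each
    request gets its own random delay, so arrival order can differ from
    send order.
    """
    arrivals = []
    for i, req in enumerate(requests):
        path = i % num_paths
        send_time = float(i)             # one request sent per time unit
        delay = rng.uniform(1.0, 10.0)   # toy per-packet path latency
        arrivals.append((send_time + delay, path, req))
    arrivals.sort(key=lambda a: a[0])
    return [(path, req) for _, path, req in arrivals]


def run_with_barriers(ops, num_paths, seed=0):
    """Execute ops on arrival; "BARRIER" entries are enforced by the client.

    The client sends everything up to a barrier, waits for that batch to
    complete, and only then sends the next batch. Neither the network nor
    the server has to preserve transmit order.
    """
    rng = random.Random(seed)
    executed, batch = [], []
    for op in ops + ["BARRIER"]:         # trailing barrier flushes the tail
        if op == "BARRIER":
            executed.extend(req for _, req in deliver(batch, num_paths, rng))
            batch = []
        else:
            batch.append(op)
    return executed


ops = ["w1", "w2", "w3", "BARRIER", "w4", "w5"]
order = run_with_barriers(ops, num_paths=4)
print(order)
```

Within a batch the execution order is whatever the paths produce, but every operation before the barrier completes before any operation after it. The ordering cost is paid only where the client asks for it.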
Counterintuitive finding mentioned
The earlier TCP-tuning work uncovered that adding a small amount of random latency to requests toward storage servers reduced average latency and outliers, because of the smoothing effect on the network. That tuning had a short half-life as scale kept moving; SRD is the architectural answer to the same problem.
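The smoothing effect can be sketched with a toy queue model. This is an illustration under stated assumptions, not a reconstruction of the EBS team's experiment: the single bounded queue, the one-request-per-tick service rate, and the function name `simulate` and its parameters are hypothetical. It measures only peak queue depth and drops, not end-to-end latency.

```python
import random


def simulate(num_requests, buffer_size, max_jitter, seed=0):
    """Toy switch queue: returns (peak_depth, drops).

    All requests are issued at tick 0, then each is delayed by a random
    whole number of ticks in [0, max_jitter]. The queue holds at most
    buffer_size requests, drops arrivals that find it full, and serves
    one request per tick.
    """
    rng = random.Random(seed)
    arrivals = {}
    for _ in range(num_requests):
        tick = rng.randint(0, max_jitter)
        arrivals[tick] = arrivals.get(tick, 0) + 1

    depth = peak = drops = 0
    for tick in range(max_jitter + 1):
        for _ in range(arrivals.get(tick, 0)):
            if depth < buffer_size:
                depth += 1
            else:
                drops += 1               # buffer overflow at the switch
        peak = max(peak, depth)
        depth = max(0, depth - 1)        # serve one request per tick
    return peak, drops


burst = simulate(num_requests=32, buffer_size=8, max_jitter=0)
jittered = simulate(num_requests=32, buffer_size=8, max_jitter=31)
print("no jitter:", burst)      # (8, 24): buffer fills, the rest is dropped
print("with jitter:", jittered)
```

Without jitter the whole burst hits the queue in one tick, so everything beyond the buffer is dropped. Spreading the same requests over time lets the queue drain between arrivals; in a real network, fewer drops would mean fewer retransmissions, which is one plausible route to the lower average and tail latency described above.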
Also useful outside storage
SRD is also exposed to EC2 guest networking as systems/ena-express, where it transparently accelerates guest TCP by riding on SRD's multi-path / lower-queue properties.
Seen in
- sources/2024-08-22-allthingsdistributed-continuous-reinvention-block-storage-at-aws: origin, motivation, and the observation that storage IOs can be reordered.