SRD (Scalable Reliable Datagram)

Definition

SRD (Scalable Reliable Datagram) is AWS's data-center transport protocol, designed to replace TCP on internal paths where TCP's general-internet assumptions cost more than they deliver. It was first published as "A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC" (2020). It is built to be easily offloaded into hardware (Nitro) and to exploit the topology of AWS-owned data centers.

Design choices (from the EBS post)

The EBS team laid the SRD foundation in 2014 with two key observations:

  1. "We didn't need to design for the general internet." We own the DC network; tune for it.
  2. Storage IOs in flight can be reordered. There is no requirement that the network preserve transmit order; barriers can be handled at the client before sending.
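
The second observation can be sketched as a client that enforces ordering itself, so the network never has to. This is a minimal illustration, not the EBS client: `BarrierClient`, `submit`, `barrier`, and `ack` are hypothetical names, and the "network" is just a list recording what was handed off.

```python
from collections import deque

BARRIER = object()  # sentinel marking an ordering point in the submit queue


class BarrierClient:
    """Hypothetical client that holds IOs behind a barrier until earlier IOs are acked."""

    def __init__(self):
        self.queue = deque()    # IOs (and barriers) not yet sent
        self.in_flight = set()  # IO ids sent but not yet acknowledged
        self.sent = []          # order in which IOs were handed to the network

    def submit(self, io_id):
        self.queue.append(io_id)
        self._pump()

    def barrier(self):
        self.queue.append(BARRIER)
        self._pump()

    def ack(self, io_id):
        # Acks may arrive in any order; the network gives no ordering guarantee.
        self.in_flight.discard(io_id)
        self._pump()

    def _pump(self):
        while self.queue:
            head = self.queue[0]
            if head is BARRIER:
                if self.in_flight:
                    return  # earlier IOs still outstanding; hold everything behind the barrier
                self.queue.popleft()
                continue
            self.in_flight.add(self.queue.popleft())
            self.sent.append(head)
```

IOs submitted between barriers go out immediately and may complete in any order; only the IOs queued after a barrier wait, and they wait at the client rather than in the transport.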

From these follow the signature SRD behaviors:

  • Multiple network paths. A single logical flow is fanned across many physical paths, smoothing load and reducing overflow at intermediate switches.
  • No strict in-order delivery. Requests execute on arrival; clients sequence when needed.
  • Failure-aware recovery. Rapid reroute around lossy or failing paths.
  • Hardware-offload-friendly. Runs on Nitro cards, not on hypervisor/guest CPUs.
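
The first three behaviors can be illustrated with a toy model. This is a sketch under stated assumptions, not SRD's actual algorithm: path choice here is plain round-robin (the source does not say how SRD picks paths), and `Sprayer`, `mark_failed`, and `deliver` are hypothetical names. Duplicate suppression in `deliver` is also an assumption, standing in for whatever the real protocol does after a retransmit.

```python
class Sprayer:
    """Toy sender: fans one logical flow across several paths, skipping failed ones."""

    def __init__(self, paths):
        self.healthy = list(paths)
        self._next = 0  # running packet counter used for round-robin selection

    def mark_failed(self, path):
        # Reroute: stop using a lossy or failing path for subsequent packets.
        if path in self.healthy:
            self.healthy.remove(path)

    def pick_path(self):
        if not self.healthy:
            raise RuntimeError("no healthy paths")
        path = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return path

    def send(self, packets):
        # Each packet of the same flow may take a different path.
        return [(self.pick_path(), p) for p in packets]


def deliver(arrivals, handler):
    """Toy receiver: run each request on arrival, with no reorder buffer."""
    done = set()
    for seq, payload in arrivals:
        if seq in done:
            continue  # assumed: drop a duplicate caused by a retransmit
        handler(seq, payload)
        done.add(seq)
    return done
```

Because the receiver never waits for a missing sequence number, one slow path delays only the packets on it instead of stalling the whole flow.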

Counterintuitive finding mentioned

The earlier TCP-tuning work uncovered that adding a small amount of random latency to requests toward storage servers reduced both average latency and outliers, because of its smoothing effect on the network. That tuning had a short half-life as scale kept growing; SRD is the architectural answer to the same problem.
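
The smoothing effect can be made concrete with a small sketch. It assumes a synchronized burst of requests and measures only how many land in the same short time window; the function names, the 1 ms jitter bound, and the 100 µs window are illustrative choices, not values from the EBS post, and the sketch does not model switch queues or latency itself.

```python
import random
from collections import Counter


def add_jitter(send_times, max_jitter, rng):
    """Delay each request by a random amount in [0, max_jitter] seconds."""
    return [t + rng.uniform(0.0, max_jitter) for t in send_times]


def peak_per_window(send_times, window):
    """Largest number of requests that fall into any single time window."""
    buckets = Counter(int(t // window) for t in send_times)
    return max(buckets.values())


rng = random.Random(0)                     # seeded so the run is repeatable
burst = [0.0] * 100                        # 100 requests issued at the same instant
jittered = add_jitter(burst, 0.001, rng)   # spread over up to 1 ms

print(peak_per_window(burst, 0.0001))      # whole burst hits one 100 µs window
print(peak_per_window(jittered, 0.0001))   # burst is spread across several windows
```

Spreading the burst lowers the peak load any one window sees, which is the same goal SRD pursues structurally by spraying a flow over many paths.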

Also useful outside storage

SRD is also exposed to EC2 guest networking as ENA Express (systems/ena-express), where it transparently accelerates guest TCP by carrying it over SRD and inheriting its multi-path and low-queueing properties.
