
AWS Scalable Reliable Datagram (SRD)¶

SRD is Amazon's in-house, non-TCP datacenter transport protocol. It is described in the 2022 AWS HPC blog post "In the search for performance, there's more than one way to build a network", which surfaced in the High Scalability December 2022 roundup.

Design stance¶

"Focuses more on performance and less on reliability, because you know, a datacenter isn't the internet." (Paraphrase from the roundup's summary of the blog post.)

Ethernet-based transport: layered on standard datacenter Ethernet, not a new physical network.
Runs on dedicated hardware: offloaded to the Nitro card, not implemented in the host kernel.
Multipath: exploits multiple paths through the datacenter fabric in parallel.
Microsecond-level retries: the retry timeout is set to the measured datacenter RTT (hundreds of microseconds), not to the seconds-scale timeouts that TCP inherits from WAN tuning.
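
To make the retry-timeout point concrete, here is a minimal sketch of how the timeout alone shapes tail latency. It is a toy model, not SRD's algorithm: it assumes every attempt is dropped independently with a fixed probability, that each drop costs exactly one fixed timeout before the resend, and that there is no backoff. The function names and all numbers (50 µs RTT, 1% loss, 200 µs versus 200 ms timeouts) are hypothetical and chosen only for illustration.

```python
"""Toy model of how the retry timeout (RTO) shapes tail latency.

Assumptions (not taken from AWS): each attempt is dropped independently
with probability loss_prob, each drop costs one fixed timeout before the
resend, and there is no backoff. All numbers are illustrative.
"""


def drops_at_quantile(loss_prob: float, quantile: float) -> int:
    """Smallest number of drops k such that P(delivered within k retries) >= quantile."""
    if not 0.0 <= loss_prob < 1.0:
        raise ValueError("loss_prob must be in [0, 1)")
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be in (0, 1)")
    drops = 0
    # P(delivered within `drops` retries) = 1 - loss_prob ** (drops + 1)
    while 1.0 - loss_prob ** (drops + 1) < quantile:
        drops += 1
    return drops


def latency_at_quantile_us(rtt_us: int, rto_us: int,
                           loss_prob: float, quantile: float) -> int:
    """Completion latency at the given quantile: one RTO per drop, plus one RTT."""
    return drops_at_quantile(loss_prob, quantile) * rto_us + rtt_us


if __name__ == "__main__":
    RTT_US = 50   # hypothetical datacenter round trip
    LOSS = 0.01   # hypothetical per-attempt drop probability
    for label, rto_us in [("datacenter-scale timeout", 200),
                          ("WAN-scale timeout", 200_000)]:
        p50 = latency_at_quantile_us(RTT_US, rto_us, LOSS, 0.5)
        p999 = latency_at_quantile_us(RTT_US, rto_us, LOSS, 0.999)
        print(f"{label}: P50 = {p50} us, P99.9 = {p999} us")
```

In this model the median is identical for both timeouts, because most attempts succeed on the first try. The difference shows up only in the tail, where a single drop adds one full timeout to the completion time.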

Primary workload: EBS¶

The canonical SRD use case is EBS tail-latency reduction. The roundup's framing is important: "average latency doesn't matter for data." What matters is the P99 / P999 tail, because a single slow block read or write stalls the entire application while it waits on that I/O. SRD targets the tail specifically.
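
The gap between average and tail can be illustrated with a small worked example. The sketch below uses a made-up latency sample (not EBS measurements) in which 0.2% of I/Os are slow, and computes the mean alongside nearest-rank percentiles. It also estimates how often an operation that waits on many I/Os hits at least one slow one, assuming the I/Os are independent. The helper names and the workload shape are hypothetical.

```python
"""Mean versus tail percentiles on a synthetic latency sample.

The sample is invented for illustration: 9,980 I/Os at 1 ms and
20 I/Os at 100 ms. It does not represent measured EBS behavior.
"""


def percentile_nearest_rank(samples, per_mille: int) -> float:
    """Nearest-rank percentile; per_mille=990 means P99, 999 means P99.9."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < per_mille <= 1000:
        raise ValueError("per_mille must be in 1..1000")
    ordered = sorted(samples)
    # Integer ceiling of len * per_mille / 1000, avoiding float rounding.
    rank = -(-len(ordered) * per_mille // 1000)
    return ordered[rank - 1]


def prob_any_slow(slow_fraction: float, n_ios: int) -> float:
    """Probability that at least one of n independent I/Os lands in the slow tail."""
    return 1.0 - (1.0 - slow_fraction) ** n_ios


if __name__ == "__main__":
    latencies_ms = [1.0] * 9_980 + [100.0] * 20

    mean_ms = sum(latencies_ms) / len(latencies_ms)
    print(f"mean  = {mean_ms:.3f} ms")
    print(f"P50   = {percentile_nearest_rank(latencies_ms, 500)} ms")
    print(f"P99   = {percentile_nearest_rank(latencies_ms, 990)} ms")
    print(f"P99.9 = {percentile_nearest_rank(latencies_ms, 999)} ms")

    # An operation that waits on 100 such I/Os (hypothetical workload).
    print(f"P(at least one slow I/O in 100) = {prob_any_slow(0.002, 100):.3f}")
```

In this sample the mean and P99 both look healthy, while P99.9 is two orders of magnitude worse. Under the independence assumption, an operation that waits on 100 such I/Os hits the slow tail far more often than the 0.2% per-I/O rate would suggest.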

Why it validates the Homa thesis¶

SRD is the clearest public evidence that the Homa thesis ("TCP is wrong for the datacenter") matches real hyperscaler production experience. Amazon shipped a non-TCP transport, not as a research exercise but as the default transport for EBS, because it measured TCP's tail-latency cost in real money.

Why it shows up on this wiki¶

TCP-in-datacenter debate — SRD is the production counterpoint to the Homa paper.
Nitro-card offload — canonical example of Nitro's role as a hardware extension point for AWS-internal network innovation.
EBS tail-latency engineering — documents one of the concrete levers AWS pulls to hit its EBS SLO.

Seen in¶

sources/2022-12-02-highscalability-stuff-the-internet-says-on-scalability-for-december-2nd-2022