Homa transport protocol¶
Homa is a proposed TCP replacement for datacenter RPC workloads, designed by John Ousterhout's group at Stanford. The 2022 position paper "It's Time to Replace TCP in the Datacenter" (arXiv:2210.00714) was surfaced in the December 2022 High Scalability roundup.
Thesis¶
From the paper's abstract:
"TCP's problems are too fundamental and interrelated to be fixed; the only way to harness the full performance potential of modern networks is to introduce a new transport protocol into the datacenter. Homa demonstrates that it is possible to create a transport protocol that avoids all of TCP's problems."
The argument: TCP was designed for the wide-area Internet (unknown topology, long RTTs, diverse middleboxes, byte-stream semantics). Datacenter networks are different enough that paying the cost of TCP's assumptions is no longer sensible.
Key design choices¶
- Message-oriented, not byte-stream — each RPC is a distinct message, not a chunk of a TCP stream.
- Receiver-driven congestion control — the receiver pulls data rather than the sender pushing speculatively.
- Multipath — exploit the multiple parallel paths typically available in a datacenter fabric, rather than pinning one flow to one path.
- No in-order delivery across messages — ordering is per-message, not per-connection, eliminating head-of-line blocking.
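The receiver-driven model above can be sketched in a few lines: senders transmit one RTT's worth of bytes unscheduled, and the receiver doles out grants for the rest, favoring the message with the fewest remaining bytes (the SRPT-style priority Homa uses to keep short RPCs fast). This is an illustrative simulation, not Homa's wire protocol; the constants `RTT_BYTES` and `GRANT_SIZE` and all class names are assumptions for the sketch.

```python
import heapq
from dataclasses import dataclass, field

RTT_BYTES = 10_000   # assumed: unscheduled bytes a sender may blind-transmit (~1 RTT)
GRANT_SIZE = 1_500   # assumed: bytes per grant (~one packet)

@dataclass(order=True)
class PendingMessage:
    remaining: int                           # heap key: fewest remaining bytes wins (SRPT)
    msg_id: int = field(compare=False)

class Receiver:
    """Receiver-driven scheduler: the receiver pulls data via grants
    instead of senders pushing speculatively."""
    def __init__(self):
        self.heap = []  # min-heap ordered by remaining bytes

    def on_new_message(self, msg_id, size):
        # First RTT's worth arrives unscheduled; the rest must be granted.
        unscheduled = min(size, RTT_BYTES)
        if size > unscheduled:
            heapq.heappush(self.heap, PendingMessage(size - unscheduled, msg_id))
        return unscheduled  # bytes the sender may transmit immediately

    def next_grant(self):
        # Grant to the shortest remaining message; short RPCs never wait
        # behind long ones (no head-of-line blocking across messages).
        if not self.heap:
            return None
        msg = heapq.heappop(self.heap)
        grant = min(GRANT_SIZE, msg.remaining)
        msg.remaining -= grant
        if msg.remaining > 0:
            heapq.heappush(self.heap, msg)
        return (msg.msg_id, grant)
```

A short message that arrives after a large one still gets granted first, which is the point: ordering and scheduling are per-message, so a 12 KB RPC is not queued behind a 50 KB transfer already in flight.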
Parallel industry work (confirming the thesis direction)¶
- AWS SRD (Scalable Reliable Datagram) — Amazon's non-TCP datacenter transport, running on dedicated Nitro hardware; reduces EBS tail latency. Multipath, with microsecond-level retries, tuned for performance over in-order delivery because "a datacenter isn't the internet."
- Google Snap — microkernel-approach host networking.
- Google Aquila — unified low-latency datacenter fabric.
- Google CliqueMap — RMA-based distributed cache that bypasses the OS kernel for network I/O.
What the wiki cares about¶
- Homa is a research position, not a production outcome. No major datacenter has displaced TCP fleet-wide as of 2022.
- The parallel production systems at AWS and Google validate the direction: datacenters are actively building non-TCP transports for specific workloads (storage, distributed cache, RPC).
- The transferable concept is concepts/multi-path-datacenter-transport: the assumptions that make TCP good on the open Internet (unknown topology, diverse middleboxes, single-path flows) are not datacenter assumptions.