Serialization tax in proxy path¶
Serialization tax in proxy path is the latency cost imposed on every request by an in-path proxy that must deserialize + re-serialize the payload to make a routing decision. The tax is proportional to payload size, is paid per-hop, and is amplified at the tail: a proxy's p99 deserialize latency becomes a contributor to the overall p99 latency of every downstream service.
Canonicalised on the wiki by Netflix's 2026-05-01 Model-Serving post, where Switchboard's "10–20ms of latency due to serialization-deserialization operations, depending on payload size" is one of three named pains that drove the architecture shift to Lightbulb + Envoy.
The mechanism¶
A routing proxy typically needs to make a decision based on some subset of the request. If the routing-relevant data is in the payload (not headers), the proxy must:
1. Receive bytes from the client
2. Deserialize the payload into the proxy's in-memory representation (gRPC protobuf decode, JSON parse, etc.)
3. Inspect the relevant fields to decide where to route
4. Re-serialize the payload to forward to the backend
5. Send bytes to the backend
Steps (2) and (4) are the tax. Both scale with payload size. For small payloads the tax is small; for large payloads (images, feature vectors, long JSON blobs) it becomes the dominant in-proxy latency.
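The steps above can be sketched as a toy payload-routing proxy. This is a minimal illustration, not Netflix's implementation: JSON stands in for the protobuf encoding, and the `abCell` field and backend addresses are invented.

```python
import json

# Hypothetical routing table: A/B cell -> backend address.
BACKENDS = {"cell-a": "backend-a:8080", "cell-b": "backend-b:8080"}

def route_by_payload(raw: bytes) -> tuple[str, bytes]:
    """Toy payload-routing proxy: pays the serialization tax on every request."""
    request = json.loads(raw)                 # step 2: deserialize (cost scales with payload size)
    backend = BACKENDS[request["abCell"]]     # step 3: inspect the routing-relevant field
    forwarded = json.dumps(request).encode()  # step 4: re-serialize (cost scales with payload size)
    return backend, forwarded                 # step 5: ready to send to the backend

# A payload where the routing field is a tiny fraction of the bytes parsed.
payload = json.dumps({"abCell": "cell-b", "features": list(range(1000))}).encode()
backend, forwarded = route_by_payload(payload)
```

Steps 2 and 4 touch every byte of the payload even though the decision needed only `abCell`.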
Netflix numbers¶
From the 2026-05-01 post:
- 10–20ms per request at Switchboard, depending on payload size.
- "[Switchboard] further exposes a request to tail latency amplification" — the proxy's tail latency compounds with the backend's tail latency.
- "[Payloads] could be quite large. Needing to deserialize and then re-serialize the payload as it flowed through Switchboard to make a routing decision was a significant contributor to latency and increased serving costs."
The architectural remedy¶
If the routing decision can be made from a small piece of metadata (e.g. a header, a cookie, a URL path), the serialization tax disappears. Two architectural moves enable this:
- Route on headers, not payload. HTTP headers can be read without touching the payload; modern proxies (Envoy, Nginx, HAProxy) have zero-copy forwarding paths for header-routed requests.
- Move routing-relevant decisions out of the proxy. A separate resolver service computes the routing key from request context + experiment state, then the client sets the header and hits the proxy directly with the payload.
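The first move can be sketched against the same invented routing table. The `x-routing-key` header name is an assumption standing in for a Lightbulb-style routing header; the point is that the body is forwarded as opaque bytes and never parsed.

```python
# Hypothetical routing table; "x-routing-key" stands in for a routing-key header.
BACKENDS = {"cell-a": "backend-a:8080", "cell-b": "backend-b:8080"}
DEFAULT_BACKEND = "backend-a:8080"

def route_by_header(headers: dict, body: bytes) -> tuple:
    """Header-routing proxy: reads only metadata; the body passes through untouched."""
    backend = BACKENDS.get(headers.get("x-routing-key", ""), DEFAULT_BACKEND)
    return backend, body  # no deserialize, no re-serialize: the tax disappears

backend, body = route_by_header({"x-routing-key": "cell-b"}, b'{"features": [1, 2, 3]}')
```

Because the body is never decoded, the proxy's cost is independent of payload size.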
Netflix's Lightbulb + Envoy architecture applies both moves:
Lightbulb produces a routingKey (header) and an ObjectiveConfig (body), the client assembles both into the real request, and Envoy routes on the header without parsing the body. See systems/netflix-lightbulb and patterns/separate-routing-from-model-selection.
Generalised to other patterns¶
The serialization tax manifests in several other places on the wiki:
- concepts/connection-multiplexing-over-http — if every HTTP/2 frame is re-framed at a proxy, the tax applies per-frame.
- concepts/ssl-handshake-as-per-request-tax — sibling tax at the TLS layer; different mechanism, same "per-request overhead limited by connection reuse" framing.
- concepts/zero-copy-data-sharing-protocol — the opposite end of the design space: formats designed to eliminate serde costs.
- patterns/zero-copy-sendfile-broker — Kafka broker's optimisation to avoid the equivalent tax at the storage layer.
When the tax is acceptable¶
Not every in-path proxy suffers it. Cases where the serialization tax is small or amortised:
- Small payloads. A proxy routing on 1 KB requests pays microseconds per hop.
- Sticky connection reuse. Long-lived gRPC / HTTP/2 streams amortise per-connection serialization state across many requests, though per-request payload serde is unaffected.
- Routing decisions derivable from metadata only (URL path, headers, query string). If the proxy doesn't need to parse the body, the tax doesn't apply.
Netflix hit the tax because model-serving payloads can be large (feature vectors, lists of titles, full transaction objects) and the A/B-cell and context-aware routing decisions needed to inspect the body.
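A rough way to see the payload-size scaling is a toy micro-benchmark of one decode+encode round trip. This uses JSON rather than Netflix's protobuf path, and the payload shapes are invented; only the relative scaling is the point.

```python
import json
import time

def serde_tax(payload: dict, reps: int = 5) -> float:
    """Best-of-N wall time for one deserialize + re-serialize round trip, in seconds."""
    raw = json.dumps(payload).encode()
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        json.dumps(json.loads(raw)).encode()  # the per-hop tax: decode then re-encode
        best = min(best, time.perf_counter() - t0)
    return best

small = {"abCell": "cell-a", "note": "x" * 1024}           # ~1 KB request
large = {"abCell": "cell-a", "features": [0.5] * 200_000}  # ~1 MB of JSON
```

On typical hardware the ~1 KB payload costs microseconds per hop while the ~1 MB payload costs orders of magnitude more, which is the "proportional to payload size" claim in miniature.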
Seen in¶
- sources/2026-05-01-netflix-state-of-routing-in-model-serving — canonical quantified instance at 10–20ms per request at Switchboard. One of the three named pains that drove the Lightbulb + Envoy architecture. The "separate model inputs from request metadata" design principle is directly motivated by this tax.
Related¶
- concepts/tail-latency-at-scale
- concepts/connection-multiplexing-over-http
- concepts/ssl-handshake-as-per-request-tax
- concepts/zero-copy-data-sharing-protocol
- systems/netflix-switchboard
- systems/netflix-lightbulb
- systems/envoy
- patterns/separate-routing-from-model-selection
- patterns/zero-copy-sendfile-broker