
PATTERN

SLO-aware early response

SLO-aware early response is the server-side pattern of stopping work mid-request — flushing a partial response plus a continuation token — when the server projects that it will miss the request's end-to-end latency SLO if it keeps going. Instead of letting the client time out on a fully-processed-but-late response (which converts completed work into a wasted attempt), the server responds early and hands responsibility back to the caller.

The problem

Byte-size-pagination systems (see concepts/byte-size-pagination) promise their callers a predictable latency SLO per page. But real-world page fills are lumpy:

  • A page may hit a partition with thousands of small items — no single item is slow, but collectively they take too long.
  • Adaptive pagination (see concepts/adaptive-pagination) helps the typical case but can't fix adversarial distributions.
  • A backing-store slow patch (GC pause, compaction spike) can delay even normal pages.

Without the pattern, the server keeps grinding, the client eventually hits its RPC deadline and surfaces a timeout error, and all the work the server just did is wasted — plus the client has no partial results to resume from.

The Netflix KV DAL instance

"In addition to adaptive pagination, a mechanism is in place to send a response early if the server detects that processing the request is at risk of exceeding the request's latency SLO."

"For example, let us assume a client submits a GetItems request with a per-page limit of 2 MiB and a maximum end-to-end latency limit of 500ms. While processing this request, the server retrieves data from the backing store. This particular record has thousands of small items so it would normally take longer than the 500ms SLO to gather the full page of data. If this happens, the client would receive an SLO violation error, causing the request to fail even though there is nothing exceptional. To prevent this, the server tracks the elapsed time while fetching data. If it determines that continuing to retrieve more data might breach the SLO, the server will stop processing further results and return a response with a pagination token." (Source: sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer)

Why the client gets to make the next call

The server hands back a page token + whatever results it has so far. The client — with its full business-level context — decides:

  • Continue immediately? If the user is actively paginating a list.
  • Accept partial + stop? If the first page's contents are enough to answer the user's question.
  • Back off + retry later? If the deadline miss was a flaky-backend symptom and the calling service has time.
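A minimal client-side sketch of the first two choices, assuming a hypothetical fetch function that returns a page plus a continuation token (empty when the server is done); backing off and retrying later would be a third branch, omitted for brevity:

```go
package main

import "fmt"

// collect keeps following continuation tokens, but stops as soon as business
// context is satisfied (here: "enough" results answer the user's question).
func collect(fetch func(token string) (items []string, next string), enough int) []string {
	var all []string
	token := ""
	for {
		page, next := fetch(token)
		all = append(all, page...)
		if next == "" || len(all) >= enough {
			return all // server is done, or partial results already suffice
		}
		token = next // continue immediately from where the server stopped
	}
}

func main() {
	// Fake paginated server: three pages chained by tokens.
	pages := map[string]struct {
		items []string
		next  string
	}{
		"":   {[]string{"a", "b"}, "t1"},
		"t1": {[]string{"c", "d"}, "t2"},
		"t2": {[]string{"e"}, ""},
	}
	fetch := func(token string) ([]string, string) {
		p := pages[token]
		return p.items, p.next
	}
	fmt.Println(collect(fetch, 3)) // satisfied after the second page: [a b c d]
}
```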

Combined with gRPC deadlines on the client side, the client "is smart enough not to issue further requests, reducing useless work" — deadlines propagate through the RPC chain, so upstream callers that no longer need the data don't trigger work on the downstream DAL.

How the server decides

From the post, the key quantities are:

  • Elapsed time so far (measured since request started).
  • Estimated time to finish the page (from adaptive-pagination state + what's been fetched).
  • Request's end-to-end SLO (from the request deadline or the server-signaled target).

If elapsed + estimated_remaining > SLO_budget, return early. The specific heuristic isn't given in the post, but the primitive signals are clear.

Trade-offs

  • Clients must handle partial pages as normal. Callers that assume "a page has my whole target set" or that discount the next_page_token for small-looking pages will misbehave. The API shape must make this non-optional.
  • Server still does some wasted work — the early-return path flushes whatever it has collected, but serialisation and I/O up to that point are paid even if the client abandons the partial page.
  • False positives waste page calls. If the time-remaining estimator over-predicts, the server cuts short when it didn't need to; the client pays the extra round trip.
  • Needs accurate deadline propagation. If the server doesn't know the client's deadline (no grpc-timeout header), it can only use its own default SLO — missing the case where a higher-tier RPC has a tighter budget.
  • Incompatible with "all or nothing" semantics. If a caller must have the full result to make progress (rare in pagination, common in transactional reads), early return just shifts the timeout-vs-partial trade.

Sibling patterns

  • Load shedding — the server returns 429/503 when it knows it can't finish. It discards at the request level; this pattern discards at the work level within one request.
  • Server-side deadlines (gRPC ctx.Done()) — the kernel primitive. SLO-aware early response is a structured, user-visible application of it.
  • Speculative execution with cancel-on-return — the client issues hedged calls and cancels the losers. Different tactic, same tail-latency goal.
