
PATTERN

Buffered resumable inference stream

Definition

Buffered resumable inference stream is the pattern of having an AI gateway buffer a streaming LLM response as it's generated, independently of the caller's lifetime, so that a caller whose connection, session, or process is interrupted mid-stream can reconnect and retrieve the remaining response from the buffer: without re-running the inference, without paying twice for the same output tokens, and without user-visible restart latency.

It is the concrete realisation of concepts/resilient-inference-stream: the stream lives on the gateway; the caller is a (possibly transient) consumer of the buffer.

Mechanism

  1. Caller issues an inference request with streaming enabled.
  2. Gateway opens the upstream provider stream, reads tokens as they arrive, writes them to a gateway-side buffer, and forwards them to the caller.
  3. If the caller disconnects:
     • Gateway continues consuming from the upstream, appending to the buffer.
     • The stream completes or errors on the upstream regardless of caller presence.
  4. If the caller reconnects with the stream's identifier:
     • Gateway replays buffered tokens from the last acknowledged offset (or from 0, depending on protocol).
     • Live tokens merge into the replay as they continue to arrive.
  5. When the stream ends, the caller sees a complete response; billing charges once for the underlying inference.
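The mechanism above can be sketched as a minimal in-memory gateway buffer. The names (`BufferedStream`, `read_from`) and structure are illustrative, not the post's actual implementation:

```python
import threading

class BufferedStream:
    """Gateway-side buffer. The gateway, not any one caller, owns the
    stream: tokens keep accumulating while no caller is attached."""

    def __init__(self) -> None:
        self._tokens: list[str] = []
        self._done = False
        self._lock = threading.Lock()

    def append(self, token: str) -> None:
        with self._lock:
            self._tokens.append(token)

    def finish(self) -> None:
        with self._lock:
            self._done = True

    def read_from(self, offset: int):
        """Replay buffered tokens from `offset`; returns (tokens, next_offset, done)."""
        with self._lock:
            chunk = self._tokens[offset:]
            return chunk, offset + len(chunk), self._done

# Upstream provider tokens arrive and are buffered.
stream = BufferedStream()
for tok in ["Hello", ", ", "world"]:
    stream.append(tok)

# Caller reads everything buffered so far, then "crashes" at offset 3.
first, offset, _ = stream.read_from(0)

# Gateway keeps consuming upstream while the caller is gone.
stream.append("!")
stream.finish()

# Restarted caller reconnects with its last acknowledged offset.
rest, offset, done = stream.read_from(offset)
```

The key move is that `append` and `finish` run on the gateway regardless of whether any `read_from` consumer is currently attached.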

Why it matters — the agent disconnect cost

For a long-running agent whose single reasoning turn issues a large-prompt inference and expects seconds of streamed output:

  • Without the pattern: a caller crash restarts the inference. The full prompt (potentially millions of tokens in agent workflows) is re-prefilled, time-to-first-token is paid again, and the full output is re-billed. Worst case for a 10-step agent: a crash during any single step double-bills that step.
  • With the pattern: a caller crash has no billing or latency cost. The gateway still owns the stream; the restarted caller attaches, consumes the buffered portion, and continues live. Paired with caller-side durable execution / checkpoint resume, "the end user never notices" (2026-04-16 post).

The lifetime-decoupling invariant

The pattern's load-bearing property is that the stream's lifetime is decoupled from the caller's lifetime. This inverts the naive TCP-stream model, where a disconnect means the stream is lost. Three invariants:

  1. Gateway owns the stream. The caller is a view onto gateway state, not the stream's owner.
  2. Buffer durability ≥ caller lifetime. The buffer has to outlive the caller's interruption; implementation details include TTL, persistence tier (memory vs disk vs object store), and eviction policy.
  3. Reconnect identity. The reconnecting caller has to identify its in-progress stream via a request ID, conversation token, or similar correlation key. The post doesn't describe the wire format, but the property requires it.
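Invariants 2 and 3 together suggest a retention-window store keyed on the reconnect identity. A sketch with an injected clock for determinism; all names (`TtlBufferStore`, the `req-123` key) are hypothetical:

```python
from typing import Optional

class TtlBufferStore:
    """Maps a reconnect identity (stream ID) to buffered tokens, with a
    retention window: buffers must outlive caller interruptions, after
    which the gateway may evict them."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, list[str]]] = {}  # id -> (expiry, tokens)

    def put(self, stream_id: str, tokens: list[str], now: float) -> None:
        self._store[stream_id] = (now + self.ttl, tokens)

    def get(self, stream_id: str, now: float) -> Optional[list[str]]:
        entry = self._store.get(stream_id)
        if entry is None or now > entry[0]:
            self._store.pop(stream_id, None)  # expired: reconnect finds nothing
            return None
        return entry[1]

store = TtlBufferStore(ttl_seconds=60.0)
store.put("req-123", ["partial", "output"], now=0.0)
within = store.get("req-123", now=30.0)    # inside the retention window
expired = store.get("req-123", now=120.0)  # past the window: evicted
```

The `within` / `expired` pair is exactly the retention-window trade-off named under Trade-offs: a shorter TTL moves more legitimate reconnects into the `expired` branch.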

Pairs with Agents SDK checkpointing

The 2026-04-16 post stitches this pattern explicitly to caller-side checkpointing:

"Combined with the Agents SDK's built-in checkpointing, the end user never notices."

The two tiers together:

  • Caller-side checkpointing — agent workflow state serialises after each reasoning step. Restart-after-crash resumes from the last checkpoint.
  • Gateway-side stream buffering (this pattern) — in-flight inference calls survive the caller's restart and feed the resumed workflow.

End-to-end, the agent becomes crash-resilient mid-inference, not just between-inferences — a qualitatively stronger guarantee than step-level checkpointing alone.
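How the two tiers compose on the caller side can be sketched as: load the last checkpoint, and if an inference was in flight, reattach instead of re-issuing. `gateway.reconnect` and the `Checkpoint` fields are hypothetical, not the Agents SDK's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Checkpoint:
    step: int                         # last completed reasoning step
    pending_stream_id: Optional[str]  # inference in flight when we crashed
    acked_offset: int                 # last token offset the caller saw

def resume_after_crash(checkpoint: Checkpoint, gateway) -> list[str]:
    """Restart path: reattach to the in-flight stream rather than
    re-issuing the inference (which would re-prefill and re-bill)."""
    if checkpoint.pending_stream_id is not None:
        return gateway.reconnect(checkpoint.pending_stream_id,
                                 offset=checkpoint.acked_offset)
    return []  # nothing in flight: the next step issues a fresh request

class FakeGateway:
    """Stand-in for the gateway client, holding one buffered stream."""
    def __init__(self, buffers: dict[str, list[str]]) -> None:
        self.buffers = buffers

    def reconnect(self, stream_id: str, offset: int) -> list[str]:
        return self.buffers[stream_id][offset:]

gw = FakeGateway({"stream-7": ["a", "b", "c", "d"]})
cp = Checkpoint(step=3, pending_stream_id="stream-7", acked_offset=2)
tail = resume_after_crash(cp, gw)  # only the tokens the caller never saw
```

Note the division of labour: the checkpoint carries only identity and offset; the tokens themselves live on the gateway.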

Trade-offs

  • Gateway memory / storage footprint. Buffering full streaming responses at catalog scale (~20M requests/month and up) is non-trivial. Retention window, buffer-size caps, and compression / spill-to-disk policies are the cost-control knobs. The 2026-04-16 post doesn't disclose these.
  • Retention-window trade-off. Too-short TTL → legitimate reconnects find empty buffers; too-long TTL → storage cost dominates. Likely tied to request-timeout semantics.
  • Streaming protocol constraints. The pattern requires the caller-to-gateway protocol to support reconnect-with-resume-offset (SSE with Last-Event-ID, chunked HTTP with a Range header, or a bespoke protocol). Naive SSE without Last-Event-ID supports only full replay, not partial.
  • Interaction with failover. Failover mid-stream is subtle: if provider A emitted 100 tokens then failed, does the gateway hand provider B the remaining work from those 100 tokens' state? Or does it start over and replay only the final merged stream to the caller? The post describes the happy path without productising this interaction.
  • Observability. A stream "completing" on the gateway decouples from a caller "receiving" it — an apparent success from the gateway's point of view could be an incomplete consumption from the caller's point of view. Gateway-side monitoring has to distinguish stream-generated from stream-fully-delivered.
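For the SSE variant of the protocol constraint, resume can ride on the standard `Last-Event-ID` request header if the gateway stamps each event's `id:` field with its token offset. A sketch under that assumption (the offset encoding is illustrative, not specified by the post):

```python
from typing import Optional

def resume_offset(last_event_id: Optional[str]) -> int:
    """Map SSE's Last-Event-ID header to a buffer offset. Without the
    header (naive SSE), the only safe behaviour is full replay from 0."""
    if last_event_id is None:
        return 0
    try:
        return int(last_event_id) + 1  # ids are 0-based token offsets (assumption)
    except ValueError:
        return 0

def format_sse(tokens: list[str], start: int) -> str:
    """Emit buffered tokens as SSE events, each carrying its offset as
    `id:` so the next reconnect can resume past it."""
    return "".join(f"id: {start + i}\ndata: {tok}\n\n"
                   for i, tok in enumerate(tokens))

# Reconnecting caller last saw event id 41, so replay starts at offset 42.
offset = resume_offset("41")
frame = format_sse(["Hello", "world"], start=offset)
```

The malformed-header fallback to 0 is deliberate: full replay is wasteful but correct, whereas guessing an offset risks silently dropping tokens.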

Cost semantics

The pattern's economic invariant: output tokens are billed once per inference, regardless of the number of reconnect attempts. This is the caller's core benefit — otherwise a buggy reconnect loop would silently amplify bills.

For the gateway operator, this means:

  • The upstream provider bills the gateway once (per inference).
  • The gateway bills the caller once (per logical request).
  • Reconnect attempts are free for the caller and cost the gateway only buffer storage.
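The once-per-logical-request invariant amounts to idempotent billing keyed on the request's correlation ID. A minimal sketch with illustrative names:

```python
class BillingLedger:
    """Charge each logical request exactly once, no matter how many times
    its caller reconnects and replays the buffered stream."""

    def __init__(self) -> None:
        self._billed: set[str] = set()
        self.charges: list[tuple[str, int]] = []

    def charge_once(self, request_id: str, output_tokens: int) -> bool:
        """Returns True if a new charge was recorded, False on a replay."""
        if request_id in self._billed:
            return False
        self._billed.add(request_id)
        self.charges.append((request_id, output_tokens))
        return True

ledger = BillingLedger()
charged = ledger.charge_once("req-123", 512)   # stream completed: bill once
replayed = ledger.charge_once("req-123", 512)  # reconnect replay: no new charge
```

This is what makes a buggy reconnect loop safe for the caller: the replay branch returns without touching the ledger.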
