CONCEPT Cited by 1 source
Resilient inference stream¶
Definition¶
A resilient inference stream is an LLM-inference response stream whose lifetime is decoupled from the lifetime of the caller that issued it. If the caller (agent, browser tab, Durable Object instance) disconnects mid-stream — crash, tab close, DO restart, network partition — the stream continues generating on the provider side and is buffered at an intermediary (the gateway) so the caller can reconnect and resume retrieving tokens from where it left off, without re-inferencing and without double-paying for output tokens.
Why it matters — the agent-reliability budget¶
The 2026-04-16 Cloudflare AI Platform post frames the argument explicitly:
"An agent might chain ten calls together to complete a single task and suddenly, a single slow provider doesn't add 50ms, it adds 500ms. One failed request isn't a retry, but suddenly a cascade of downstream failures."
A 10-step agent roughly multiplies the per-step failure probability by 10× (exactly: a chain of 10 independent steps fails with probability 1 − (1 − p)^10 ≈ 10p for small p). If the substrate is "restart the step on any disconnect" — the classic retry model — each retry pays the full inference cost again (potentially millions of prompt tokens for long conversations) and the user-visible TTFT doubles. The resilient-stream invariant cuts this: a disconnect does not invalidate the in-flight stream; re-attach and continue.
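That amplification is easy to check numerically. A quick sanity check, assuming an illustrative 1% independent per-step failure rate:

```typescript
// Probability that an n-step chain fails, given independent per-step
// failure probability p. For small p this is approximately n * p.
function chainFailureProbability(p: number, n: number): number {
  return 1 - Math.pow(1 - p, n);
}

// Illustrative numbers (not from the post): a 1% per-step failure rate
// becomes ~9.6% for a 10-step agent, close to the 10x approximation.
const perStep = 0.01;
const tenStepChain = chainFailureProbability(perStep, 10);
```

The linear approximation understates slightly less as p grows, but for realistic per-call failure rates the "10× exposure" framing holds.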
Three properties¶
- Disconnect-survival. The stream is produced against the provider regardless of whether the caller is still connected.
- Buffer-and-replay. The intermediary persists the emitted tokens as they stream, so a reconnecting caller can replay from offset 0 (or from a client-known offset) without a second inference call.
- Exactly-once billing. Because the stream was generated once, the caller pays once — even across N reconnect attempts.
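Taken together, the three properties reduce the intermediary to a small interface: an append-only token log with offset-addressed reads. A minimal in-memory sketch (illustration only, not the AI Gateway's actual implementation):

```typescript
// Sketch of the buffer-and-replay property. The buffer's lifetime is
// independent of any reader, so a reconnecting caller replays from a
// known offset without a second inference call.
class StreamBuffer {
  private tokens: string[] = [];
  private done = false;

  // Called by the gateway as provider tokens arrive; runs regardless of
  // whether any caller is connected (disconnect-survival).
  append(token: string): void {
    if (this.done) throw new Error("stream already finished");
    this.tokens.push(token);
  }

  finish(): void {
    this.done = true;
  }

  // Called by a (re)connecting caller: replay everything from `offset`
  // onward (buffer-and-replay). Generation happened once, so billing
  // happened once, however many times this is invoked.
  readFrom(offset: number): { tokens: string[]; nextOffset: number; done: boolean } {
    return {
      tokens: this.tokens.slice(offset),
      nextOffset: this.tokens.length,
      done: this.done,
    };
  }
}
```

A caller that crashed after consuming three tokens calls `readFrom(3)` on reconnect and receives only the remainder; reads are non-destructive, so repeated retrievals of the same range are idempotent by construction.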
Canonical mechanism — AI Gateway stream buffering¶
The 2026-04-16 post describes the Cloudflare instance:
"If you're building long-running agents with Agents SDK, your streaming inference calls are also resilient to disconnects. AI Gateway buffers streaming responses as they're generated, independently of your agent's lifetime. If your agent is interrupted mid-inference, it can reconnect to AI Gateway and retrieve the response without having to make a new inference call or paying twice for the same output tokens. Combined with the Agents SDK's built-in checkpointing, the end user never notices."
Two-tier pairing:
- Caller-side checkpointing — the Agents SDK (evolving toward Project Think's durable fibers) serialises agent state after each reasoning step, so after a crash the agent resumes from where it was in the workflow, not from scratch.
- Gateway-side stream buffering — the AI Gateway holds the active inference stream so the restarted caller finds its stream still running.
Together they give end-to-end "the end user never notices" continuity across a caller crash.
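The pairing can be sketched as a caller-side resume loop in which the checkpoint records the last consumed stream offset. All names below, including the `Gateway` interface, are hypothetical, not the Agents SDK's real API:

```typescript
// Hypothetical caller-side resume loop (illustrative names, not the real
// Agents SDK interface). The agent checkpoints the last offset it consumed;
// after a crash it reattaches to the same buffered stream and asks the
// gateway only for tokens past that offset.
interface Checkpoint {
  streamId: string;
  offset: number;
}

interface Gateway {
  // Returns buffered tokens from `offset` onward plus the new offset.
  read(streamId: string, offset: number): { tokens: string[]; nextOffset: number; done: boolean };
}

function resume(gateway: Gateway, checkpoint: Checkpoint, onToken: (t: string) => void): Checkpoint {
  const { streamId } = checkpoint;
  let offset = checkpoint.offset;
  for (;;) {
    // In a real client this read would await new tokens rather than spin.
    const { tokens, nextOffset, done } = gateway.read(streamId, offset);
    tokens.forEach(onToken);
    offset = nextOffset; // advance the checkpoint after each batch
    if (done) return { streamId, offset };
  }
}
```

The key design point is that the checkpoint is tiny (a stream id and an integer), while the expensive artifact, the in-flight inference, lives at the gateway.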
Contrast with existing patterns¶
- Client-side retry on disconnect — the traditional pattern. Re-issues the full inference request. Pays 2× token cost, pays 2× latency (full prefill + first-token again), and is visible to the user as a latency spike. The resilient-stream pattern eliminates all three.
- Automatic provider failover — handles provider-side failures (upstream-outage-driven retry). Resilient streams handle caller-side failures (caller-crash-driven reconnect). The two compose: the gateway is the reliability substrate on both sides.
- Checkpointable workflow orchestration (Temporal, AWS Step Functions) — typically checkpoints at step boundaries (workflow resumes after a completed activity, not mid-activity). Resilient inference streams extend the checkpoint primitive into the middle of a single inference call, giving sub-step reconnect granularity that workflow engines on their own don't provide.
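The cost asymmetry in the first contrast is easy to make concrete. With made-up per-token prices and a hypothetical long-context agent call:

```typescript
// Illustrative cost comparison (all numbers invented): retry-on-disconnect
// re-pays the full request, while a resilient-stream reconnect pays nothing
// extra because the output was generated exactly once.
const promptTokens = 200_000; // long agent conversation (hypothetical)
const outputTokens = 2_000;
const inputPricePerToken = 3 / 1_000_000; // $3 / 1M input tokens (made-up)
const outputPricePerToken = 15 / 1_000_000; // $15 / 1M output tokens (made-up)

const oneCall = promptTokens * inputPricePerToken + outputTokens * outputPricePerToken;
const retryCost = 2 * oneCall; // classic retry: full prefill + generation again
const resilientCost = oneCall; // reconnect replays the buffered stream for free
```

The same 2× factor applies to latency: the retry re-pays the entire prefill before the first new token, which is exactly the TTFT spike the resilient stream avoids.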
What the post doesn't specify¶
- Retention window. How long is a buffered stream retained? (Seconds? The full request timeout? Configurable?)
- Memory / storage cost. Retaining full streaming responses server-side at 241B-tokens/month scale has a non-trivial footprint; pricing and buffer-size caps aren't discussed.
- Multi-reconnect semantics. If an agent reconnects three times to the same stream, are all three retrievals idempotent?
- Upstream-terminated streams. If the upstream provider itself terminates the stream mid-generation (rate limit, context-window overflow), does a reconnect return the buffered partial output plus the error, or does the caller see only the failure?
- Interaction with failover. If the gateway fails over to a different provider mid-stream (provider 1 fails, provider 2 takes over), does the caller see two stream segments, or a unified continuation, or is failover prohibited once bytes have been emitted?
These are the open edges of the property; the post describes the happy path without productising the failure-mode space.
Seen in¶
- sources/2026-04-16-cloudflare-ai-platform-an-inference-layer-designed-for-agents — canonical introduction. AI Gateway buffers streams independently of agent lifetime; combined with Agents SDK checkpointing yields end-to-end reconnect invisibility.
Related¶
- patterns/buffered-resumable-inference-stream — the concrete pattern that realises this property.
- patterns/automatic-provider-failover — the provider-side sibling reliability primitive.
- concepts/durable-execution — the caller-side sibling primitive; together they give the "never notices" property.
- concepts/time-to-first-token — the metric preserved by avoiding re-inference.
- concepts/agent-context-window — reconnect-without-reissue matters more as prompts grow.
- systems/cloudflare-ai-gateway — the buffering intermediary.
- systems/cloudflare-agents-sdk — the caller-side pair.
- systems/project-think — the next-generation caller-side pair with explicit durable-execution.