Streaming output rewrite

Pattern

Manipulate an LLM's token stream as it is being emitted, applying find-and-replace rules, long-token compression, and embedding-resolved symbol rewriting — so that by the time the stream reaches the user (or the downstream renderer), all known-correctable failures have already been corrected. Critical property: "the user never sees an intermediate incorrect state."

Canonical Vercel framing (LLM Suspense)

Vercel names the pattern LLM Suspense in v0:

"LLM Suspense is a framework that manipulates text as it streams to the user. This includes actions like find-and-replace for cleaning up incorrect imports, but can become much more sophisticated... Because this happens during streaming, the user never sees an intermediate incorrect state."

(Source: sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent)

Two disclosed applications

1. Long-token compression (URL shortening)

  • When the user uploads an attachment, v0 has a blob-storage URL that can be "hundreds of characters" long — "10s of tokens" per reference.
  • Before the LLM sees it: replace the long URL with a short placeholder in the system prompt / tool results.
  • After the LLM emits it: expand the short placeholder back to the full URL in the streamed output.
  • Net effect: model reads and writes fewer tokens, cutting cost and latency.
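The bidirectional map can be sketched in a few lines. This is a minimal illustration, not the v0 implementation (which is not public); the class and placeholder format are hypothetical:

```python
import itertools


class UrlCompressor:
    """Maps long blob-storage URLs to short placeholders and back.

    The substitution map lives server-side only: the model sees
    placeholders, the user sees full URLs.
    """

    def __init__(self):
        self._counter = itertools.count(1)
        self._short_to_long = {}
        self._long_to_short = {}

    def shorten(self, text: str, urls: list[str]) -> str:
        # Before the model sees the prompt: swap each long URL for a placeholder.
        for url in urls:
            if url not in self._long_to_short:
                placeholder = f"__URL_{next(self._counter)}__"  # hypothetical format
                self._long_to_short[url] = placeholder
                self._short_to_long[placeholder] = url
            text = text.replace(url, self._long_to_short[url])
        return text

    def expand(self, chunk: str) -> str:
        # After the model emits tokens: restore the full URL in the stream.
        for placeholder, url in self._short_to_long.items():
            chunk = chunk.replace(placeholder, url)
        return chunk
```

Both directions are plain string substitution, so the cost per chunk stays well inside a per-token latency budget.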

2. Embedding-based symbol rewriting (icon names)

For a library whose namespace churns (weekly for systems/lucide-react), the LLM frequently emits icon names that don't exist. Streaming rewrite:

  1. Analyse the library's actual exports at runtime.
  2. Embed all live exports in a vector DB.
  3. If emitted symbol exists → pass through.
  4. If not → embedding-search for nearest real export.
  5. Rewrite the import line during streaming.
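The resolve step (exists → pass through, else nearest-neighbour lookup) can be sketched as below. A toy character-bigram similarity stands in for the real embedding model and vector DB; all names here are illustrative:

```python
import math
from collections import Counter


def embed(name: str) -> Counter:
    # Stand-in embedding: character-bigram counts of the lowercased name.
    # A real system would use a learned embedding model instead.
    s = name.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SymbolRewriter:
    def __init__(self, live_exports: set[str]):
        self.live = live_exports
        # Pre-embed every live export so the lookup at stream time is
        # a deterministic in-memory search, not a model call.
        self.index = {name: embed(name) for name in live_exports}

    def resolve(self, symbol: str) -> str:
        if symbol in self.live:  # exists -> pass through
            return symbol
        q = embed(symbol)        # else -> nearest real export
        return max(self.index, key=lambda name: cosine(self.index[name], q))
```

Because the index is built ahead of time, the per-substitution work is one embedding plus a nearest-neighbour scan, keeping it inside the sub-100 ms budget.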

Completes in <100 ms per substitution, no further model calls required (the rewrite is deterministic — it's an embedding lookup, not an LLM call).

Worked example: model emits `import { VercelLogo } from 'lucide-react'` → Suspense emits `import { Triangle as VercelLogo } from 'lucide-react'`.

"In production, these simple rules handle variations in quoting, formatting, and mixed import blocks. Because this happens during streaming, the user never sees an intermediate incorrect state."
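A find-and-replace rule tolerant of quote style and aliasing might look like the following sketch. The regex, the `live` set, and the `resolve` callback are hypothetical, standing in for the kind of rule the quote describes:

```python
import re

# Matches a named-import line from lucide-react with either quote style.
IMPORT_RE = re.compile(r"import\s*\{([^}]*)\}\s*from\s*(['\"])lucide-react\2")


def fix_import_line(line: str, live: set[str], resolve) -> str:
    m = IMPORT_RE.search(line)
    if not m:
        return line
    fixed = []
    for spec in m.group(1).split(","):
        spec = spec.strip()
        name = spec.split(" as ")[0].strip()
        if not name or name in live:
            fixed.append(spec)  # real export: pass through unchanged
        else:
            # Alias the real export so downstream JSX that references
            # the emitted name still compiles.
            fixed.append(resolve(name) + " as " + name)
    q = m.group(2)
    repl = "import { " + ", ".join(fixed) + " } from " + q + "lucide-react" + q
    return line[:m.start()] + repl + line[m.end():]
```

Running it on the worked example above, with a resolver that maps the unknown symbol to `Triangle`, reproduces the aliased import.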

Design properties

  1. Zero additional model calls. The rewrite layer is deterministic — find-and-replace or nearest-neighbour lookup. No small summariser, no secondary LLM. Latency stays in the microsecond-to-ms range.

  2. UX invariant: no visible intermediate broken state. The rewrite happens before the token leaves the server, not after. This separates the pattern from post-stream autofixers (patterns/deterministic-plus-model-autofixer) which run once generation is complete.

  3. Rule-composition latency bounded. Each rule applied must stay in the per-token latency budget; a rule that required a 500 ms synchronous lookup would stall the stream. Pre-embedding + in-memory vector lookup keeps per-substitution cost under 100 ms.

  4. Pre- and post- symmetry for compression. Long-URL compression is bidirectional: shorten before the model sees it (save input tokens), expand after it emits (output is user-facing). The substitution map lives on the server side only.
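Properties 1–3 suggest a composition model: each rule is a cheap string → string function, and the stream buffers just enough (e.g. one line) for rules to see whole import statements. A minimal sketch, with hypothetical names:

```python
from typing import Callable, Iterable, Iterator

Rule = Callable[[str], str]


def rewrite_stream(chunks: Iterable[str], rules: list[Rule],
                   boundary: str = "\n") -> Iterator[str]:
    """Buffer until a line boundary, apply every rule, then flush.

    Line-level buffering keeps each rule's input small and bounded
    (property 3) while letting rules see complete import lines rather
    than tokens split mid-identifier.
    """
    buf = ""
    for chunk in chunks:
        buf += chunk
        while boundary in buf:
            line, buf = buf.split(boundary, 1)
            for rule in rules:
                line = rule(line)
            yield line + boundary
    if buf:  # flush whatever remains when the stream ends
        for rule in rules:
            buf = rule(buf)
        yield buf
```

Each rule runs synchronously inside the generator, so a rule with unbounded latency would stall the stream — which is exactly why every rule must be a deterministic lookup or replacement.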

When to use

  • Failure modes that are detectable mid-stream — known bad import syntax, impossible-symbol references, known-long tokens. If detection requires seeing the whole artifact (cross-file invariants), use a post-stream autofixer instead.
  • Corrections that are deterministic — lookup, find-and-replace, embedding nearest-neighbour. If the correction requires judgment, use a fine-tuned model in a post-stream stage.
  • Targets where UX invariants require no flicker — live preview renderers (v0), code editors, chat UIs that render incrementally.

When not to use

  • Cross-file invariants: "these often involve changes across multiple files or require analyzing the abstract syntax tree (AST)." Streaming rewrite can't see the full generated program at token time; post-stream autofixers handle this.
  • Corrections requiring model judgment — e.g. "where should the QueryClientProvider be wrapped?" — this needs a placement decision, not a symbol swap.
  • Unbounded-per-token rewrite rules — if a rule's latency isn't bounded, it stalls the stream.

Seen in
