PATTERN Cited by 1 source
Streaming output rewrite¶
Pattern¶
Manipulate an LLM's token stream as it is being emitted, applying find-and-replace rules, long-token compression, and embedding-resolved symbol rewriting — so that by the time the stream reaches the user (or the downstream renderer), all known-correctable failures have already been corrected. Critical property: "the user never sees an intermediate incorrect state."
Canonical Vercel framing (LLM Suspense)¶
Vercel names the pattern LLM Suspense in v0:
"LLM Suspense is a framework that manipulates text as it streams to the user. This includes actions like find-and-replace for cleaning up incorrect imports, but can become much more sophisticated... Because this happens during streaming, the user never sees an intermediate incorrect state."
(Source: sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent)
Two disclosed applications¶
1. Long-token compression (URL shortening)¶
- When the user uploads an attachment, v0 has a blob-storage URL that can be "hundreds of characters" long — "10s of tokens" per reference.
- Before the LLM sees it: replace the long URL with a short placeholder in the system prompt / tool results.
- After the LLM emits it: expand the short placeholder back to the full URL in the streamed output.
- Net effect: model reads and writes fewer tokens, cutting cost and latency.
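A minimal sketch of the bidirectional substitution map, assuming a hypothetical placeholder scheme (`blob:N`) and a simplified URL pattern; neither detail is disclosed in the v0 source:

```typescript
// Hypothetical sketch (names and placeholder format assumed, not from v0):
// long blob-storage URLs are shortened before the model sees them and
// expanded again as tokens stream back out.
const urlToPlaceholder = new Map<string, string>();
const placeholderToUrl = new Map<string, string>();
let counter = 0;

// Called on prompt / tool-result text before it reaches the model.
function compress(text: string): string {
  return text.replace(/https:\/\/\S+blob\S+/g, (url) => {
    if (!urlToPlaceholder.has(url)) {
      const placeholder = `blob:${counter++}`;
      urlToPlaceholder.set(url, placeholder);
      placeholderToUrl.set(placeholder, url);
    }
    return urlToPlaceholder.get(url)!;
  });
}

// Called on the model's streamed output before it reaches the user.
function expand(text: string): string {
  return text.replace(/blob:\d+/g, (p) => placeholderToUrl.get(p) ?? p);
}
```

The map lives only on the server, so the model reads and writes the short form while the user only ever sees the full URL.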
2. Embedding-based symbol rewriting (icon names)¶
For a library whose export namespace churns (weekly, in the case of systems/lucide-react), the LLM frequently emits icon names that don't exist. The streaming rewrite:
- Analyse the library's actual exports at runtime.
- Embed all live exports in a vector DB.
- If emitted symbol exists → pass through.
- If not → embedding-search for nearest real export.
- Rewrite the import line during streaming.
Completes in <100 ms per substitution, no further model calls required (the rewrite is deterministic — it's an embedding lookup, not an LLM call).
Worked example: the model emits `import { VercelLogo } from 'lucide-react'`; Suspense streams `import { Triangle as VercelLogo } from 'lucide-react'`.
"In production, these simple rules handle variations in quoting, formatting, and mixed import blocks. Because this happens during streaming, the user never sees an intermediate incorrect state."
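The steps above can be sketched as follows. A plain edit-distance lookup stands in for v0's embedding nearest-neighbour search so the example stays self-contained; the export list and function names are illustrative, not from the source:

```typescript
// Illustrative live-export table (real v0 reads lucide-react's exports at runtime).
const liveExports = ["Triangle", "Circle", "Square", "ArrowRight"];

// Levenshtein distance: stand-in for the embedding similarity search.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
  return dp[a.length][b.length];
}

// Nearest real export to a hallucinated symbol.
function nearestExport(name: string): string {
  return liveExports.reduce((best, cand) =>
    editDistance(cand, name) < editDistance(best, name) ? cand : best,
  );
}

// Rewrite one emitted import line: pass real symbols through, alias the rest.
function rewriteImportLine(line: string): string {
  return line.replace(/import\s*{([^}]*)}/, (_: string, names: string) => {
    const fixed = names.split(",").map((raw: string) => {
      const name = raw.trim();
      if (liveExports.includes(name)) return name; // exists: pass through
      return `${nearestExport(name)} as ${name}`;  // missing: alias nearest
    });
    return `import { ${fixed.join(", ")} }`;
  });
}
```

Aliasing (`Triangle as VercelLogo`) rather than renaming means the rest of the generated code, which references `VercelLogo`, still compiles untouched.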
Design properties¶
- Zero additional model calls. The rewrite layer is deterministic — find-and-replace or nearest-neighbour lookup. No small summariser, no secondary LLM. Latency stays in the microsecond-to-millisecond range.
- UX invariant: no visible intermediate broken state. The rewrite happens before the token leaves the server, not after. This separates the pattern from post-stream autofixers (patterns/deterministic-plus-model-autofixer), which run once generation is complete.
- Rule-composition latency bounded. Each rule must stay within the per-token latency budget; a rule that required a 500 ms synchronous lookup would stall the stream. Pre-embedding plus in-memory vector lookup keeps per-substitution cost under 100 ms.
- Pre- and post-symmetry for compression. Long-URL compression is bidirectional: shorten before the model sees it (save input tokens), expand after it emits (output is user-facing). The substitution map lives on the server side only.
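A hedged sketch of the streaming layer itself, under two stated assumptions: rules are plain find-and-replace (no regex), and a replacement never contains its own search string. Each chunk is rewritten before it leaves the server, and only a tail short enough to still be a partial match is held back, so no incorrect intermediate state is ever emitted:

```typescript
type Rule = { find: string; replace: string };

// Apply find-and-replace rules to a token/chunk stream. Assumes at least one
// rule and that no rule's replacement contains its own search string.
function* rewriteStream(chunks: Iterable<string>, rules: Rule[]): Generator<string> {
  const maxLen = Math.max(...rules.map((r) => r.find.length));
  let buffer = "";
  for (const chunk of chunks) {
    buffer += chunk;
    // Deterministic rewrite: no model call, just string substitution.
    for (const { find, replace } of rules) buffer = buffer.split(find).join(replace);
    // Flush everything except a tail that could still be a partial match
    // spanning the next chunk boundary.
    const safe = Math.max(0, buffer.length - (maxLen - 1));
    yield buffer.slice(0, safe);
    buffer = buffer.slice(safe);
  }
  yield buffer; // flush the remainder at end of stream
}
```

The held-back tail is at most `maxLen - 1` characters, so the per-chunk latency and memory cost are bounded by the longest rule, which is what keeps the layer inside the per-token budget.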
When to use¶
- Failure modes that are detectable mid-stream — known bad import syntax, impossible-symbol references, known-long tokens. If detection requires seeing the whole artifact (cross-file invariants), use a post-stream autofixer instead.
- Corrections that are deterministic — lookup, find-and-replace, embedding nearest-neighbour. If the correction requires judgment, use a fine-tuned model in a post-stream stage.
- Targets where UX invariants require no flicker — live preview renderers (v0), code editors, chat UIs that render incrementally.
When not to use¶
- Cross-file invariants — "these often involve changes across multiple files or require analyzing the abstract syntax tree (AST)." Streaming rewrite can't see the full generated program at token-time; post-stream autofixers handle this.
- Corrections requiring model judgment — e.g. "where should the QueryClientProvider be wrapped?" needs a placement decision, not a symbol swap.
- Unbounded per-token rewrite rules — if a rule's latency isn't bounded, it stalls the stream.
Seen in¶
- sources/2026-01-08-vercel-how-we-made-v0-an-effective-coding-agent — canonical LLM Suspense disclosure; two applications (URL shortening, icon resolution); <100 ms per substitution; explicit UX invariant "user never sees an intermediate incorrect state."
Related¶
- systems/vercel-v0 — canonical consumer.
- systems/lucide-react — canonical icon-target.
- patterns/embedding-based-name-resolution — the more specific pattern this one uses for icon fixes.
- patterns/composite-model-pipeline — this is stage 3 of the v0 composite pipeline.
- concepts/llm-icon-hallucination — the failure mode the icon-rewrite application fixes.
- patterns/deterministic-plus-model-autofixer — the post-stream sibling; streaming rewrite handles local corrections, the autofixer handles cross-file AST-level ones.