"continue" prompt for truncated output¶
Definition¶
The "continue"-prompt primitive is the idiom of sending
the literal string "continue" on the same conversation turn
when an LLM response is cut off at the output-token limit,
and concatenating the continuation with the previous partial
output. The conversation-API layer preserves the model's
in-flight state so the continuation picks up mid-sentence
without re-emitting anything.
Why it works¶
LLM responses are generated token by token until either the
model emits a stop token or the max_tokens limit is hit.
At the limit, the response is truncated mid-output — the
model doesn't know it ran out of budget.
On the next turn with "continue" as the user message, the model sees its own previous assistant message (the partial output) in the chat history and — because models are trained to continue coherent-looking text — resumes generation from roughly where it left off. The client concatenates the second turn's output onto the first and treats the union as a single response.
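A minimal sketch of the mechanism against the raw OpenAI Chat Completions API; the model name and token cap are illustrative, and this is not Zalando's code:

```python
from openai import OpenAI

client = OpenAI()

def complete_with_continue(task_prompt: str, cap: int = 4000) -> str:
    """Run one task, sending the bare "continue" whenever the
    response is cut off at the output-token cap."""
    messages = [{"role": "user", "content": task_prompt}]
    chunks: list[str] = []
    while True:
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, max_tokens=cap
        )
        choice = resp.choices[0]
        chunks.append(choice.message.content or "")
        if choice.finish_reason != "length":
            break  # a stop token was emitted: the response is complete
        # Replay the partial output, then nudge with the bare "continue".
        messages.append({"role": "assistant", "content": choice.message.content})
        messages.append({"role": "user", "content": "continue"})
    return "".join(chunks)
```

The load-bearing move is appending the truncated assistant message back onto `messages` before the "continue" turn; without it, the model has nothing to resume.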
Why "continue" beats elaborate re-prompting¶
Counter-intuitive finding from Zalando:
"As per our tests, a simple 'continue' prompt proved more reliable than more complex prompts to continue the transformation." (Source: sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries)
Why: a more elaborate prompt ("please continue from where you left off, and make sure to emit the rest of the transformed button component, without repeating what you already wrote") adds ambiguity and re-opens decoding choices. The model may:
- Try to re-summarise what was already written (duplicating output).
- Re-interpret the task in light of the new instructions (changing style mid-stream).
- Refuse if the elaborate prompt phrases things as a new task ("I can't continue; please restart").
The minimal nudge — "continue" — signals nothing but "keep going" and lets the trained-completion prior do the work.
Zalando's application¶
The Zalando Partner Tech team hit the 4K output-token
limit on GPT-4o for files exceeding a certain length.
Instead of asking the model to stream or return partial
JSON, they used the `llm` library's conversation API and a
`while truncated: send("continue")` loop around the per-file
transformation, sketched below. The outputs are concatenated;
the downstream parser sees a single
`<updatedContent>…</updatedContent>` block.
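A hedged sketch of that loop with the `llm` Python library. `was_truncated` is a hypothetical helper (the article doesn't show how truncation is detected, and the stored response-JSON shape varies across `llm` versions), and `TRANSFORM_PROMPT` stands in for the real per-file prompt:

```python
import llm

TRANSFORM_PROMPT = "..."  # placeholder for the per-file transformation prompt

def was_truncated(response) -> bool:
    # Hypothetical helper: with the OpenAI plugin the raw response JSON
    # carries finish_reason == "length" when the cap was hit. The stored
    # shape varies across versions, so accept flat and choices-style layouts.
    data = getattr(response, "response_json", None) or {}
    if "finish_reason" in data:
        return data["finish_reason"] == "length"
    choices = data.get("choices") or [{}]
    return choices[0].get("finish_reason") == "length"

model = llm.get_model("gpt-4o")
conversation = model.conversation()  # replays prior turns on each prompt

response = conversation.prompt(TRANSFORM_PROMPT, max_tokens=4000)
chunks = [response.text()]  # .text() forces the (lazy) response to complete
while was_truncated(response):
    response = conversation.prompt("continue")  # the minimal nudge
    chunks.append(response.text())

updated_file = "".join(chunks)  # parser still sees one <updatedContent> block
```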
Limits / failure modes¶
- Repetition. Sometimes the model re-emits the last few tokens of the previous response before continuing. Zalando does not describe a de-dup pass, implying this was not common enough to be a reliability issue under the `<updatedContent>` fencing contract (the parser finds the closing tag regardless); a trivial overlap-trimming join is sketched after this list.
- State drift on very long chains. Two or three continuations were fine; pathological cases of 10+ continuations would bring more drift. Not quantified in the article.
- Not a substitute for larger budgets. If the overwhelming majority of files need a continuation, the right response is to split the work (smaller input files, smaller scope per call), not repeatedly re-pay the multi-turn cost.
- Provider-specific. Some providers' chat APIs do not preserve response state across user messages the way OpenAI's does. The primitive is simplest on providers with stateful conversation semantics.
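If repetition did need handling, an overlap-trimming join would be a few lines. This is an assumption-level sketch; the article describes no such pass:

```python
def join_with_overlap_trim(prev: str, cont: str, max_overlap: int = 200) -> str:
    """Drop a duplicated seam: the longest suffix of `prev` (up to
    max_overlap characters) that the continuation re-emitted as its prefix."""
    for k in range(min(max_overlap, len(prev), len(cont)), 0, -1):
        if prev.endswith(cont[:k]):
            return prev + cont[k:]
    return prev + cont
```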
Relation to other primitives¶
- vs streaming. Streaming returns the response as it's being generated; it doesn't change the `max_tokens` cap. `continue` is a recovery mechanism after the cap is hit.
- vs re-running with higher `max_tokens`. Raising the limit (if the model supports it) avoids truncation at the cost of longer single-request latency and possibly higher per-request cost. `continue` keeps per-request size small and only pays the continuation cost when needed.
- vs JSON mode / structured output. Structured-output APIs still truncate at the token limit; `continue` applies there too, but the intermediate partial is no longer valid JSON and needs a terminal-state assembler (see the sketch after this list).
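"Terminal-state assembler" here just means: defer parsing until the continuation chain is finished. A minimal sketch, assuming the same `chunks` list built in the loop above:

```python
import json

def assemble_json(chunks: list[str]) -> dict:
    # Parse only the concatenation of all turns; any intermediate
    # partial is not valid JSON and must never be parsed on its own.
    assembled = "".join(chunks)
    try:
        return json.loads(assembled)
    except json.JSONDecodeError as exc:
        # Still truncated (or genuinely malformed): the caller should
        # request another continuation rather than try to repair the text.
        raise RuntimeError("output still incomplete; continue the chain") from exc
```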
Seen in¶
- sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries — canonical wiki instance. Zalando's Component Migration Toolkit uses it as the 4K-output-token recovery primitive.
Related¶
- concepts/context-window-as-token-budget — parent budget concept
- systems/python-llm-library — the library whose conversation API made this idiom expressible in a few lines
- systems/openai-api — the provider whose Chat Completions endpoint supports the multi-turn state this relies on
- patterns/llm-only-code-migration-pipeline — the pattern Zalando wraps this primitive inside
- companies/zalando