
"continue" prompt for truncated output

Definition

The "continue"-prompt primitive is the idiom of sending the literal string "continue" as the next user message in the same conversation when an LLM response is cut off at the output-token limit, then concatenating the continuation onto the previous partial output. The conversation-API layer keeps the partial assistant message in the chat history, so the continuation picks up mid-sentence without re-emitting what was already written.

Why it works

LLM responses are generated token by token until either the model emits a stop token or the max_tokens limit is hit. At the limit, the response is truncated mid-output — the model doesn't know it ran out of budget.

On the next turn with "continue" as the user message, the model sees its own previous assistant message (the partial output) in the chat history and — because models are trained to continue coherent-looking text — resumes generation from roughly where it left off. The client concatenates the second turn's output onto the first and treats the union as a single response.
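Concretely, the client keeps a running messages list; each continuation turn replays the partial assistant output and appends the literal "continue". A minimal sketch of that loop, with `send` standing in (hypothetically) for the actual chat-API call:

```python
# Sketch of the "continue" loop. `send(messages)` is a stand-in for a
# real chat-API call; it returns (text, finish_reason), where a
# finish_reason of "length" means the response hit the output-token cap.

def complete_with_continues(send, user_prompt, max_continues=5):
    messages = [{"role": "user", "content": user_prompt}]
    parts = []
    for _ in range(max_continues + 1):
        text, finish_reason = send(messages)
        parts.append(text)
        if finish_reason != "length":
            break  # model emitted a stop token: output is complete
        # Replay the partial output in history and nudge with "continue".
        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user", "content": "continue"})
    return "".join(parts)
```

The client treats the concatenation of all turns as one response; the model never sees the stitching.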

Why "continue" beats elaborate re-prompting

Counter-intuitive finding from Zalando:

"As per our tests, a simple 'continue' prompt proved more reliable than more complex prompts to continue the transformation." (Source: sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries)

Why: a more elaborate prompt ("please continue from where you left off, and make sure to emit the rest of the transformed button component, without repeating what you already wrote") adds ambiguity and re-opens decoding choices. The model may:

  • Try to re-summarise what was already written (duplicating output).
  • Re-interpret the task in light of the new instructions (changing style mid-stream).
  • Refuse if the elaborate prompt phrases things as a new task ("I can't continue; please restart").

The minimal nudge — "continue" — signals nothing but "keep going" and lets the trained-completion prior do the work.

Zalando's application

The Zalando Partner Tech team hit the 4K output-token limit on GPT-4o for files exceeding a certain length. Instead of asking the model to stream or return partial JSON, they used the llm library's conversation API and wrapped the per-file transformation in a loop: while the response is truncated, send "continue". The outputs are concatenated, so the downstream parser sees a single <updatedContent>…</updatedContent> block.
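In outline (names are illustrative, not Zalando's actual code): `conversation.prompt` follows the shape of the llm library's conversation API, and `is_truncated` is a hypothetical stand-in for however truncation is detected on the response object.

```python
import re

def transform_file(conversation, source, is_truncated):
    # is_truncated(response) -> bool stands in for checking the
    # provider's finish reason on the response object.
    response = conversation.prompt(f"Transform this component:\n{source}")
    output = response.text()
    while is_truncated(response):
        # Same conversation, literal "continue"; concatenate the turns.
        response = conversation.prompt("continue")
        output += response.text()
    # The downstream parser sees one fenced block regardless of turn count.
    match = re.search(r"<updatedContent>(.*?)</updatedContent>", output, re.DOTALL)
    return match.group(1) if match else None
</parameter>```

The fencing contract does the heavy lifting: however many turns it takes, parsing only ever looks for the one opening and closing tag.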

Limits / failure modes

  • Repetition. Sometimes the model re-emits the last few tokens of the previous response before continuing. Zalando does not describe a de-dup pass, implying this was not common enough to be a reliability issue under the <updatedContent> fencing contract (the parser finds the closing tag regardless).
  • State drift on very long chains. Two or three continuations were fine; a pathological chain of 10+ continuations would likely accumulate more drift. Not quantified in the article.
  • Not a substitute for larger budgets. If the overwhelming majority of files need a continuation, the right response is to split the work (smaller input files, smaller scope per call), not repeatedly re-pay the multi-turn cost.
  • Provider-specific. The primitive depends on the partial assistant message being replayed in the chat history on the next turn. It is simplest with client libraries (like the llm library's conversation API) that manage that history automatically; with a bare stateless chat API, the client must resend the partial output itself.
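If the repetition failure mode did need handling, a minimal overlap trimmer (not something Zalando describes) could drop the longest suffix of the accumulated output that the continuation re-emits as a prefix:

```python
def trim_overlap(accumulated, continuation, max_overlap=200):
    # Find the longest suffix of `accumulated` that `continuation`
    # starts by repeating, and drop that prefix from the continuation.
    limit = min(max_overlap, len(accumulated), len(continuation))
    for size in range(limit, 0, -1):
        if accumulated.endswith(continuation[:size]):
            return continuation[size:]
    return continuation  # no overlap detected
```

Capping the search at `max_overlap` keeps the check cheap and avoids treating a genuinely repetitive document (e.g. boilerplate imports) as duplication.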

Relation to other primitives

  • vs streaming. Streaming returns the response as it's being generated; it doesn't change the max_tokens cap. continue is a recovery mechanism after the cap is hit.
  • vs re-running with higher max_tokens. Raising the limit (if the model supports it) avoids truncation at cost of longer single-request latency and possibly higher per-request cost. continue keeps per-request size small and only pays the continuation cost when needed.
  • vs JSON mode / structured output. Structured-output APIs still truncate at the token limit; continue applies there too, but the intermediate partial is no longer valid JSON and needs a terminal-state assembler.
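One way to build that terminal-state assembler is to use parse failure itself as the continuation signal: truncated JSON never parses, so the buffer is only handed to json.loads once per attempt and the result is returned as soon as it is valid. A sketch, with `send_turn` a hypothetical stand-in for one conversation turn:

```python
import json

def collect_json(send_turn, first_prompt, max_continues=5):
    # send_turn(prompt) -> text is a stand-in for one conversation turn.
    # Intermediate partials are not valid JSON; only the concatenation
    # of every turn parses.
    buffer = send_turn(first_prompt)
    for _ in range(max_continues):
        try:
            return json.loads(buffer)
        except json.JSONDecodeError:
            buffer += send_turn("continue")
    return json.loads(buffer)  # raises if still incomplete after the cap
</parameter>```

This heuristic assumes the final output is pure JSON; if the model wraps it in prose or fences, the assembler needs an extraction step first.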
