
CONCEPT

Temperature-zero for deterministic code generation

Definition

Temperature=0 is the LLM-sampling setting that makes the decoder deterministically pick the highest-probability token at every step (greedy decoding), collapsing the output distribution from stochastic to deterministic. It is the canonical lever for reproducible LLM outputs in code-generation pipelines, where the same input must produce the same output across runs.
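Greedy decoding is just an argmax over the model's next-token scores. A minimal sketch, using a toy stand-in for a real model's logits (the `toy_logits` function and its token scheme are invented for illustration):

```python
# Sketch: temperature=0 collapses sampling to greedy decoding (argmax).
# `toy_logits` is a hypothetical stand-in for a model's next-token scores.
def toy_logits(tokens):
    # Toy scoring rule: favor (last_token + 1) mod 5; token 4 acts as EOS.
    last = tokens[-1]
    return [1.0 if t == (last + 1) % 5 else 0.0 for t in range(5)]

def greedy_decode(logits, context, eos, max_steps=10):
    out = list(context)
    for _ in range(max_steps):
        scores = logits(out)
        # argmax: always take the single highest-scoring token, no randomness
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == eos:
            break
        out.append(next_token)
    return out

# Same input -> same output on every run: the reproducibility property.
print(greedy_decode(toy_logits, [0], eos=4))  # [0, 1, 2, 3]
```

With sampling (temperature > 0) the next token is drawn from the score distribution instead, so reruns can diverge at any step.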

Why it's load-bearing for code-migration pipelines

Code-migration tools have two properties that make reproducibility non-negotiable:

  1. Prompt-regression tests must be stable. The same input example must produce the same output — otherwise a CI test fails sometimes and not others, and prompt drift is undetectable against the noise floor.
  2. Debugging is intractable without reproduction. If a transformation fails on one run but succeeds on the next, engineers can't bisect the prompt to find the cause.

Zalando encountered both problems in the hackathon phase:

"Initially, we noticed varying outputs for the same input, making testing and validation challenging. Changing LLM settings, like setting the temperature parameter to 0 made the LLM's output to be more deterministic and reproducible." (Source: sources/2025-02-19-zalando-llm-powered-migration-of-ui-component-libraries)

What "deterministic" means in practice

  • Same input + same model version + temperature=0 → same output for that request. Deterministic enough to pin a golden output in a regression test.
  • Same input across time at temperature=0 can still vary because the model's backend may change. Zalando also describes "moody behaviour": "LLM tools occasionally produced inconsistent outputs. These issues appeared without any clear reason, sometimes simply by rerunning the same prompt on the same file at a different time." This is provider-side non-determinism (batching, routing, MoE expert selection, fp16/fp32 differences between GPU nodes) that temperature=0 cannot suppress. Determinism is best-effort, not absolute.

Tradeoffs

  • Loses creativity on open-ended tasks. Greedy decoding can produce less diverse outputs for creative writing or brainstorming tasks. For code-migration — where correctness is binary and creativity is a liability — this is a feature, not a bug.
  • Susceptible to degenerate loops. Greedy decoding sometimes gets stuck in repetition loops (the top-1 token at each step leads back to itself). In practice this is rare for well-instructed code-generation tasks but happens occasionally with buggy prompts.
  • Doesn't prevent model updates from changing outputs. When the provider rolls out a new checkpoint, the golden outputs need to be regenerated. Lock the model version (e.g. gpt-4o-2024-08-06 rather than gpt-4o) to defer this.
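The degenerate-loop tradeoff can be guarded against mechanically: check whether the tail of the output is the same n-gram repeated. A sketch (the function name and thresholds are illustrative, not from the source):

```python
# Sketch: detect the repetition loops greedy decoding can fall into,
# by testing whether the tail of the token stream is one n-gram repeated.
def has_repetition_loop(tokens, ngram=4, repeats=3):
    """True if the last `repeats` windows of length `ngram` are all identical."""
    tail_len = ngram * repeats
    if len(tokens) < tail_len:
        return False
    tail = tokens[-tail_len:]
    windows = [tuple(tail[i:i + ngram]) for i in range(0, tail_len, ngram)]
    return len(set(windows)) == 1
```

A guard like this lets a pipeline abort and flag a buggy prompt instead of emitting looping junk into a migration.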

Interaction with prompt caching

Temperature=0 is independent of prompt caching — caching works on the prefix regardless of sampling settings, and determinism works on the output regardless of caching. But they're often paired in production: caching is the cost/latency lever, temperature=0 is the reproducibility lever; together they produce a cheap, fast, predictable LLM call.

Role in regression testing

With temperature=0, the same input → same output invariant makes prompt-regression tests tractable: CI runs the toolkit over the golden example library and diffs the output against a checked-in expected file. Without temperature=0, the diffs would flake and the tests would be ignored.
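The diff-against-golden check can be sketched in a few lines (the `migrate` pipeline callable and the file layout are assumptions for illustration):

```python
# Sketch of a prompt-regression check, assuming a deterministic
# `migrate(source) -> str` pipeline call (hypothetical) and one
# checked-in golden output file per example.
from pathlib import Path

def check_golden(example: Path, golden: Path, migrate) -> bool:
    """Return True iff the pipeline's output matches the pinned golden output."""
    actual = migrate(example.read_text())
    expected = golden.read_text()
    return actual == expected
```

With temperature=0, a `False` here signals real prompt or model drift; without it, the same check fires on sampling noise and gets muted.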
