CONCEPT Cited by 1 source

Constrained decoding (structured output)

Definition

Constrained decoding is the family of LLM inference techniques that restrict the set of tokens the model can emit at each generation step, so the output is guaranteed to conform to a formal specification: a JSON schema, a grammar, a regex, or an enumerated vocabulary. The restriction is typically enforced by masking the logits of disallowed tokens.

The common mechanisms:

  • JSON-schema-guided decoding (Outlines, llama.cpp GBNF, OpenAI's response_format: json_schema, vLLM guided decoding).
  • Regex-guided decoding — the output must match a regex.
  • Grammar-guided decoding — a BNF or PEG grammar defines valid completions.
  • Logit biasing — manually zero-out logits for disallowed tokens (coarse-grained).
  • Finite-state-machine guidance — a precomputed FSM over the tokenizer drives the generation.

Guaranteed output shape is the primary win; the structured output is then directly consumable by downstream code without a parser step.

Why it matters for cascaded LLM pipelines

In a cascaded LLM generation pipeline, one phase's output is the next phase's input. If the intermediate artefact's shape is unreliable (occasional missing-field JSON, typos in enum values), the downstream phase breaks.

Three reasons constrained decoding is load-bearing at the pipeline layer:

  1. Interpretability. Phase-1 themes with a structured schema (title, persona, derived-concept list) are human-inspectable and operator-debuggable.
  2. Downstream consumability. Phase 2 doesn't need a parser for messy free-form output.
  3. Operational safety. The shape guarantee is an invariant, so parsing failures can't take down the pipeline.
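The downstream-consumability point can be made concrete: because Phase 1's output is schema-guaranteed JSON, the next phase can load it with a plain `json.loads` and attribute access, with no defensive parsing or repair step. The field names and values below are hypothetical, not Instacart's actual schema.

```python
# Sketch: a schema-guaranteed Phase-1 artefact consumed directly by Phase 2.
import json
from dataclasses import dataclass

@dataclass
class Theme:
    title: str
    persona: str
    derived_concepts: list

# Pretend this string came from a constrained-decoding Phase-1 call;
# the schema guarantees these three fields are present.
phase1_output = ('{"title": "Weeknight dinners", "persona": "busy parent", '
                 '"derived_concepts": ["quick pasta", "sheet-pan chicken"]}')

theme = Theme(**json.loads(phase1_output))  # no try/except, no repair step
keywords = [f"{theme.persona} {c}" for c in theme.derived_concepts]
print(keywords)  # → ['busy parent quick pasta', 'busy parent sheet-pan chicken']
```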

Canonical wiki instance — Instacart generative recommendations platform (2026-02-26)

Source: sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms

Instacart's Phase 1 (page design + theme generation) uses constrained decoding with a structured schema. From the post:

"We leverage constrained decoding with a structured schema to ensure interpretability and downstream usability."

The schema details are not disclosed; the post cites interpretability and downstream usability as the motivations, which matches the "structured output for the next phase" role in a cascaded pipeline.

This is an enabler choice, not a performance choice — without a structured Phase-1 output, Phase 2's retrieval keyword generation can't reliably consume Phase-1 themes + derived signals.

Tradeoff: constrained decoding vs reasoning quality

Tam et al. 2024 (arXiv:2408.02442) — cited by Instacart's sibling LACE system — shows that restricted decoding can hurt reasoning quality compared to free-form output. The canonical production workaround is to decouple reasoning from structured output: a strong reasoner emits a free-form rationale, then a cheaper step emits the structured output.
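The decoupling pattern is a two-call pipeline. In this sketch, `llm` is a hypothetical stand-in for any chat-completion client; the prompts, schema, and return values are invented for illustration.

```python
# Sketch of the decouple-reasoning-from-output workaround: call 1 reasons
# free-form (quality-sensitive), call 2 is constrained (shape-sensitive).
import json

def llm(prompt, response_schema=None):
    # Hypothetical stand-in. A real client would pass response_schema via
    # e.g. a response_format parameter to a schema-capable hosted API.
    if response_schema is None:
        return "The item is perishable and refrigerated, so... aisle: dairy."
    return '{"aisle": "dairy"}'

# Step 1: unconstrained reasoning by the strong model.
rationale = llm("Which aisle does this item belong in? Think step by step.")

# Step 2: cheap constrained extraction over the rationale.
structured = llm(f"Extract the aisle from: {rationale}",
                 response_schema={"type": "object",
                                  "properties": {"aisle": {"type": "string"}}})
print(json.loads(structured))  # → {'aisle': 'dairy'}
```

The extraction step can use a much smaller model, since the hard reasoning has already been spelled out in the rationale it reads.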

Instacart's Phase 1 is a generation task (themes + personas + concepts) where the schema is a presentation shape rather than a reasoning constraint — the tradeoff is less acute than in a multi-dimensional judge setting.

Mechanism comparison

Mechanism              Guarantee               Performance     Ecosystem
JSON-schema decoding   Schema-valid JSON       ~0-10% slower   Broad (Outlines, vLLM, llama.cpp, OpenAI)
Grammar decoding       Grammar-valid string    ~0-10% slower   vLLM, llama.cpp
Regex decoding         Regex-matching string   Fast            Outlines, some hosted APIs
Logit biasing          Token set               Very fast       OpenAI's logit_bias, others
FSM-guided             FSM-language string     Very fast       Outlines

Hosted providers (OpenAI, Anthropic, Google) typically surface JSON-schema decoding as response_format. Self-hosted inference (vLLM, llama.cpp) exposes grammar-based or FSM-based guidance.
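The two surfaces look roughly like the fragments below. These are request shapes only, not live calls; the model name, field names, and grammar are illustrative assumptions, and real APIs should be checked against their own docs.

```python
# Hosted side: an OpenAI-style response_format payload requesting
# JSON-schema-constrained output (shape per OpenAI's structured outputs).
openai_style = {
    "model": "gpt-4o-mini",  # any schema-capable model; placeholder
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "theme",
            "schema": {
                "type": "object",
                "properties": {"title": {"type": "string"}},
                "required": ["title"],
            },
        },
    },
}

# Self-hosted side: a llama.cpp GBNF grammar constraining output to a JSON
# object with a single "title" string field (simplified, no escapes).
gbnf_grammar = r'''
root   ::= "{" ws "\"title\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 ]* "\""
ws     ::= [ \t\n]*
'''
```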

Failure modes

  • Schema over-constrains the model. Tight JSON schemas can force the model into poor completions to satisfy the shape; loosen the schema or decouple reasoning from output.
  • Schema drift. Schema changes mid-generation without pipeline-wide update cause downstream parse failures despite the model "obeying" its local schema.
  • Enum-value hallucination inside strings. JSON schema guarantees the shape but not the values — a status field guaranteed to be a string can still have unexpected values.
  • Performance overhead at decode time. Schema compilation + per-token validation adds 5-15% latency in most implementations; usually acceptable.
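The enum-value failure mode above implies a cheap mitigation: validate values post-hoc even when the schema guarantees the shape. A minimal sketch, with a hypothetical `status` field and value set:

```python
# Guarding against enum-value hallucination: the schema guarantees `status`
# is a string, but only a post-hoc check guarantees it is a *valid* string.
import json

VALID_STATUS = {"active", "paused", "retired"}

def consume(phase_output: str) -> str:
    record = json.loads(phase_output)    # shape guaranteed upstream
    status = record["status"]
    if status not in VALID_STATUS:       # the value is NOT guaranteed
        raise ValueError(f"unexpected status: {status!r}")
    return status

print(consume('{"status": "active"}'))   # → active
try:
    consume('{"status": "actve"}')       # schema-valid string, bad value
except ValueError as e:
    print(e)                             # → unexpected status: 'actve'
```

Alternatively, putting the enum itself into the JSON schema (an `"enum"` keyword) moves this check into the decoder, at the cost of a tighter constraint on generation.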

Seen in