CONCEPT Cited by 1 source

Constrained decoding (structured output)

Definition

Constrained decoding is the family of LLM inference techniques that restrict the set of tokens the model can emit at each generation step, so the output is guaranteed to conform to a formal specification: a JSON schema, a grammar, a regex, or an enumerated vocabulary. The restriction is typically enforced by masking the logits of disallowed tokens.

The common mechanisms:

  • JSON-schema-guided decoding (Outlines, llama.cpp GBNF, OpenAI's response_format: json_schema, vLLM guided decoding).
  • Regex-guided decoding — the output must match a regex.
  • Grammar-guided decoding — a BNF or PEG grammar defines valid completions.
  • Logit biasing — manually zero-out logits for disallowed tokens (coarse-grained).
  • Finite-state-machine guidance — a precomputed FSM over the tokenizer drives the generation.

Guaranteed output shape is the primary win; the structured output is then directly consumable by downstream code without a parser step.

Why it matters for cascaded LLM pipelines

In a cascaded LLM generation pipeline, one phase's output is the next phase's input. If the intermediate artefact's shape is unreliable (occasional missing-field JSON, typos in enum values), the downstream phase breaks.

Three reasons constrained decoding is load-bearing at the pipeline layer:

  1. Interpretability. Phase-1 themes with a structured schema (title, persona, derived-concept list) are human-inspectable and operator-debuggable.
  2. Downstream consumability. Phase 2 doesn't need a parser for messy free-form output.
  3. Operational safety. The shape guarantee is an invariant, so parsing failures can't take down the pipeline.
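The downstream-consumability point can be made concrete: because Phase 1's output is schema-guaranteed JSON, the next phase can load it with a plain `json.loads` and attribute access, with no defensive parsing or repair step. The field names and values below are hypothetical, not Instacart's actual schema.

```python
# Sketch: a schema-guaranteed Phase-1 artefact consumed directly by Phase 2.
import json
from dataclasses import dataclass

@dataclass
class Theme:
    title: str
    persona: str
    derived_concepts: list

# Pretend this string came from a constrained-decoding Phase-1 call;
# the schema guarantees these three fields are present.
phase1_output = ('{"title": "Weeknight dinners", "persona": "busy parent", '
                 '"derived_concepts": ["quick pasta", "sheet-pan chicken"]}')

theme = Theme(**json.loads(phase1_output))  # no try/except, no repair step
keywords = [f"{theme.persona} {c}" for c in theme.derived_concepts]
print(keywords)  # → ['busy parent quick pasta', 'busy parent sheet-pan chicken']
```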

Canonical wiki instance — Instacart generative recommendations platform (2026-02-26)

Source: sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms

Instacart's Phase 1 (page design + theme generation) uses constrained decoding with a structured schema. From the post:

"We leverage constrained decoding with a structured schema to ensure interpretability and downstream usability."

The schema details are not disclosed; the post cites interpretability and downstream usability as the motivations, which matches the "structured output for the next phase" role in a cascaded pipeline.

This is an enabler choice, not a performance choice — without a structured Phase-1 output, Phase 2's retrieval keyword generation can't reliably consume Phase-1 themes + derived signals.

Tradeoff: constrained decoding vs reasoning quality

Tam et al. 2024 (arXiv:2408.02442) — cited by Instacart's sibling LACE system — shows that restricted decoding can hurt reasoning quality compared to free-form output. The canonical production workaround is to decouple reasoning from structured output: a strong reasoner emits a free-form rationale, then a cheaper step emits the structured output.
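The decoupling pattern is a two-call pipeline. In this sketch, `llm` is a hypothetical stand-in for any chat-completion client; the prompts, schema, and return values are invented for illustration.

```python
# Sketch of the decouple-reasoning-from-output workaround: call 1 reasons
# free-form (quality-sensitive), call 2 is constrained (shape-sensitive).
import json

def llm(prompt, response_schema=None):
    # Hypothetical stand-in. A real client would pass response_schema via
    # e.g. a response_format parameter to a schema-capable hosted API.
    if response_schema is None:
        return "The item is perishable and refrigerated, so... aisle: dairy."
    return '{"aisle": "dairy"}'

# Step 1: unconstrained reasoning by the strong model.
rationale = llm("Which aisle does this item belong in? Think step by step.")

# Step 2: cheap constrained extraction over the rationale.
structured = llm(f"Extract the aisle from: {rationale}",
                 response_schema={"type": "object",
                                  "properties": {"aisle": {"type": "string"}}})
print(json.loads(structured))  # → {'aisle': 'dairy'}
```

The extraction step can use a much smaller model, since the hard reasoning has already been spelled out in the rationale it reads.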

Instacart's Phase 1 is a generation task (themes + personas + concepts) where the schema is a presentation shape rather than a reasoning constraint — the tradeoff is less acute than in a multi-dimensional judge setting.

Mechanism comparison

Mechanism              Guarantee               Performance     Ecosystem
JSON-schema decoding   Schema-valid JSON       ~0-10% slower   Broad (Outlines, vLLM, llama.cpp, OpenAI)
Grammar decoding       Grammar-valid string    ~0-10% slower   vLLM, llama.cpp
Regex decoding         Regex-matching string   Fast            Outlines, some hosted APIs
Logit biasing          Token set               Very fast       OpenAI's logit_bias, others
FSM-guided             FSM-language string     Very fast       Outlines

Hosted providers (OpenAI, Anthropic, Google) typically surface JSON-schema decoding as response_format. Self-hosted inference (vLLM, llama.cpp) exposes grammar-based or FSM-based guidance.
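The two surfaces look roughly like the fragments below. These are request shapes only, not live calls; the model name, field names, and grammar are illustrative assumptions, and real APIs should be checked against their own docs.

```python
# Hosted side: an OpenAI-style response_format payload requesting
# JSON-schema-constrained output (shape per OpenAI's structured outputs).
openai_style = {
    "model": "gpt-4o-mini",  # any schema-capable model; placeholder
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "theme",
            "schema": {
                "type": "object",
                "properties": {"title": {"type": "string"}},
                "required": ["title"],
            },
        },
    },
}

# Self-hosted side: a llama.cpp GBNF grammar constraining output to a JSON
# object with a single "title" string field (simplified, no escapes).
gbnf_grammar = r'''
root   ::= "{" ws "\"title\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 ]* "\""
ws     ::= [ \t\n]*
'''
```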

Failure modes

  • Schema over-constrains the model. Tight JSON schemas can force the model into poor completions to satisfy the shape; loosen the schema or decouple reasoning from output.
  • Schema drift. Schema changes mid-generation without pipeline-wide update cause downstream parse failures despite the model "obeying" its local schema.
  • Enum-value hallucination inside strings. JSON schema guarantees the shape but not the values — a status field guaranteed to be a string can still have unexpected values.
  • Performance overhead at decode time. Schema compilation + per-token validation adds 5-15% latency in most implementations; usually acceptable.
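The enum-value failure mode above implies a cheap mitigation: validate values post-hoc even when the schema guarantees the shape. A minimal sketch, with a hypothetical `status` field and value set:

```python
# Guarding against enum-value hallucination: the schema guarantees `status`
# is a string, but only a post-hoc check guarantees it is a *valid* string.
import json

VALID_STATUS = {"active", "paused", "retired"}

def consume(phase_output: str) -> str:
    record = json.loads(phase_output)    # shape guaranteed upstream
    status = record["status"]
    if status not in VALID_STATUS:       # the value is NOT guaranteed
        raise ValueError(f"unexpected status: {status!r}")
    return status

print(consume('{"status": "active"}'))   # → active
try:
    consume('{"status": "actve"}')       # schema-valid string, bad value
except ValueError as e:
    print(e)                             # → unexpected status: 'actve'
```

Alternatively, putting the enum itself into the JSON schema (an `"enum"` keyword) moves this check into the decoder, at the cost of a tighter constraint on generation.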

Seen in