Constrained decoding (structured output)¶
Definition¶
Constrained decoding is the family of LLM inference techniques that restrict the set of tokens the model can emit at each generation step, so the output is guaranteed to conform to a formal schema — JSON schema, grammar, regex, enumerated vocabulary, or logit-biasing masks.
The common mechanisms:
- JSON-schema-guided decoding (Outlines, llama.cpp GBNF, OpenAI's response_format: json_schema, vLLM guided decoding).
- Regex-guided decoding — the output must match a regex.
- Grammar-guided decoding — a BNF or PEG grammar defines valid completions.
- Logit biasing — manually zero-out logits for disallowed tokens (coarse-grained).
- Finite-state-machine guidance — a precomputed FSM over the tokenizer drives the generation.
Guaranteed output shape is the primary win; the structured output is then directly consumable by downstream code without a parser step.
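Mechanically, all of these reduce to the same per-step operation: compute the set of tokens the constraint currently allows, mask the logits of everything else, and sample from what remains. A minimal, library-agnostic sketch in NumPy (the hard-coded allowed-token set is a stand-in for whatever constraint engine, schema compiler, grammar, or regex FSM would supply it):

```python
import numpy as np

def mask_logits(logits: np.ndarray, allowed_ids: set[int]) -> np.ndarray:
    """Set logits of all disallowed tokens to -inf so they get zero probability."""
    masked = np.full_like(logits, -np.inf)
    idx = list(allowed_ids)
    masked[idx] = logits[idx]
    return masked

def sample_constrained(logits: np.ndarray, allowed_ids: set[int]) -> int:
    """Softmax over the masked logits, then sample only among allowed tokens."""
    masked = mask_logits(logits, allowed_ids)
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Toy example: vocabulary of 8 tokens; the constraint engine says only {2, 5, 7} are legal next.
logits = np.random.randn(8)
next_token = sample_constrained(logits, {2, 5, 7})
assert next_token in {2, 5, 7}  # the guarantee constrained decoding provides
```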
Why it matters for cascaded LLM pipelines¶
In a cascaded LLM generation pipeline, one phase's output is the next phase's input. If the intermediate artefact's shape is unreliable (occasional missing-field JSON, typos in enum values), the downstream phase breaks.
Three reasons constrained decoding is load-bearing at the pipeline layer:
- Interpretability. Phase-1 themes with a structured schema (title, persona, derived-concept list) are human-inspectable and operator-debuggable (a minimal schema sketch follows this list).
- Downstream consumability. Phase 2 doesn't need a parser for messy free-form output.
- Operational safety. The shape guarantee is an invariant — parsing failures can't take down the pipeline.
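To make the consumability point concrete, here is a hypothetical Pydantic sketch of a Phase-1 theme schema. The field names are taken from the bullet above, not from Instacart's undisclosed schema, and the sample payload is invented; the point is that Phase 2 consumes typed objects rather than re-parsing free-form text:

```python
from pydantic import BaseModel

class Theme(BaseModel):
    """Hypothetical Phase-1 output shape: title, persona, derived concepts."""
    title: str
    persona: str
    derived_concepts: list[str]

def phase_2_keywords(theme: Theme) -> list[str]:
    """Phase 2 consumes the typed object directly -- no repair of free-form text needed."""
    return [theme.title, *theme.derived_concepts]

# Phase-1 output arrives as schema-valid JSON; validation either succeeds or
# fails loudly at the phase boundary instead of breaking downstream code.
raw = '{"title": "Game-day snacks", "persona": "weekend host", "derived_concepts": ["chips", "salsa"]}'
theme = Theme.model_validate_json(raw)
print(phase_2_keywords(theme))
```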
Canonical wiki instance — Instacart generative recommendations platform (2026-02-26)¶
Source: sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms
Instacart's Phase 1 (page design + theme generation) uses constrained decoding with a structured schema. From the post:
"We leverage constrained decoding with a structured schema to ensure interpretability and downstream usability."
The schema details are not disclosed; the post names interpretability and downstream usability as the motivations, which matches the "structured output for the next phase" cascade-pipeline role.
This is an enabler choice, not a performance choice — without a structured Phase-1 output, Phase 2's retrieval keyword generation can't reliably consume Phase-1 themes + derived signals.
Tradeoff: constrained decoding vs reasoning quality¶
Tam et al. 2024 (arXiv:2408.02442), cited by Instacart's sibling LACE system, shows that restricted decoding can hurt reasoning quality compared to free-form output. The canonical production workaround is to decouple reasoning from structured output: the strong reasoner emits a free-form rationale, and a cheaper second step emits the structured output.
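A minimal sketch of that workaround, with `complete_freeform` and `complete_structured` as hypothetical stand-ins for the two model calls (only the second runs under constrained decoding):

```python
from typing import Callable

def reason_then_structure(
    task: str,
    complete_freeform: Callable[[str], str],    # strong reasoner, unconstrained decoding
    complete_structured: Callable[[str], dict], # cheaper call with schema-constrained decoding
) -> dict:
    """Decouple reasoning from structured output: reason freely, then extract into the schema."""
    rationale = complete_freeform(f"Think step by step about: {task}")
    return complete_structured(
        f"Extract the final answer from this rationale as schema-valid JSON:\n{rationale}"
    )
```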
Instacart's Phase 1 is a generation task (themes + personas + concepts) where the schema is a presentation shape rather than a reasoning constraint — the tradeoff is less acute than in a multi-dimensional judge setting.
Mechanism comparison¶
| Mechanism | Guarantee | Performance | Ecosystem |
|---|---|---|---|
| JSON-schema decoding | Schema-valid JSON | ~0-10% slower | Broad (Outlines, vLLM, llama.cpp, OpenAI) |
| Grammar decoding | Grammar-valid string | ~0-10% slower | vLLM, llama.cpp |
| Regex decoding | Regex-matching string | Fast | Outlines, some hosted APIs |
| Logit biasing | Token set | Very fast | OpenAI's logit_bias, others |
| FSM-guided | FSM-language string | Very fast | Outlines |
Hosted providers (OpenAI, Anthropic, Google) typically surface JSON-schema decoding through their structured-output APIs (OpenAI's response_format being the most direct example). Self-hosted inference stacks (vLLM, llama.cpp) expose grammar-based or FSM-based guidance.
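For the hosted path, a hedged sketch of how this commonly looks with the OpenAI Python SDK; the model name and schema are illustrative, and parameter shapes vary by provider and SDK version:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative schema; a real pipeline would reuse its Phase-1 schema here.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "derived_concepts": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "derived_concepts"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Propose a shopping theme for a tailgate party."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "theme", "schema": schema, "strict": True},
    },
)
print(resp.choices[0].message.content)  # schema-valid JSON under strict mode
```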
Failure modes¶
- Schema over-constrains the model. Tight JSON schemas can force the model into poor completions to satisfy the shape; loosen the schema or decouple reasoning from output.
- Schema drift. Schema changes mid-generation without pipeline-wide update cause downstream parse failures despite the model "obeying" its local schema.
- Enum-value hallucination inside strings. JSON schema guarantees the shape but not the values — a status field guaranteed to be a string can still carry unexpected values (a post-hoc validation sketch follows this list).
- Performance overhead at decode time. Schema compilation + per-token validation adds 5-15% latency in most implementations; usually acceptable.
Seen in¶
- sources/2026-02-26-instacart-our-early-journey-to-transform-discovery-recommendations-with-llms — canonical wiki instance at Instacart's Shopping Hub Phase 1. Constrained decoding with a structured schema; motivation is interpretability + downstream usability in a cascaded pipeline.
Related¶
- concepts/structured-output-reliability — the operational observability axis (malformed-JSON rate).
- concepts/decouple-reasoning-from-structured-output — the production workaround when schema constraints hurt reasoning.
- concepts/cascaded-llm-generation — the host context in which constrained decoding is load-bearing.
- patterns/top-down-cascaded-page-generation — the canonical production pattern that depends on Phase 1's structured output.
- systems/instacart-generative-recommendations-platform — canonical production consumer.
- companies/instacart