CONCEPT Cited by 1 source
Decouple reasoning from structured output¶
Definition¶
Decouple reasoning from structured output is a two-pass LLM design in which:
- Pass 1 — Reasoning. A strong reasoning model produces free-form text (rationale + verdict) with no format constraint, optimising for reasoning quality.
- Pass 2 — Formatting. A separate step — either a cheaper LLM, a format-aware model, or a rule-based parser — converts Pass 1's output into the downstream structured schema (JSON / Pydantic / grammar-constrained format).
The design explicitly breaks the usual single-call "produce JSON directly" pattern to escape the quality-vs-format tension.
Instacart's LACE canonicalises this pattern for chatbot evaluation (Source: sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm).
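The two passes can be sketched as separate calls. This is a minimal illustration, not LACE's implementation: the `reasoning_llm` / `formatter_llm` callables, prompts, and the PASS/FAIL convention are all assumptions standing in for real model clients.

```python
import json


def reasoning_pass(chat: str, criterion: str, reasoning_llm) -> str:
    """Pass 1: strongest model, no format constraint -- free-form rationale + verdict."""
    prompt = (
        f"Evaluate this chat against the criterion '{criterion}'.\n"
        f"Chat:\n{chat}\n"
        "Explain your reasoning, then state a final verdict (PASS or FAIL)."
    )
    return reasoning_llm(prompt)  # plain prose; no JSON requested


def formatting_pass(rationale: str, criterion: str, formatter_llm) -> dict:
    """Pass 2: cheaper model (or parser) rearranges the prose into the schema."""
    prompt = (
        "Convert the evaluation below into JSON with keys "
        f'"{criterion}" (boolean) and "explanation" (string). Output JSON only.\n\n'
        f"{rationale}"
    )
    return json.loads(formatter_llm(prompt))
</imports>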
Why decouple¶
"[JSON] can negatively affect performance due to restricted decoding... while many models support structured outputs, their ability to produce reliable and consistently formatted JSON varies." (Instacart LACE)
Two forces converge:
- Restricted-decoding quality loss. Constraining an LLM's output to a grammar (JSON schema, regex, Pydantic) is known to reduce reasoning quality on hard tasks. The model's best token under the constraint is often not its globally-best token; the rejection path can cascade into degraded reasoning.
- Best-at-reasoning ≠ best-at-formatting. At Instacart's LACE writing time, o1-preview was "our best-performing option at the time but lacked consistent JSON formatting capabilities" — so requiring one model to do both forced choosing between reasoning quality and format reliability.
Decoupling eliminates the trade-off: use the best reasoner for the hard task, and something cheaper for the easy task of rearranging its output into JSON.
Architecture¶
Input (chat + criterion)
│
▼
┌─────────────────────┐
│ Reasoning LLM │ ← free-form rationale + verdict
│ (strongest model, │ e.g. o1-preview on LACE
│ unconstrained) │
└─────────┬───────────┘
│ rationale (prose)
▼
┌─────────────────────┐
│ Formatter │ ← structured-output step
│ (cheaper LLM OR │ emits per-criterion JSON
│ rule-based parser) │ {score: T/F, explanation: "..."}
└─────────┬───────────┘
│
▼
Downstream consumer (dashboard / experimentation platform)
Two important properties:
- Pass 2 has low reasoning load. Converting "the chatbot correctly integrated the prior turn's context" (Pass 1 prose) into {"contextual_relevancy": true, "explanation": "..."} is a mechanical rearrangement, not a reasoning task. A small structure-aware model (or deterministic rules, if Pass 1's output is disciplined) handles it reliably.
- Pass 1 output is auditable. The rationale is preserved as a first-class artefact, not just an intermediate. LACE retains it to "guide future refinement" of the criterion prompts.
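If Pass 1 is prompted to end with a fixed verdict line, Pass 2 can even be a deterministic parser. A minimal sketch, assuming a "Verdict: PASS/FAIL" convention and a criterion name that are illustrative, not LACE's actual format:

```python
import re


def parse_rationale(rationale: str, criterion: str) -> dict:
    """Rule-based Pass 2: map a disciplined 'Verdict: PASS/FAIL' line to the schema."""
    match = re.search(r"Verdict:\s*(PASS|FAIL)", rationale, re.IGNORECASE)
    if match is None:
        # undisciplined Pass 1 output: fall back to an LLM formatter instead
        raise ValueError("no verdict line found in rationale")
    return {
        criterion: match.group(1).upper() == "PASS",
        "explanation": rationale.strip(),  # keep full prose as the auditable artefact
    }
```

Note the explanation field keeps the whole rationale, preserving the auditability property above.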
Contrast with alternatives¶
| Strategy | Pros | Cons |
|---|---|---|
| Single call, grammar-constrained JSON | simple, one call, one cost | reasoning-quality cost on hard tasks (concepts/structured-output-reliability framing) |
| Single call, prompt-asks-for-JSON, no grammar | no decoding constraint | parse-failure mode common; needs retry or Pydantic recovery |
| Decouple (this concept) | best-in-class reasoning + reliable format | two LLM calls per item, slightly higher latency + cost |
| patterns/drafter-evaluator-refinement-loop | best when quality needs iteration, not just format | higher cost, not aimed at format problem |
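For comparison, the "prompt-asks-for-JSON, no grammar" row typically needs a recovery loop around its parse-failure mode. A hedged sketch of that retry pattern (the `llm` callable and retry prompt wording are assumptions):

```python
import json


def call_with_json_retry(llm, prompt: str, max_retries: int = 2) -> dict:
    """Single-call strategy: ask for JSON in the prompt, retry on parse failure."""
    last_error = None
    for attempt in range(max_retries + 1):
        if attempt == 0:
            raw = llm(prompt)
        else:
            # feed the parse error back so the model can self-correct
            raw = llm(
                f"{prompt}\n\nYour last reply was not valid JSON ({last_error}). "
                "Reply with valid JSON only."
            )
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
    raise ValueError(f"no valid JSON after {max_retries + 1} attempts: {last_error}")
```

Each retry is an extra round trip, which is why this alternative can end up as slow as the two-pass decouple without its reasoning-quality benefit.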
Related patterns¶
- concepts/structured-output-reliability — the problem statement (Dash / Lyft / Slack all hit this); decouple is one of the solutions.
- concepts/pydantic-structured-llm-output — a format-layer primitive often sitting at Pass 2.
- patterns/one-model-invocation-per-task — the generalised single-responsibility framing.
Tradeoffs / when it doesn't apply¶
- When Pass 2 isn't actually easier. If the downstream JSON requires non-trivial semantic extraction or transformation, a single-pass structured-output call may be cheaper than two passes.
- When a single model handles both well enough. GPT-4-class and newer (Claude 3.5+, Gemini 1.5+, GPT-4o) emit structured output reliably for simple schemas; decouple adds latency for small gain.
- When latency budget is tight. Two LLM calls serially is double the round-trip time. For offline evaluation (LACE's case) this is fine; for real-time user-facing paths it may be unacceptable.
Seen in¶
- sources/2025-06-11-instacart-turbocharging-customer-support-chatbot-development-with-llm — canonical wiki instance at Instacart LACE: o1-preview reasoning → cheaper/parser formatting step, explicitly motivated by "JSON formatting... can negatively affect performance due to restricted decoding."