
Pydantic structured LLM output

Definition

Pydantic structured LLM output is the Python-ecosystem convention of declaring a Pydantic BaseModel for every LLM output shape and treating the model's JSON response as either parseable into that typed object or fully incorrect — no tolerance for free-form text slipping through.

The concept is the Python-idiomatic concrete form of concepts/structured-output-reliability: the schema is the contract; the validator is the boundary; every downstream consumer sees typed objects, never raw model text.

The pattern in practice

from pydantic import BaseModel
from enum import Enum

class Grade(str, Enum):
    PASS = "pass"
    REVISE = "revise"

class TranslationCandidate(BaseModel):
    text: str

class CandidateEvaluation(BaseModel):
    candidate_index: int
    grade: Grade
    explanation: str

class EvaluatorOutput(BaseModel):
    evaluations: list[CandidateEvaluation]
    best_candidate_index: int | None = None

# At the boundary (llm_response_text is the raw string returned by the model):
result = EvaluatorOutput.model_validate_json(llm_response_text)
# -> either a typed EvaluatorOutput, or a ValidationError is raised.

Downstream code consumes result.evaluations[0].grade as an enum member, never a string match. Schema drift (the model adds or renames a field) is caught at parse time, not during production execution — with the caveat that Pydantic ignores unknown keys by default, so catching *extra* fields requires opting in with model_config = ConfigDict(extra="forbid").
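A minimal sketch of fail-fast drift detection. The extra `confidence` field below is a hypothetical example of a model adding a key; `extra="forbid"` makes the parse reject it instead of silently dropping it:

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class CandidateEvaluation(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject unknown keys at parse time
    candidate_index: int
    grade: str
    explanation: str

# Hypothetical drifted response: the model added a "confidence" field.
drifted = '{"candidate_index": 0, "grade": "pass", "explanation": "ok", "confidence": 0.9}'

try:
    CandidateEvaluation.model_validate_json(drifted)
    drift_caught = False
except ValidationError as e:
    drift_caught = True  # surfaced here, at the boundary
    print("rejected:", e.error_count(), "validation error(s)")
```

Without `extra="forbid"`, the same response would parse cleanly and the new field would vanish — a quieter, harder-to-notice form of drift.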

Why this shape matters for agent pipelines

Multi-agent LLM pipelines pass intermediate outputs between agents — drafter → evaluator, planner → coder → verifier, etc. Each handoff is a potential format failure. Pydantic models:

  • Document the contract — the schema is the spec of what each agent emits.
  • Fail fast — malformed output surfaces at the boundary, not three agents downstream when a field is missing.
  • Enable structured-output APIs — OpenAI's response_format, Anthropic's tool use, and most other structured-output APIs accept JSON schemas that Pydantic generates natively (Model.model_json_schema()).
  • Compose with eval / logging — typed outputs serialize back to canonical JSON for snapshotting, replay, and regression harnesses.

Tradeoffs / gotchas

  • Schema changes are breaking. If the downstream consumer depends on a Pydantic shape and the schema changes, all cached LLM outputs + prompt-tuned few-shot examples need revisiting.
  • Small models break schemas more often than frontier models (concepts/structured-output-reliability). Pydantic validation catches the failure but doesn't fix it — mitigation still requires constrained decoding, few-shot examples, or model upgrade.
  • Pydantic is Python-specific. The same shape applies in TypeScript (zod), Rust (serde + schemars), Go (struct tags + encoding/json), etc — but the ergonomics and integration with LLM SDKs differ.
  • Enum values leak into the prompt. If Grade is an enum {PASS, REVISE}, the prompt must either include the enum values or rely on the LLM SDK's structured-output mode to constrain them. Silent mismatches ("Pass" vs "PASS") cause parse failures.
  • Free-form fields undermine the contract. A schema with explanation: str constrains shape but not content; the explanation string can still be hallucinated nonsense. Schema validation is about structure, not correctness.
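One mitigation for the enum-leak gotcha above: derive the allowed values from the enum itself when building the prompt, so the schema and the prompt text cannot drift apart (the prompt wording here is illustrative):

```python
from enum import Enum

class Grade(str, Enum):
    PASS = "pass"
    REVISE = "revise"

# Single source of truth: the enum's own values feed the prompt.
allowed = ", ".join(f'"{g.value}"' for g in Grade)
prompt = f"Grade each candidate. The grade field must be exactly one of: {allowed}."
print(prompt)
```

If `Grade` later gains a member, the prompt updates automatically — avoiding the "Pass" vs "pass" class of silent mismatch.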

Relationship to adjacent concepts

Seen in

  • sources/2026-02-19-lyft-scaling-localization-with-ai (canonical wiki instance). Lyft's AI localization pipeline uses Pydantic schemas as the contract between the Drafter and Evaluator agents. Documented shapes: DrafterOutput, TranslationCandidate, EvaluatorOutput, CandidateEvaluation, the Grade enum, and best_candidate_index: int. "This ensures type safety, reliable parsing, and clear contracts between Drafter and Evaluator."