PATTERN Cited by 1 source

SQL-Native Multimodal LLM Inference

SQL-Native Multimodal LLM Inference is the pattern of exposing LLM inference, including multimodal inference, as a callable function inside SQL, DataFrame, and streaming queries, so that model calls compose with the rest of the data pipeline like any other column expression. There is no separate model-serving service, no separate ETL job to fan out inference, and no glue code to bridge "the data warehouse" and "the model endpoint."

Problem

Conventional LLM-in-pipeline architectures bolt model serving onto the side of the data platform:

data warehouse ──> ETL job ──> HTTP request to model service ──> write back
                       ▲                  │
                       │                  ▼
                       └── retry logic ──┘
                       └── batch sizing ──┘
                       └── auth ──┘
                       └── model versioning ──┘
                       └── format-conversion glue ──┘

Every pipeline that wants LLM inference re-invents this scaffolding. Iteration on prompts means redeploying the ETL job. Schema changes mean coordinating across the model service and the warehouse. Multimodal input (images) requires custom encoding/serialization. A team that just wants "classify this column with an LLM" writes a service to do it.
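To make the duplicated scaffolding concrete, here is a minimal sketch of the fan-out loop that such an ETL job typically hand-rolls. Every name in it (batched, call_with_retry, run_job, flaky_send) is hypothetical, and the flaky stub merely stands in for an HTTP model service; it is an illustration of the glue described above, not code from the source.

```python
import time

def batched(rows, size):
    """Yield fixed-size slices of rows (the batch-sizing concern)."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def call_with_retry(send, batch, attempts=3, backoff_s=0.0):
    """Retry transient failures with linear backoff (the retry concern)."""
    for attempt in range(1, attempts + 1):
        try:
            return send(batch)
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)

def run_job(rows, send, batch_size=2):
    """Fan rows out to the model service and collect results for write-back."""
    results = []
    for batch in batched(rows, batch_size):
        results.extend(call_with_retry(send, batch))
    return results

# Hypothetical stub for the model service: fails once, then succeeds.
calls = {"n": 0}

def flaky_send(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated transient failure")
    return [text.upper() for text in batch]

labels = run_job(["a", "b", "c"], flaky_send)
print(labels, calls["n"])
```

Auth, model versioning, and format conversion are not even shown here, and each would add more code of the same kind to every pipeline.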

Solution

Expose model inference as a first-class function in the query language. The function takes a model endpoint reference, a prompt, an input column (text or image), and an output schema, and returns the model's response as a typed column.

SELECT
  doc_id,
  page_number,
  ai_query(
    'multimodal-classifier-endpoint',
    'Classify this scanned page. Return Dewey codes, geographies, and a water-relevance flag.',
    page_image,
    responseFormat => '{"type":"json_schema", "schema": {...}}'
  ) AS classification
FROM unity_catalog.archive.rendered_pages
WHERE document_length > 5

That's the pipeline. The model call is a column expression. There is no separate service to deploy, no separate ETL to schedule, no serialization glue. Multimodal input is just a column. Structured output is just a responseFormat parameter.
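The same idea can be tried locally. The following is a minimal sketch that assumes Python's built-in sqlite3 as a stand-in for the warehouse engine. The fake_ai_query stub, the 'stub-classifier' endpoint name, and the page_text column are hypothetical; this is not the Databricks ai_query implementation, only an illustration of a model call registered as a SQL function and used as a column expression.

```python
import json
import sqlite3

def fake_ai_query(endpoint, prompt, page_text):
    """Hypothetical local stub for a model endpoint; returns a JSON string."""
    water = any(w in page_text.lower() for w in ("well", "borehole", "aquifer"))
    return json.dumps({"endpoint": endpoint, "water_relevant": water})

conn = sqlite3.connect(":memory:")
# Register the stub so SQL can call it like any other scalar function.
conn.create_function("ai_query", 3, fake_ai_query)

conn.execute(
    "CREATE TABLE rendered_pages (doc_id INTEGER, page_number INTEGER, page_text TEXT)"
)
conn.executemany(
    "INSERT INTO rendered_pages VALUES (?, ?, ?)",
    [
        (1, 1, "Borehole log, depth 40 m"),
        (1, 2, "Index of plates"),
        (2, 1, "Aquifer survey notes"),
    ],
)

# The "model call" is just another column expression in the SELECT list.
rows = conn.execute(
    """
    SELECT doc_id,
           page_number,
           ai_query('stub-classifier', 'Is this page water-relevant?', page_text)
               AS classification
    FROM rendered_pages
    ORDER BY doc_id, page_number
    """
).fetchall()

classified = [(d, p, json.loads(c)["water_relevant"]) for d, p, c in rows]
print(classified)
```

Changing the prompt or the output shape in this sketch means editing the query string and rerunning it, which mirrors the iteration story the pattern is built around.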

In the MapAid groundwater pipeline

The MapAid pipeline uses ai_query in three load-bearing stages:

Classification pass — ai_query over sampled page-images produces Dewey codes + geographies + water flag.
Extraction pass — ai_query over per-page OCR'd text + schema-constrained output emits JSON well/borehole records.
Judge pass — ai_query against a different judge model scores each classification. See patterns/llm-judge-as-inline-pipeline-stage.
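The three passes can be sketched as three queries over one table. This is a toy reconstruction under stated assumptions, not the MapAid code: sqlite3 stands in for the warehouse, stub_ai_query and json_get are hypothetical helpers, the endpoint names are invented, OCR text stands in for page images, and filtering extraction to flagged pages is an assumption made for illustration.

```python
import json
import sqlite3

def stub_ai_query(endpoint, prompt, payload):
    """Hypothetical stand-in for a model call; dispatches on the endpoint name."""
    if endpoint == "classifier":
        return json.dumps({"water_relevant": "borehole" in payload.lower()})
    if endpoint == "extractor":
        # Take the first integer token as a depth; a real model would fill a schema.
        digits = [tok for tok in payload.split() if tok.isdigit()]
        return json.dumps({"depth_m": int(digits[0]) if digits else None})
    if endpoint == "judge":
        # Toy judge: score 1 if the classification flagged the page, else 0.
        return json.dumps({"score": 1 if json.loads(payload)["water_relevant"] else 0})
    raise ValueError(f"unknown endpoint: {endpoint}")

def json_get(doc, key):
    """Helper UDF: read one field from a JSON string, mapping booleans to 0/1."""
    value = json.loads(doc)[key]
    return int(value) if isinstance(value, bool) else value

conn = sqlite3.connect(":memory:")
conn.create_function("ai_query", 3, stub_ai_query)
conn.create_function("json_get", 2, json_get)
conn.execute("CREATE TABLE pages (page_id INTEGER, ocr_text TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [(1, "Borehole B7 depth 42 m"), (2, "Table of contents")],
)

# Stage 1: classification pass.
conn.execute("""
    CREATE TABLE classified AS
    SELECT page_id, ocr_text,
           ai_query('classifier', 'Classify this page.', ocr_text) AS classification
    FROM pages
""")

# Stage 2: extraction pass (restricted to flagged pages; an assumption).
conn.execute("""
    CREATE TABLE extracted AS
    SELECT page_id,
           ai_query('extractor', 'Extract well records.', ocr_text) AS record
    FROM classified
    WHERE json_get(classification, 'water_relevant') = 1
""")

# Stage 3: judge pass, using a different "model" to score each classification.
conn.execute("""
    CREATE TABLE judged AS
    SELECT page_id,
           json_get(ai_query('judge', 'Score this classification.', classification),
                    'score') AS score
    FROM classified
""")

extracted = [
    (pid, json.loads(rec)["depth_m"])
    for pid, rec in conn.execute("SELECT page_id, record FROM extracted ORDER BY page_id")
]
judged = conn.execute("SELECT page_id, score FROM judged ORDER BY page_id").fetchall()
print(extracted, judged)
```

Each stage reads the previous stage's table, so the judge is just one more query rather than a separate service.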

"Because AI Functions run directly within SQL, the team could iterate on prompts and output schemas without building separate model-serving infrastructure." (Source: sources/2026-05-11-databricks-unlocking-the-archives)

The architectural punch line is in that last clause: iteration cost is a query refactor, not an infrastructure deploy.

Mechanics

Inputs are columns. Text columns, image columns (e.g. Volume-stored page images), structured columns. The function doesn't care about input modality — the platform handles serialisation.
Outputs are typed columns. Schema-constrained responses (per concepts/schema-constrained-llm-output) deserialize directly into typed columns or structs.
Endpoints are referenced by name. Switch models by changing the endpoint string. Ship A/B tests with a CASE expression.
Composes with all SQL/DataFrame primitives. Filter on model output. Join model outputs to source tables. Window over them. Aggregate them. Stream them.
Storage substrate. Inputs live in Unity Catalog Volumes (raw files / images) or Delta Lake (intermediate tables); outputs land in Delta tables with ACID guarantees and lineage.
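Two of the mechanics above, switching endpoints by name and composing model output with ordinary SQL, can be illustrated with a small A/B sketch. It assumes sqlite3 as a local stand-in engine; stub_ai_query, json_get, the endpoint names 'model-a' and 'model-b', and the even/odd split are all hypothetical. Each branch of the CASE keeps its endpoint as a string literal.

```python
import json
import sqlite3

def stub_ai_query(endpoint, prompt, text):
    """Hypothetical stand-in for a model call; model-b flags more eagerly than model-a."""
    keywords = ("borehole",) if endpoint == "model-a" else ("borehole", "river")
    flagged = any(k in text.lower() for k in keywords)
    return json.dumps({"water_relevant": flagged})

def json_get(doc, key):
    """Helper UDF: read one field from a JSON string, mapping booleans to 0/1."""
    value = json.loads(doc)[key]
    return int(value) if isinstance(value, bool) else value

conn = sqlite3.connect(":memory:")
conn.create_function("ai_query", 3, stub_ai_query)
conn.create_function("json_get", 2, json_get)
conn.execute("CREATE TABLE pages (page_id INTEGER, ocr_text TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        (1, "Borehole log"),
        (2, "River gauge notes"),
        (3, "River crossing map"),
        (4, "Borehole yield test"),
    ],
)

ab_summary = conn.execute("""
    WITH assigned AS (
        -- Split rows into two arms by page_id parity.
        SELECT page_id, ocr_text,
               CASE WHEN page_id % 2 = 0 THEN 'model-a' ELSE 'model-b' END AS arm
        FROM pages
    ),
    scored AS (
        -- Route each arm to its own endpoint with a CASE expression.
        SELECT arm,
               CASE WHEN arm = 'model-a'
                    THEN ai_query('model-a', 'Classify this page.', ocr_text)
                    ELSE ai_query('model-b', 'Classify this page.', ocr_text)
               END AS raw
        FROM assigned
    )
    -- Aggregate directly over the model output.
    SELECT arm, COUNT(*) AS n_pages, SUM(json_get(raw, 'water_relevant')) AS flagged
    FROM scored
    GROUP BY arm
    ORDER BY arm
""").fetchall()
print(ab_summary)
```

The comparison between arms falls out of a GROUP BY, with no experiment framework beyond the query itself.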

What this pattern collapses

A traditional LLM pipeline has all of these as separate pieces:

Model-serving cluster
ETL/batch job that fans out requests
Retry / rate-limit logic
Batch-size tuning
Auth between ETL and model service
Model versioning + endpoint routing
Schema definition for inputs + outputs
Glue code to (de)serialise multimodal inputs
Glue code to write outputs back to warehouse

SQL-native multimodal inference collapses all of those into:

An ai_query call.

That's the pattern's value. Everything that used to be Python + Kubernetes is now a SQL clause.

When to use

LLM inference whose inputs and outputs naturally live in tables — text columns, image columns, structured records.
Pipelines whose iteration speed matters more than custom inference orchestration.
Teams without dedicated ML-platform engineers (analysts, scientists, partner organizations).
Multimodal workloads where input format would otherwise require custom serialisation glue.

When not to use

Real-time interactive inference within user-facing latency budgets: ai_query runs as part of a query plan, not a request handler. Use Databricks Model Serving / Foundation Model API direct endpoints for that.
Workloads with custom request shaping (streaming token output, custom retry semantics, fine-grained rate-limit handling) that exceed what the SQL surface exposes.
Stateful multi-turn agents — ai_query is per-row inference, not a conversation.

Tradeoffs

Vendor coupling. AI Functions are Databricks-specific. Equivalent surfaces exist in some other warehouses (Snowflake Cortex, BigQuery ML), but the pattern as a whole is bound to whichever data platform you're on.
Limited control over inference internals. Batching, retries, rate limiting, and prompt-cache behaviour are platform-managed — good for ergonomics, bad if you need fine control.
Cost transparency. Per-row inference cost shows up in compute bills, not as a separate model-serving line item, which can make it harder to attribute.

Seen in

sources/2026-05-11-databricks-unlocking-the-archives: canonical wiki instance. ai_query used in three stages (classify / extract / judge) of the MapAid groundwater pipeline; multimodal page-image input; schema-constrained JSON output; iteration-without-separate-infra explicitly framed as the architectural value.