SYSTEM Cited by 2 sources

Databricks AI Functions

Databricks AI Functions are SQL-callable LLM inference primitives exposed as built-in functions (the canonical one is ai_query) inside Databricks SQL / DataFrame / Structured Streaming. They run LLM calls inline with table data — no separate model-serving infrastructure required.

Stub page. Documented from two ingested sources so far (see "Seen in"); the operational profile and pricing details are not in scope.

Capabilities cited in ingested sources

  • ai_query for inference. Single SQL function that takes a model endpoint + prompt + (optionally) input columns and returns the model's response as a column.
  • Multimodal input. Image columns (e.g. rendered PDF pages stored in Unity Catalog Volumes) can be passed directly. The MapAid groundwater pipeline sends each scanned-page image through ai_query for classification — no OCR-as-prerequisite. See patterns/visual-first-document-extraction.
  • Structured JSON output. ai_query enforces a response schema so the team can capture site name / GPS / depth / yield as typed columns even when the underlying scans embed those fields in different formats. Canonicalised in concepts/schema-constrained-llm-output.
  • In-SQL iteration. Because AI Functions are just functions inside SQL/DataFrames, prompt tuning + schema changes are iterated like any query refactor — no separate inference service to deploy.
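To make the structured-output capability above concrete, here is a minimal sketch that builds the text of an ai_query call from Python. It is an illustration under assumptions, not the MapAid team's code: the endpoint, table, and column names are hypothetical, the schema fields are guessed from the site name / GPS / depth / yield example, and the `responseFormat` named argument and its JSON-schema wrapper should be checked against the current Databricks reference. The multimodal (image-column) variant is not shown.

```python
import json

# Hypothetical response schema for one well record. The field names are
# illustrative; the sources only mention site name / GPS / depth / yield.
WELL_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "well_record",
        "schema": {
            "type": "object",
            "properties": {
                "site_name": {"type": "string"},
                "latitude": {"type": "number"},
                "longitude": {"type": "number"},
                "depth_m": {"type": "number"},
                "yield_lps": {"type": "number"},
            },
        },
        "strict": True,
    },
}


def sql_literal(text: str) -> str:
    """Quote a Python string as a SQL string literal using backslash escapes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_extraction_query(endpoint: str, table: str, text_col: str,
                           prompt: str, schema: dict) -> str:
    """Return SQL text that applies ai_query to every row of `table`."""
    return (
        "SELECT\n"
        f"  {text_col},\n"
        "  ai_query(\n"
        f"    {sql_literal(endpoint)},\n"
        f"    CONCAT({sql_literal(prompt)}, {text_col}),\n"
        # Assumed named argument carrying the JSON schema as a string.
        f"    responseFormat => {sql_literal(json.dumps(schema))}\n"
        "  ) AS well_json\n"
        f"FROM {table}"
    )


if __name__ == "__main__":
    print(build_extraction_query(
        "my-llm-endpoint",      # hypothetical serving endpoint
        "scanned_pages",        # hypothetical table
        "page_text",            # hypothetical column
        "Extract the well record from this page: ",
        WELL_SCHEMA,
    ))
```

Because the prompt and the schema are ordinary arguments to a query builder, changing either one is a query edit rather than a service redeployment, which is the "in-SQL iteration" point made in the list.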

Architectural role

In the MapAid groundwater pipeline, AI Functions appear in three distinct stages:

  1. Classification pass: ai_query over sampled page images to produce Dewey Decimal codes, geographic tags, and a water-relevance flag.
  2. Extraction pass: ai_query over per-page OCR'd text + schema-constrained output to emit JSON well/borehole records.
  3. Judge pass: ai_query against a separate judge model to score each classification on accuracy / completeness / consistency. See patterns/llm-judge-as-inline-pipeline-stage.
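The three passes above can be sketched as one loop over page records that calls the same primitive three times with different endpoints and prompts. This is a plain-Python stand-in, not Databricks code: `ModelFn` plays the role of ai_query, and the endpoint names, prompts, and record keys are all hypothetical. Whether the real pipeline gates extraction on the water-relevance flag is not stated in the sources, so no gating is shown.

```python
from typing import Callable, Dict, List

# Stand-in for ai_query: (endpoint name, request text) -> response text.
ModelFn = Callable[[str, str], str]


def run_three_passes(pages: List[Dict[str, str]],
                     model: ModelFn) -> List[Dict[str, str]]:
    """Apply the classify, extract and judge passes to each page record."""
    rows = []
    for page in pages:
        # 1. Classification pass over the page image reference.
        classification = model(
            "classifier-endpoint",
            f"Classify this page: {page['image_ref']}",
        )
        # 2. Extraction pass over the page's OCR'd text.
        record = model(
            "extractor-endpoint",
            f"Extract well records as JSON: {page['ocr_text']}",
        )
        # 3. Judge pass: a separate endpoint scores the classification.
        verdict = model(
            "judge-endpoint",
            f"Score this classification: {classification}",
        )
        rows.append({
            "page_id": page["page_id"],
            "classification": classification,
            "record": record,
            "judge": verdict,
        })
    return rows


def fake_model(endpoint: str, request: str) -> str:
    """Deterministic placeholder so the sketch runs without Databricks."""
    return f"[{endpoint}] {len(request)} chars"


if __name__ == "__main__":
    demo = [{"page_id": "p1", "image_ref": "img-001", "ocr_text": "Borehole 7"}]
    for row in run_three_passes(demo, fake_model):
        print(row)
```

In the SQL-native version each pass would be a column expression or a downstream query rather than a Python call, but the data flow (page in; classification, record, and judge score out) is the same.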

The same primitive playing all three roles is the point: SQL-native inference + structured output makes the pipeline an SQL/DataFrame job, not a custom inference service. See patterns/sql-native-multimodal-llm-inference.

Seen in

  • sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library. Second canonical instance: Claroty's ETL pipeline from CSAF (Common Security Advisory Framework: JSON-formatted vulnerability advisories) to Delta tables, orchestrated by Lakeflow Jobs. "In this ETL, and in many more use cases, we use LLMs to enrich the data — from classification tasks and AI Functions like ai_query, using various Serving endpoints and MLflow to evaluate the answers we get from the LLM, using statistic metrics and LLM-as-a-judge, and monitor the cost." Same shape as the MapAid pipeline (ai_query for inline LLM enrichment plus a judge pass for evaluation), different domain (security advisories vs groundwater PDFs). The LLM-as-a-judge stage here is conservative-ternary (pass / fail / unknown) and explicitly framed against the absence of fully labelled ground truth in "real-world CPS data". It composes with patterns/llm-judge-as-inline-pipeline-stage for the step-by-step reliability scoring and with vector-search-no-scale-to-zero for the cost-efficiency observation in bursty event-driven workloads. Endpoint heterogeneity is explicit: ai_query calls fan out across "various Serving endpoints" rather than depending on a single foundation model.

  • sources/2026-05-11-databricks-unlocking-the-archives — canonical wiki instance. Three uses (classify / extract / judge) inside one pipeline; multimodal page-image inputs; schema-constrained JSON outputs; iteration-without-separate-infra explicitly called out as the architectural value prop.
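The conservative-ternary judge verdict (pass / fail / unknown) mentioned in the Claroty entry above could be normalised along these lines. This is a sketch under an assumption: "conservative" is read here as "anything that is not an unambiguous pass or fail becomes unknown", which the source does not spell out. The function name is hypothetical.

```python
def normalise_verdict(raw: str) -> str:
    """Map a raw judge reply to 'pass', 'fail', or 'unknown'."""
    token = raw.strip().lower().rstrip(".")
    if token in ("pass", "fail"):
        return token
    # Assumed conservative default: ambiguous or empty replies are unknown.
    return "unknown"


if __name__ == "__main__":
    for reply in ["PASS", "fail.", "probably pass", ""]:
        print(repr(reply), "->", normalise_verdict(reply))
```

Keeping "unknown" as an explicit third value avoids forcing a pass/fail label onto rows where no labelled ground truth exists to check the judge against.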
