SYSTEM Cited by 2 sources

Databricks Foundation Model API¶

Databricks Foundation Model API is the managed model-endpoint product that hosts Databricks-curated and customer-curated foundation models (text + multimodal) behind a stable API. Clients reach it directly or indirectly via AI Functions in SQL.

Stub page. One ingested-source instance so far — the MapAid groundwater pipeline uses a multimodal model served through the Foundation Model API as the OCR engine during the per-page extraction pass on water-flagged documents.

Capabilities cited in ingested sources¶

Multimodal serving. Handles English + Arabic + handwritten field notes + tabular data + mixed-format pages within a single endpoint call. The pipeline does not pre-route by language or format; the multimodal model handles it all natively (Source: sources/2026-05-11-databricks-unlocking-the-archives).
Page-image input for OCR-as-visual-task. Rather than running a classical OCR engine and then post-processing, the pipeline sends each page image to a Foundation-Model-API multimodal endpoint and treats the model's text response as the OCR output. See patterns/visual-first-document-extraction.
Entity-recognition during OCR pass. While returning page text, the same call is prompted to identify well/borehole identifiers as anchor entities so records spanning multiple pages can be linked back to a single site (Source: sources/2026-05-11-databricks-unlocking-the-archives).
Implicit prompt caching across the OSS model catalog (GA 2026-05-22). FMAPI ships implicit prompt caching as a default-on substrate property — "the caching is implicit: customers do not need to configure anything" — covering batch inference, pay-per-token, and provisioned-throughput workloads. Cache placement is volatile-memory only, isolated per tenant, never persisted (the volatile-only-prompt-cache-isolation safety envelope). Currently enabled for GPT-OSS 20B + 120B, Gemma 3 12B, fine-tuned Llama 3.1 8B (via PEFT serving), Llama 3.1 8B, and Llama 3.3 70B. Disclosed outcome on the GPT-OSS production batch-inference pipeline: +2.5× per-replica input-token throughput, 3× P50 latency reduction at a 30% cache hit ratio (Source: sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models).

Architectural role¶

In the MapAid groundwater pipeline the Foundation Model API serves the deep-extract pass of a two-pass classify-then-extract pipeline — only documents flagged as water-relevant during the cheap classification pass earn a full page-by-page Foundation-Model-API OCR call. This budget-allocation move is the point: cheap inference everywhere, expensive multimodal OCR only on the ~50% of the corpus that justifies it.

Relationship to AI Functions¶

ai_query is the SQL-callable front door; the Foundation Model API is a model endpoint that ai_query (or any other client) can target. They are not alternatives — in the MapAid pipeline they layer: ai_query calls multimodal endpoints served via the Foundation Model API.

Relationship to Databricks Model Serving¶

systems/databricks-model-serving is the broader managed real-time-inference platform (covered at platform-internals depth in the 2026-05-08 Databricks/Superhuman post); the Foundation Model API is the specifically-foundation-model-shaped product layered on top of that infrastructure.

Seen in¶

sources/2026-05-11-databricks-unlocking-the-archives — canonical wiki instance for multimodal OCR + entity recognition during the extraction pass on water-flagged documents.
sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models — GA disclosure of FMAPI prompt caching for open-weights models: implicit, volatile- only, multi-tenant-isolated; +2.5× throughput / 3× P50 latency reduction at 30% hit ratio on the GPT-OSS batch-inference rollout.