Databricks
Databricks Engineering blog. Tier-3 source on the sysdesign-wiki: most posts are product/marketing/ML-methodology oriented and get skipped, but infra-architecture posts (Kubernetes, service mesh, data-platform internals) are worth ingesting when they appear.
Internally Databricks runs hundreds of stateless gRPC services per Kubernetes cluster across thousands of clusters in multiple regions, predominantly in Scala on a monorepo with fast CI/CD. That monoculture is the architectural enabler for several of their infra-platform choices — notably the proxyless service-mesh design.
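The proxyless design relies on each client choosing a backend itself; the endpoint-discovery entries below name Power-of-Two-Choices (P2C) as the selection policy. A minimal sketch of that policy follows, assuming a client-local in-flight counter as the load signal. The `Endpoint` class and the `pick_p2c` / `send` helpers are hypothetical names for illustration, not Databricks or Armeria APIs.

```python
import random
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical client-side view of one backend pod."""
    address: str
    in_flight: int = 0  # requests this client currently has open to the pod


def pick_p2c(endpoints, rng=random):
    """Sample two distinct endpoints and return the less loaded one."""
    if not endpoints:
        raise ValueError("no endpoints available")
    if len(endpoints) == 1:
        return endpoints[0]
    a, b = rng.sample(endpoints, 2)
    return a if a.in_flight <= b.in_flight else b


def send(endpoints, rng=random):
    """Pick a pod via P2C and record the new in-flight request."""
    chosen = pick_p2c(endpoints, rng)
    chosen.in_flight += 1
    return chosen
```

Comparing only two random candidates keeps selection cheap while still steering traffic away from the busiest pods. A real client would also decrement `in_flight` when a response arrives and refresh the endpoint list from the discovery stream.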
Key systems
- systems/databricks-custom-model-serving — Custom Model Serving (first wiki disclosure 2026-06-11). Fully managed real-time inference platform for any MLflow model — from 2 MB scikit-learn classifiers to 70B LLMs — without customer-facing tuning knobs. Three structural properties: isolated K8s deployments per endpoint, automatic runtime selection (Gunicorn/vLLM/Triton), and the AutoPilot Pod Autoscaler. Operating envelope: 300K+ QPS, 99.99% availability, 10→10K QPS in <60s. Eliminates the ML Stack Tax.
- systems/databricks-autopilot-pod-autoscaler — AutoPilot Pod Autoscaler (APA) (first wiki disclosure 2026-06-11). Custom Kubernetes controller implementing two-axis autoscaling: horizontal (request-based, 5s interval) + vertical (model-aware concurrency tuning, 30s interval). The two axes are coupled: the vertical output feeds the horizontal formula. Asymmetric in both axes: aggressive up, conservative down. Heart of Custom Model Serving.
- systems/lakebase-scm-extension — Lakebase SCM Extension (first wiki disclosure as a dedicated page 2026-05-30). The open-source VS Code / Cursor IDE extension maintained by databricks-solutions that synchronises a developer's git branch with a matching Lakebase database branch — automating the per-developer paired-branch pattern inside the IDE — and surfaces the Branch Diff Summary view that the 2026-05-29 evolutionary-database-development post cites as the canonical schema-diff format. The IDE-substrate-glue primitive that closes the loop on Jen's per-developer branching workflow; alternative to the databricks postgres create-branch CLI flow. Public GitHub repo, behaviour-only disclosure in the source so far; architectural depth deferred to the forthcoming Companion: Plugin Walkthrough post that the 2026-05-29 article forward-references.
- systems/enzyme-ivm — Enzyme (first wiki disclosure as a dedicated page 2026-05-30). The incremental-view-maintenance engine that powers SDP's @dp.materialized_view decorator. Subject of the SIGMOD 2026 honorable-mention paper "Enzyme: Incremental View Maintenance for Data Engineering" (arXiv:2603.27775, presented by Ritwik Yadav at SIGMOD 2026 in Bangalore). Establishes the two-track incremental-processing architecture inside SDP: Enzyme on the materialized-view track, Structured Streaming on the explicit-streaming track, mix-and-match in one pipeline. Four novel claims over prior industrial IVM: (a) full MV-grammar coverage including joins + windows + aggregations + combinations; (b) non-deterministic function support (current_date(), AI functions); (c) multi-language MVs (Python + SQL); (d) cost-model-driven incrementalisation strategy (partition-level vs row-level updates per run, selective intermediate-result caching, plan-info + prior-execution-stats inputs). Articulated thesis: MV-as-ETL primitive — "if MVs can be efficiently and incrementally maintained, it will significantly simplify ETL workloads which otherwise require writing complex custom code".
- systems/iceberg-v3 — Apache Iceberg v3 (first wiki disclosure as a dedicated page 2026-05-29). Spec-level milestone for Apache Iceberg, GA on Databricks 2026-05-28: three new format-level primitives — deletion vectors (file-level row-delete representation accelerating updates / merges / deletes without rewriting data files), row tracking (stable per-row identity for efficient incremental processing), VARIANT type (standard semi-structured-data type) — applied across managed Iceberg, foreign Iceberg, and UniForm-enabled managed tables. Delta-side cross-format compatibility called out verbatim as architectural precondition for the forward-looking Iceberg-v4 + Delta-5.0 adaptive-metadata-tree alignment (concepts/format-co-evolution-iceberg-delta).
- systems/iceberg-rest-catalog-scan-api — Iceberg REST Catalog Scan Planning API (first wiki disclosure as a dedicated page 2026-05-29). The Iceberg-1.11 client API surface that Unity Catalog uses to extend ABAC across the engine boundary. Server-side scan planning — the catalog evaluates ABAC policies during plan-scan and returns a filtered scan plan; engines read only authorised data. Compatible engines: any implementing the Iceberg-1.11 scan-planning client (Spark, DuckDB named in the announcement). Wiki's first canonical instance of scan-planning-as-policy-enforcement-point.
- systems/databricks-axon — Axon (first wiki disclosure as a dedicated page 2026-05-29). The Databricks LLM data-plane router, named publicly for the first time in the 2026-05-27 Reliable LLM Inference at Scale post: "the data plane runs a router, which we call Axon, that balances load among replicas of the same model." Built on Dicer. Two structural properties distinguish it from the EDS+P2C path documented for the rest of Databricks Model Serving: the load metric is model units (not active request count), and stateful (sticky) sessions route a workload's requests to a Dicer-assigned subset of pods — serving prefix-cache locality and blast-radius bounding simultaneously. Sits between rate-limiting and the inference runtime in the data plane. Production scale: 125T+ tokens/month across frontier open-source (Kimi, Qwen) + proprietary (OpenAI, Gemini, Claude) models. The wiki's first canonical instance of patterns/cost-based-load-balancing-llm + patterns/stateful-llm-session-routing.
- systems/neon — Neon (first wiki disclosure as a dedicated page 2026-05-29; prior recurring tag mention across systems/lakebase / systems/pageserver-safekeeper). Serverless, separated-compute-and-storage Postgres, acquired by Databricks in 2025; architecturally identical to systems/lakebase (the Databricks-branded packaging). Operates the neonstatus.com public status surface. The 2026-05-27 reliability roadmap discloses Neon's empirical session-lifetime distribution: 90% of compute sessions for auto-suspending databases are <10 min — the load-bearing signal for the "control plane is the new data plane" reframe. Engineering lineage: Stas Kelvich (Neon co-founder) → Postgres internals expertise (multi-master replication with quorum commit, cross-node snapshot isolation under loosely-synchronised clocks).
- systems/sqlsmith — SQLsmith (first wiki disclosure 2026-05-29). Open-source random SQL query generator paired with SQLancer in systems/lakebase's release-gate validation harness. SQLsmith finds crashes / assertion-violations under random query loads; SQLancer finds logic bugs against an SQL-standard oracle. Both run while fault injection is running so postcondition violations are detected during chaos drills. Verbatim from the source: "We utilize open source tools like SqlLancer and SqlSmith, along with similar internal tools, to verify correct Postgres behavior." See patterns/continuous-fault-injection-in-production.
- systems/databricks-metric-views — Metric Views (first wiki disclosure 2026-05-27). UC-resident headless-BI semantic layer: define metrics ONCE in Unity Catalog (SQL or UC Explorer UI), and every consumer (AI/BI Dashboards, Genie, SQL notebooks, third-party BI tools) resolves the same MEASURE() definition. Semantic metadata fields (display_name/comment/synonyms) double as AI grounding context for Genie — natural-language questions get mapped to the right measure / dimension via the metadata, "no custom prompts, no separate glossary". Materialization is a substrate property (auto pre-aggregation + incremental refresh + intelligent query rewriting + transparent routing) — collapses three coupled artifacts (aggregate tables + refresh pipelines + BI-tool query updates) into one governed primitive. Open-standard provenance: SPARK-54119 (Apache Spark Metric Views OSS implementation) + UC OSS support coming. Canonical instance of concepts/headless-bi-semantic-layer + concepts/metric-view-materialization + patterns/governed-metric-as-headless-bi-substrate + patterns/auto-materialized-aggregation-via-semantic-layer + patterns/query-rewrite-to-pre-aggregated-materialization.
- systems/databricks-predictive-optimization — Predictive Optimization (first wiki disclosure as a dedicated page 2026-05-27; prior recurring tag mention only). Default-on for UC managed tables; automatically runs OPTIMIZE, VACUUM, and statistics collection on tables that would benefit, "so you don't need to schedule these jobs yourself". Inline-during-Photon-writes collection of two statistics planes — Delta data-skipping statistics (per-file min/max/null for file pruning) + query optimizer statistics (cardinalities + distributions for plan choice). Back-fills stats for existing tables. 22% average performance improvement in observed workloads; "For BI workloads with repetitive filter patterns, the impact is especially significant". Drives the CLUSTER BY AUTO option on Liquid Clustering — workload-aware automatic cluster-key selection. Canonical instance of concepts/automatic-table-optimization + concepts/optimizer-statistics-as-skipping-substrate.
- systems/databricks-sql-warehouses — DBSQL Warehouses (first wiki disclosure as a dedicated system 2026-05-27). Compute layer for BI / analytical SQL queries on the Lakehouse; serverless auto-scaling (absorbs concurrency bursts, pay-per-use); two-tier cache hierarchy — disk cache (warehouse-local SSD for hot Parquet files) + Query Result Cache (QRC) (full results keyed on query text + table version, served without re-execution). For BI's repetitive query patterns: "caching turns many requests into millisecond-latency responses at near-zero compute cost". Reflexive observability surface: system.billing.usage + system.query.history system tables. Canonical instance of concepts/dbsql-caching-tiers.
- systems/databricks-ai-bi-dashboards — AI/BI Dashboards (first wiki disclosure 2026-05-27). First-party Databricks dashboard surface; one of four named consumers of Metric Views alongside Genie / SQL notebooks / third-party BI tools. Source's worked example: warehouse-metrics dashboard queried from the metv_dbsql_metrics Metric View (reflexively monitoring the warehouse via the same primitive the warehouse serves).
- systems/octopus-margin-data-pipeline — Octopus Margin Data Pipeline (first wiki disclosure 2026-05-23). Customer-built Databricks-resident three-stream grain-aligned data pipeline for Octopus Energy's margin / settlement / commercial-KPI calculations under the UK MHHS regulatory transition (2 reads/customer/month → 48 reads/day, 48× volume). Settlement (HH) + Half-Hourly (smart tariffs: EVs, heat pumps, ToU) + Monthly (standard tariffs) on a unified multi-terabyte HH-grain source-of-truth substrate, orchestrated by "Job of Jobs". Single highest-leverage move: Delta CDF on the substrate — 25 B → 300 M rows/run (98.8%), weekly → daily freshness. Operating envelope: $0.48/settlement date (~50× below the projected MHHS cost, 2× below the legacy despite 48× more data), ~$1M/yr cost avoidance (excludes upstream-incremental savings — full gain is larger), 3 months, team of three. Canonical instance of concepts/grain-misalignment + patterns/grain-aligned-stream-split + patterns/job-of-jobs-orchestration + patterns/broadcast-join-for-small-reference-tables (<500 MB threshold disclosed) + concepts/remove-before-add-optimization (Z-ordering / ANALYZE / custom shuffle named as audit targets; AQE outperformed hand-tuning and the team deleted code).
- systems/spark-aqe — Spark Adaptive Query Execution (first wiki disclosure as dedicated page 2026-05-23; prior recurring tag mention only). Runtime query optimiser inside Spark — re-plans during execution using post-shuffle row counts, observed skew, and post-filter join input sizes (statistics the static planner doesn't have). The Octopus rebuild canonicalises the trust-the-optimiser-as-an-architectural-move face: "In several cases, Spark's Adaptive Query Execution (AQE) outperformed hand-tuned logic. The team removed custom optimisation code and let AQE do its job." The action item is deletion, not addition. Pairs with concepts/remove-before-add-optimization as the generalised principle.
- systems/liquid-clustering — Liquid Clustering (first wiki disclosure as dedicated page 2026-05-23; prior recurring tag mention across systems/uc-otel-trace-tables, systems/uc-managed-tables, systems/zerobus-ingest, systems/delta-lake, concepts/lakehouse-native-observability, patterns/telemetry-to-lakehouse, patterns/managed-otel-ingestion-direct-to-lakehouse). Delta Lake feature that "dynamically co-locates related records on the specified clustering keys without requiring fixed partition boundaries" — the partition-replacement primitive. The Octopus disclosure enumerates the over-partitioning failure modes verbatim: "Liquid clustering avoids the small-file problem, higher memory consumption, and I/O overhead that come from over-partitioning." Pairs with patterns/broadcast-join-for-small-reference-tables in the "join and partition tuning" category.
- systems/zerobus-ingest — Zerobus Ingest (first wiki disclosure 2026-05-22; deep architecture disclosure 2026-06-11). Managed serverless push-based streaming ingestion engine. The 2026-06-11 post reveals three key internals: (a) dynamic partitioning with stream-connection-level ordering (not partition-level) enabling true elastic autoscaling; (b) Zeroparser — Rust-based zero-copy protobuf decoder achieving ~1 GB/s/core with dynamic descriptors, outperforming codegen (OSS); (c) latency-optimized WAL with async offset acks via gRPC bidirectional streaming, Delta Kernel Rust for final Delta commit. Benchmark: 12 GB/s sustained / 12M rows/sec / 1.04T rows in 24h from 2,048 concurrent streams. Also serves as the OTel-protocol ingestion endpoint (OTLP/gRPC + REST) for the single-sink telemetry architecture; canonical instance of patterns/stream-connection-as-ordering-unit + patterns/zero-copy-protobuf-decoding + patterns/wal-before-lakehouse-publish.
- systems/uc-otel-trace-tables — UC OTel Trace Tables (first wiki disclosure 2026-05-22). Six MLflow-derived UC Delta views — <prefix>_otel_spans (per-request span execution), _otel_logs (structured log/event data), _otel_metrics (numerical telemetry), _otel_annotations (MLflow-specific tags / assessments / feedback / expectations / run-links), _trace_unified (one record per trace with raw spans + metadata), _trace_metadata (MLflow trace metadata grouped by trace ID — "more performant than the unified view when you only need MLflow trace metadata"). Auto liquid-clustered as of the latest product update; MLflow per-experiment trace cap eliminated. Inherits UC governance for free — column masking on prompt/response columns, row-level filtering by tenant tag, RBAC, audit logs. Sibling lakehouse audit substrate to Inference Tables at a different granularity (one row per span vs one row per model call).
- systems/mlflow-otel-tracing — MLflow OTel Tracing (first wiki disclosure 2026-05-22). Framework-side OTel instrumentation surface within MLflow 3 — mlflow.<lib>.autolog() for automatic per-call span emission (LangChain, LangGraph, OpenAI SDK, etc.) + @mlflow.trace decorator for the request-level root span. Provisions the UC OTel trace tables from Python on experiment setup; same code path whether the agent runs inside Databricks, in the customer's VPC, on a developer laptop, or in a third-party cloud ("In fact the support assistant agent example that was used for this blog is deployed locally"). Substrate for the prod-traces-as-eval-dataset flow (see concepts/production-traces-as-evaluation-substrate + patterns/bootstrap-eval-dataset-from-production-traces) — same judges run on bootstrapped historical-prod traces in dev and on live traces in production.
- systems/databricks-fmapi-prompt-caching — FMAPI Prompt Caching (GA on open-weights models 2026-05-22). Implicit, volatile-only, tenant-isolated KV-cache reuse shipped as a default-on substrate property of the Foundation Model APIs; every layered product (Agent Bricks, Genie, AI Functions) inherits the win at no integration cost. Disclosed numerical signature on the GPT-OSS batch-inference rollout: +2.5× per-replica throughput, 3× P50 latency reduction at 30% cache hit ratio. Catalog: GPT-OSS 20B + 120B, Gemma 3 12B, Llama 3.1 8B (incl. PEFT-served fine-tuned variants), Llama 3.3 70B.
- systems/uc-managed-tables — Unity Catalog Managed Tables (first wiki disclosure 2026-05-14). Open-API-accessible managed Delta tables; Predictive Optimization + Liquid Clustering produce "up to 20× faster queries and 50% lower storage costs"; external engines (Apache Spark, Apache Flink, DuckDB) can now create, read, write, and stream to/from these tables via Delta Kernel; safety substrate is UC catalog commits (serialized commits + complete auditability + multi-table-transaction coordinator). Beta version pinning: Delta-Spark 4.2 + UC 0.4.1.
- systems/uc-credential-vending — Unity Catalog Credential Vending API (GA for tables 2026-05-14, Public Preview for Volumes). Mints short-lived, scoped credentials on demand for external engines; M2M OAuth replaces personal access tokens (named "per-user, long-lived, and hard to rotate"); engines auto-refresh via the vending API so "pipelines that run for hours complete reliably without tokens expiring mid-job." Same primitive extends to UC Volumes for unstructured assets (images / PDFs / videos). Canonical instance of concepts/credential-vending.
- systems/delta-kernel — open-source Java + Rust library for reading, writing, and committing to Delta tables (first wiki disclosure 2026-05-14). Abstracts the Delta protocol so "connector developers can focus on UC integration, not Delta implementation." Three named adopters: Apache Spark (via Delta-Spark 4.2), Apache Flink (via Delta Flink), DuckDB. Canonical instance of patterns/connector-library-as-protocol-abstraction — the ecosystem-growth lever that lets any engine integrate with Unity Catalog at low cost.
- systems/databricks-apps — the workspace-resident application runtime (first wiki disclosure 2026-05-13). Web app deployed inside the Databricks workspace; authenticates as a workspace service principal; queries Unity Catalog via the SQL Statement API; calls AI/BI Genie over the workspace REST API — "all on internal connections". Composes with systems/lakebase (operational app state, scale-to-zero) + UC (governed analytical data + ML audit substrate) + Genie (embedded NL query) + systems/mlflow (model + attribution versioning) into the single-platform application architecture thesis. Reference implementation: systems/site-feasibility-workbench (FastAPI + React, ~30 min deployment; clinical-trial site selection under FDORA 2022 + 21 CFR Part 11 + ICH E6(R3) + FDA GMLP). Forward roadmap: three additional Databricks Apps (Patient Cohort and Recruitment, Enrollment Velocity Optimizer, Risk-Based Monitoring and Compliance) on the same shape.
- systems/site-feasibility-workbench — first public open-source Databricks App (released 2026-05-13). FastAPI backend + React frontend; six-step guided clinical-trial site-selection workflow; TA-segmented LightGBM models trained on sponsor CTMS / EDC / IRT history; per-recommendation SHAP attributions written to UC-governed Delta tables (substrate + pattern). Saved shortlists persist to systems/lakebase; AI/BI Genie embedded for cross-domain NL-query against the same UC tables the ML models trained on. Reference implementation for concepts/single-platform-application-architecture and patterns/in-workspace-app-as-decision-support.
- systems/databricks-ai-functions — the SQL-callable LLM inference primitive (ai_query) used as the universal inference surface in the 2026-05-11 MapAid groundwater pipeline ingest. Multimodal input (page images), structured JSON output, and three load-bearing roles in one pipeline (classification / extraction / judge) without a separate model-serving service. Canonical instance of SQL-native multimodal LLM inference.
- systems/databricks-foundation-model-api — managed multimodal model endpoint behind AI Functions; serves the full-page OCR + entity-recognition pass on water-flagged documents in the MapAid pipeline.
- systems/databricks-asset-bundles — declarative pipeline-packaging unit. The MapAid groundwater pipeline ships as one bundle deployable + runnable with one command; pipeline-vs-archive decoupling makes it portable to other domains (other water archives, regions, scanned-document corpora). Canonical instance of patterns/asset-bundle-single-command-deployment.
- systems/lakeflow-jobs — orchestration layer running the MapAid multi-stage pipeline on serverless compute.
- systems/unity-catalog-volumes — versioned, governed object storage for non-tabular assets (raw scanned PDFs/TIFFs/JPGs and per-page rendered images in the MapAid pipeline). First wiki disclosure of UC's Volumes face.
- systems/databricks-model-serving — the managed real-time inference platform disclosed at platform-internals depth in the 2026-05-08 Databricks / Superhuman joint post. Two-layer co-engineered stack: platform layer (EDS-driven Power-of-Two-Choices LB + request_concurrency-based asymmetric autoscaler + lazy-loading container image) and runtime layer (FP8 quantisation with per-channel scaling + hybrid-precision toggle + multiprocessing RPC runtime for the CPU-bound regime + async CPU-GPU scheduler). Operating envelope (Superhuman workload): 200,000+ QPS peak, sub-1-second p99, four-nines reliability, with per-pod throughput 750 → 1,200 QPS on H100 (+60%) for a ~50/50-token-shape LLM. Canonical instance of the managed-serving-without-giving-up-control split: customer owns model + quantisation + quality bar; platform owns runtime + LB + autoscaler + image substrate.
- systems/databricks-endpoint-discovery-service — the xDS control plane Databricks built for intra-cluster Armeria RPC (2025-10-01) and promoted to managed external inference at 200K+ QPS via the 2026-05-08 Superhuman migration. Watches the Kubernetes API for Services + EndpointSlices, streams endpoint state to clients implementing P2C. Same control plane, expanded blast radius across two altitudes (intra-cluster service mesh + external inference).
- systems/superhuman-grammar-correction-model — Superhuman's custom LLM serving real-time grammar / clarity / tone / style suggestions across the Superhuman productivity suite (Coda, Mail, Go) at peaks of 200,000+ QPS for 40M+ daily users with sub-1s p99 and four-nines reliability. Canonical wiki instance of a small-fast-LLM at massive QPS — and therefore the canonical instance of the CPU-bound serving regime on H100. Pre-migration stack: DIY vLLM on L40S; post-migration: Databricks Model Serving on H100.
- systems/claroty-cps-library — Claroty's AI-Powered CPS Library, the asset-identity layer behind the xDome cyber-physical-systems protection platform (2026-05-13 disclosure). Catalog scale: 17 million+ assets consolidated as canonical CPS-IDs from noisy plant-floor evidence (protocol-derived model strings, vendor codes, firmware markers, OEM PDFs) into vulnerability records (CVEs / CISA advisories / NVD / CPE). Hybrid Entity Resolution architecture composing Databricks primitives: Medallion over Delta Lake with Delta CDF driving a versioned mapping registry; orchestrated multi-agent system (NLP / Reasoning / HITL) on Model Serving custom endpoints + Knowledge Assistant + Information Extraction; MLflow continuous evaluation via LLM-as-a-Judge against concept drift; Lakebase as transactional asset-mapping store with strict constraints; Apps + Lakebase as the SME human-in-the-loop UI; Lakeflow Jobs orchestrating CSAF security-advisory ETL with ai_query per-step. Reported outcomes: 25% improvement in vulnerability identification accuracy; 56% of analysed devices receive new or updated security recommendations for previously-invisible outdated firmware. Canonical instance of patterns/hybrid-classical-er-plus-genai + patterns/orchestrated-multi-agent-entity-resolution.
- systems/databricks-serverless-compute — the serverless Apache Spark product framed under the thesis "stability becomes a system property rather than a user responsibility" (2026-05-06). Composes three systems: Spark Connect (gRPC driver-client split), Serverless Gateway (three-signal workload-aware routing), and Serverless Autoscaler (two-axis adaptive scaling with OOM-aware VM restart). Scale: 25+ major Spark runtime upgrades per year at 99.998% success across >2 billion workloads (per SIGMOD/PODS '25 "Blink Twice" paper). Customer outcomes: CKDelta 20 min vs 4–5 hr, Unilever 2–5× faster + 25% cost reduction, HP 32% savings + 36% runtime reduction. Canonical instance of concepts/stability-as-system-property at Spark's compute altitude.
- systems/spark-connect — the gRPC client-server rearchitecture of Spark's driver. "The most significant architectural transformation in Spark's history" — user application code no longer co-executes with the driver; queries travel as serialised logical plans over gRPC. Unit of execution shifts from processes to queries. Canonical instance of patterns/grpc-decoupled-driver-client at the Spark altitude.
- systems/databricks-serverless-gateway — workload-aware router using three real-time signals (logical-plan-derived query size + cluster utilisation + interactive-vs-batch latency profile) with continuous re-evaluation. Resolves the concepts/utilization-vs-predictability-tradeoff at the pool layer. Canonical instance of patterns/multi-signal-workload-aware-gateway-routing.
- systems/databricks-serverless-autoscaler — two-axis adaptive autoscaler (horizontal + vertical). OOM-aware VM-restart primitive detects task-level OOM and re-executes the task on a larger VM without job failure. Canonical instance of concepts/adaptive-oom-recovery + concepts/vertical-and-horizontal-autoscaling + patterns/oom-aware-vm-restart-autoscaling.
- systems/pantheon — Databricks' internal fork of CNCF Thanos, the TSDB at the core of Databricks' global monitoring stack. 160+ instances across ~70 cloud regions on 3 major clouds; ~5 billion active in-memory timeseries and >10 trillion samples/day. Two Receive groups with distinct memory-retention tiers (2h for persistent services, 30m for ephemeral serverless) matched to workload lifespan. Purpose-built 3-controller control plane (Rollout Operator / Hashring Controller / Autoscaling + Self-Healing Controller) remediating dozens of incidents per week. At-least-once block uploads (2 of 3 StatefulSets upload) cut object-storage egress. Migration saved "millions of dollars in annual cloud costs" with "~5× reduction in monitoring infrastructure downtime." Upstream contributions to Thanos.
- systems/hydra — Databricks' lakehouse-native observability platform for raw unaggregated high-cardinality troubleshooting data. Spark Structured Streaming + Databricks Auto Loader ingest 20 billion active unaggregated timeseries into Delta Lake with ~5 min freshness. ~50× cheaper storage than Thanos. PromQL-to-SQL translation layer lets Grafana + existing dashboards query Delta tables unmodified. Unified metric semantics across both paths (Pantheon TSDB + Hydra lakehouse) — engineers don't distinguish.
- systems/telegraf — InfluxData OSS agent, deployed as the cardinality shield in front of Pantheon. >1 GB/s per region, thousands of aggregation rules, built on Dicer sticky routing (not Kafka) to preserve in-memory aggregator state across redeploys. Absorbed a 2-5× metric surge during an infra incident so Pantheon only saw a 20% surge.
- systems/dicer — Databricks' open-sourced (2026-01) auto-sharder. Dynamic slice-range sharding with hot-key isolation + replication, eventually-consistent Assignments, state transfer across reshards. Used by Unity Catalog, Softstore, the SQL query orchestration engine, and "every major Databricks product". 2026-05-05 addition: powers the Telegraf metric-aggregation tier above via patterns/sticky-routing-for-aggregator-state.
- systems/unity-catalog — unified governance service; Dicer-backed sharded in-memory cache drove 90–95% hit rate and drastic DB-load reduction. Also the hub of customer-facing data meshes — federates external Iceberg catalogs and exchanges data via systems/delta-sharing. 2026-05-13 GA disclosure: also the policy-evaluation engine for the organize → detect → protect governance pipeline, hosting three co-designed primitives (governed tags + ABAC policies + agentic data classification) inside one permission + metadata model.
- systems/unity-catalog-abac — ABAC policy primitive (GA 2026-05-13). Evaluates tag-based conditions to apply row filters + column masks automatically across catalogs/schemas. 10K+ policies per metastore, 100+ per catalog/schema (10× growth at GA). Session identity evaluation for views/functions closes the view-as-bypass failure mode. Single VARIANT UDF can mask INT/DOUBLE/DECIMAL/STRUCT columns at once. Customer testimonials: Atlassian (operational-overhead reduction), Udemy ("Fewer policies, lower costs, surgical precision").
- systems/unity-catalog-governed-tags — account-level governed tag taxonomy (GA 2026-05-13). Tags attach to catalogs, schemas, tables, columns; inherit parent → child; CREATE/MANAGE permissions distinct from data ownership; full SQL DDL + REST + UI + Terraform lifecycle. The attribute foundation ABAC policies evaluate against and the output substrate Data Classification writes into.
- systems/unity-catalog-data-classification — agentic classifier (GA 2026-05-13; custom classifiers in Beta). Pattern-recognition + metadata + LLM signals continuously tag sensitive columns. Built-in classifiers cover GDPR / HIPAA / GLBA / DPDPA / PCI plus UK / Germany / Australia / Brazil regional packs (India + Canada coming May 2026). Custom classifiers learn from already-tagged columns. Human-in-the-loop FP exclusion improves precision over time. Output substrate is the same governed-tag vocabulary humans use.
- systems/delta-sharing — open cross-cloud / cross-metastore / cross-partner data-exchange protocol. Used by Mercedes-Benz for three deployment shapes (cross-hyperscaler, cross-region, external partner) on one wire protocol.
- systems/delta-lake — Databricks' open table format; Deep Clone is the incremental-replication primitive behind patterns/cross-cloud-replica-cache.
- systems/softstore — distributed KV cache built on Dicer; canonical example of Dicer's state-transfer (~85% hit rate preserved through rolling restarts vs. ~30% drop without).
- systems/databricks-endpoint-discovery-service — custom xDS control plane watching Kubernetes services/EndpointSlices, feeding both Armeria RPC clients (internal) and Envoy ingress (external) off one source of truth.
- systems/armeria — shared Scala RPC framework; host of embedded client-side LB + xDS subscription code.
- systems/storex — internal AI-agent platform for database debugging across the global fleet; central-first sharded architecture + DSPy-inspired tool framework + snapshot-replay validation with judge LLMs.
- systems/dspy — Databricks-sponsored programmatic prompt framework; cited as inspiration for Storex's tool/prompt decoupling.
- systems/mlflow — Databricks-originated ML lifecycle platform; hosts Storex's judges primitive and prompt-optimization tooling.
- systems/lakebase — Databricks' serverless Postgres (Neon lineage, 2025 acquisition); Pageserver + Safekeeper durable storage, ephemeral Postgres compute VMs. 2026-04-20 CMK rollout ingested; 2026-04-27 first production deployment (LangGuard) ingested.
- systems/langguard — runtime enforcement layer for agentic workflows, profiled 2026-04-27 as one of the first startups on Lakebase. Intercepts every agent tool/data/model invocation and returns allow/deny/modify synchronously. Built by the IBM QRadar (SIEM) team; canonical articulation of "database architecture is destiny" for bursty security-telemetry-shaped workloads.
- systems/grail-data-fabric — LangGuard's patent-pending governance engine; live knowledge graph of workflow behavior + context backing runtime policy evaluation. Runs on Lakebase.
- systems/pageserver-safekeeper — the Neon-lineage page + WAL durable storage tier Lakebase inherits.
- systems/aws-kms / systems/azure-key-vault / systems/google-cloud-kms — the three cloud KMSes Lakebase's Customer-Managed Keys feature integrates with.
- systems/unity-ai-gateway — productised AI-gateway for coding agents + MCP governance (launched 2026-04-17). Three pillars: centralised audit in Unity Catalog, single-bill cost control via Foundation Model API + BYO external capacity, OpenTelemetry → UC-Delta-table observability. Clients ready at launch: Cursor, Codex CLI, Gemini CLI, with Claude Code via MLflow 3 tracing.
- systems/databricks-foundation-model-api — first-party inference for OpenAI/Anthropic/Gemini/Qwen underneath Unity AI Gateway; BYO external capacity supported.
- systems/databricks-join-order-agent — UPenn-collaboration research prototype (2026-04-22): frontier LLM agent as an offline join-order tuner for the Databricks query engine. Single tool (execute_plan), 50-rollout budget, grammar-constrained structured output, best-of-N selection. On the JOB benchmark over 10× scaled IMDb: 1.288× geomean speedup, 41% P90 drop — outperforming perfect cardinality estimates, smaller LLMs, and the classical BayesQO baseline.
- systems/join-order-benchmark-job — the canonical 113-query IMDb-based academic benchmark Databricks' join-order agent is evaluated on; scaled 10× via row-duplication for the Databricks experiment.
- systems/bayesqo — Bayesian-optimization-based offline query-plan optimizer (Postgres-oriented prior art) used as the comparison baseline.
- systems/databricks-glow — Databricks' open-source distributed genomics toolkit on Spark (VCF / BGEN / PLINK → Delta tables). Named 2026-04-22 as the genomics-modality ingestion tool inside the lakehouse-as-multimodal-substrate pattern.
- systems/lakeflow-spark-declarative-pipelines — Databricks' declarative streaming-ETL layer; @dp.table + @dp.materialized_view decorators on pyspark.pipelines for streaming tables with schema evolution + late events + continuous aggregation. Named 2026-04-22 as the wearables-streaming tool inside the same pattern.
- systems/mosaic-ai-vector-search — Databricks' managed vector search over governed Delta tables; indexes imaging-derived feature embeddings for similarity queries ("find similar phenotypes within glioblastoma") without moving vectors out of Unity Catalog's governance domain.
- systems/databricks-autocdc — declarative CDC / SCD API inside Lakeflow SDP: dp.create_auto_cdc_flow with keys, sequence_by, apply_as_deletes, stored_as_scd_type parameters. Replaces 40–200+ lines of hand-rolled MERGE logic with ~6–10 lines of declarative definition. Supports CDF sources, SCD Type 1 / Type 2, and snapshot-diff inference. Runtime gains since Nov 2025: 71% perf-per-dollar SCD Type 1, 96% SCD Type 2. Adopters include Navy Federal Credit Union (billions of events/day), Block, Valora Group. Named 2026-04-22.
- systems/databricks-genie-code — Databricks' AI-assisted pipeline-generation product. Positioned 2026-04-22 as the LLM-codegen client that produces AutoCDC declarations rather than raw MERGE logic, so AI-generated pipelines inherit AutoCDC's bounded-correctness envelope.
- systems/databricks-genie — Databricks' state-of-the-art data agent for natural-language analytics over enterprise data (structured + unstructured). 2026-04-29 disclosed empirically via Trinity Industries (>1,000 questions/month, three-stage adoption curve). 2026-05-08 disclosed at architecture-level: three named advances (specialised knowledge search with up-to-40% table-discovery benefit; parallel thinking via multi-trajectory sampling + aggregation as the structural response to the verifiable-test gap; Multi-LLM with per-sub-agent assignment + GEPA-optimised prompts). Headline: 32% → over 90% accuracy vs "a leading coding agent" on Databricks' internal benchmark, simultaneously on accuracy + cost + latency. Operates via the four-phase trajectory (discovery → investigation → self-correction → verification). Architecturally distinct from coding agents — see concepts/data-agent-unique-challenges.
- systems/gepa-prompt-optimizer — prompt-optimisation method (arXiv 2507.19457) referenced by Genie's Multi-LLM advance as the per-(LLM, sub-agent) prompt optimisation tool that closes accuracy gaps for smaller / faster sub-agent models. First wiki citation 2026-05-08.
- systems/apache-datasketches — Apache Software Foundation library of production-grade probabilistic data structures (KLL, Theta, approx top-K, Tuple). Databricks exposes it as first-class SQL / DataFrame / Structured Streaming aggregates in the 2026-04-29 launch, backing dashboards / analytics workloads that accept 1–2% relative error in exchange for orders-of-magnitude compute reduction. Community contribution: Christopher Boumalhab implemented the Theta + Tuple sketch function families in upstream Apache Spark.
- systems/databricks-postgres-cli — the databricks postgres generate-database-credential command family inside the broader Databricks CLI; mints scoped, short-lived OAuth JWTs for Lakebase endpoints. Canonical first wiki integration datum 2026-04-30: Thoughtworks Backstage POC used a 50-minute cron wrapping this command to bridge Backstage's long-lived-credential expectation against Lakebase's short-lived-JWT auth posture.
- systems/backstage — Spotify's open-source Internal Developer Portal framework (CNCF incubating); its Postgres-default + Knex-migration + notoriously-fragile-schema shape makes it the representative state-heavy-application migration-stress-test for Lakebase. Thoughtworks POC 2026-04-30 migrated a Backstage install off standard Postgres onto Lakebase to demonstrate cheap branching + PITR at 63 MB catalog / 1.09-second branch / 3.78-second recovery.
- systems/thoughtworks-technology-radar — Thoughtworks' twice-yearly industry-guidance publication; endorsed Backstage as IDP foundation, motivating the 2026-04-30 Backstage-on-Lakebase POC.
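
The AutoCDC entry above names four declarative parameters (keys, sequence_by, apply_as_deletes, stored_as_scd_type). A minimal pure-Python sketch of the SCD Type 1 semantics those names suggest is given here; it is not the Lakeflow runtime or its API. The function apply_scd1, the event fields (id, seq, op, name) and the lambda delete predicate are all hypothetical, chosen only to show why a sequence column makes the result independent of arrival order.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Sequence, Tuple

Row = Dict[str, Any]


def apply_scd1(
    events: Iterable[Row],
    keys: Sequence[str],
    sequence_by: str,
    apply_as_deletes: Optional[Callable[[Row], bool]] = None,
) -> Dict[Tuple[Any, ...], Row]:
    """Hypothetical helper: fold change events into the latest row per key."""
    # key -> (highest sequence value seen, row or None when the key is deleted)
    state: Dict[Tuple[Any, ...], Tuple[Any, Optional[Row]]] = {}
    for event in events:
        key = tuple(event[column] for column in keys)
        seq = event[sequence_by]
        seen = state.get(key)
        if seen is not None and seen[0] >= seq:
            continue  # late or duplicate event: a newer change already won
        deleted = apply_as_deletes is not None and apply_as_deletes(event)
        # Keep a tombstone for deletes so an older update cannot resurrect the row.
        state[key] = (seq, None if deleted else event)
    return {key: row for key, (_, row) in state.items() if row is not None}


# Illustrative, made-up change feed that arrives out of order with one duplicate.
events = [
    {"id": 1, "seq": 2, "op": "UPDATE", "name": "b"},
    {"id": 1, "seq": 1, "op": "INSERT", "name": "a"},
    {"id": 2, "seq": 1, "op": "INSERT", "name": "x"},
    {"id": 2, "seq": 3, "op": "DELETE", "name": None},
    {"id": 2, "seq": 2, "op": "UPDATE", "name": "y"},
    {"id": 1, "seq": 2, "op": "UPDATE", "name": "b"},
]

table = apply_scd1(
    events,
    keys=["id"],
    sequence_by="seq",
    apply_as_deletes=lambda e: e["op"] == "DELETE",
)
```

Replaying the same events in any order yields the same table, which is the ordering and idempotency behaviour a hand-rolled MERGE has to encode explicitly.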
Key patterns / concepts¶
- patterns/shard-replication-for-hot-keys — isolate a hot key into its own slice, replicate that slice across N pods (Dicer's answer to concepts/hot-key).
- patterns/state-transfer-on-reshard — migrate per-key state across pods during assignment changes so caches survive rolling restarts (Softstore).
- concepts/dynamic-sharding — continuously-adjusted Assignment driven by health + load signals; Dicer's core primitive.
- concepts/static-sharding / concepts/hot-key / concepts/split-brain — the three structural failure modes Dicer was built to replace.
- concepts/eventual-consistency — Dicer's Assignment consistency model; trade-off vs. Slicer / Centrifuge leases.
- concepts/soft-leader-election — key-affinity-as-coordinator, one of Dicer's named use cases.
- patterns/proxyless-service-mesh — mesh capabilities (discovery, L7 LB, health-aware routing, zone-affinity) via shared library instead of sidecars. Rejected Istio / Ambient Mesh explicitly.
- patterns/power-of-two-choices — the default LB algorithm embedded in Armeria clients.
- patterns/zone-affinity-routing — with capacity/health-driven spillover.
- patterns/slow-start-ramp-up — introduced after client-side LB surfaced cold-start issues on fresh pods.
- patterns/tool-decoupled-agent-framework — DsPy-inspired; tools-as-functions with docstrings, prompts and tools swap independently.
- patterns/snapshot-replay-agent-evaluation — production-state snapshots replayed through candidate agent configs, scored by a judge LLM — Storex's regression harness.
- patterns/specialized-agent-decomposition — per-domain agents (DB, traffic, …) collaborating on root-cause analysis.
- patterns/hackathon-to-platform — 2-day prototype → user-feedback iterations → platform; Storex's on-ramp.
- concepts/client-side-load-balancing — overall architectural posture for internal RPC.
- concepts/layer-7-load-balancing — why they left kube-proxy.
- concepts/xds-protocol — the control-plane-to-data-plane contract, used beyond sidecar meshes here.
- concepts/control-plane-data-plane-separation — EDS vs. client/Envoy.
- concepts/tail-latency-at-scale — the observed symptom that motivated the proxyless redesign.
- concepts/central-first-sharded-architecture — Storex's foundation: global coordinator + regional shards for data-residency + one auth model across 3 clouds / hundreds of regions / 8 regulatory domains.
- concepts/llm-as-judge — scoring primitive inside Storex's validation framework.
- concepts/data-mesh — Unity Catalog + Delta Sharing position as the Databricks-opinion answer to the mesh shape, with domain-owned data products + central governance + open exchange protocol.
- concepts/hub-and-spoke-governance — the UC governance posture; central catalog + federated spoke data, one policy surface across clouds/regions.
- concepts/cross-cloud-architecture / concepts/egress-cost — the forcing functions behind the Mercedes-Benz mesh design.
- patterns/cross-cloud-replica-cache — Delta Sharing + Delta Deep Clone as the canonical shape for bulk cross-cloud consumers.
- patterns/chargeback-cost-attribution — bytes-at-the-sync-tier → producer-side billing dashboard; the governance hygiene layer on top of the replica-cache pattern.
- concepts/envelope-encryption — the three-level CMK → KEK → DEK hierarchy the Lakebase CMK rollout articulates cleanly.
- concepts/cmk-customer-managed-keys — customer-held root of trust model; Lakebase realises it across both its storage and compute tiers.
- concepts/cryptographic-shredding — Lakebase revocation semantics: unwrap fails → data cryptographically inaccessible; Manager terminates compute VMs.
- patterns/per-boot-ephemeral-key — Postgres compute VMs generate a per-boot ephemeral key that dies with the instance; pairs with concepts/stateless-compute scale-to-zero.
- concepts/compute-storage-separation — Lakebase's Pageserver + Safekeeper vs ephemeral Postgres compute split is a canonical OLTP-shape instance (cf. Aurora DSQL, Snowflake).
- concepts/coding-agent-sprawl — named problem class: engineering orgs simultaneously running Cursor + Codex + Claude Code + Gemini CLI + … ; Databricks itself is the stated example.
- concepts/centralized-ai-governance — three-pillar framing (security/audit + single bill + Lakehouse observability) that Unity AI Gateway instances, paralleling Cloudflare's internal-stack shape with different substrates.
- patterns/ai-gateway-provider-abstraction — Unity AI Gateway is the Databricks instance, specialised for coding-agent + MCP governance.
- patterns/central-proxy-choke-point — architectural posture: all coding-tool traffic funnels through one gateway, no second path to providers.
- patterns/unified-billing-across-providers — Foundation Model API as first-party default + BYO external capacity = one bill + per-developer (not per-tool) budgets.
- patterns/telemetry-to-lakehouse — coding-tool OpenTelemetry → UC-managed Delta tables; joinable with Workday / PR-velocity / capacity-planning data.
- concepts/join-order-optimization — the decades-old query-planner subproblem the 2026-04-22 research prototype targets.
- concepts/cardinality-estimation — named as "as difficult as executing the query itself"; the weakest leg of the three-component optimizer decomposition.
- concepts/llm-agent-as-query-optimizer — the architectural pattern the Databricks/UPenn agent canonicalises.
- concepts/offline-query-tuning-loop — the human-DBA-workflow shape LLM agents can automate.
- concepts/anytime-optimization-algorithm — the algorithmic family rollout-budgeted agents belong to.
- concepts/exploration-exploitation-tradeoff-in-agent-search — per-rollout allocation decision inside the agent.
- concepts/like-predicate-cardinality-estimation-failure — canonical LIKE-predicate blind spot illustrated by JOB query 5b.
- patterns/llm-agent-offline-query-plan-tuner — full pattern: one tool, rollout budget, grammar-constrained output, best-of-N.
- patterns/structured-output-grammar-for-valid-plans — grammar constraints ensure every rollout lands on a semantically-legal plan.
- patterns/rollout-budget-anytime-plan-search — bound search by rollout count → monotonic best-so-far, time-budget knob.
- patterns/governed-delta-tables-per-modality — the named remedy to the specialty-store-per-modality anti-pattern: every modality (genomics, imaging, notes, wearables) lands in governed Delta tables under one Unity Catalog surface; modality-specific tooling layers above the substrate rather than defining a separate stack per modality.
- patterns/fusion-strategy-selection-by-deployment-reality — pick multimodal fusion (early / intermediate / late / attention) from deployment-reality axes (modality availability, dimensionality balance, temporal dynamics); late fusion is the "safe start" default when missingness is expected.
- concepts/missing-modality-problem — "missingness isn't an edge case — it's the default"; Databricks canonicalises sparse-modality deployments as a first-class sysdesign concern.
- concepts/modality-masking-during-training — training-time regulariser that drops modality inputs to simulate deployment sparsity.
- concepts/early-fusion / concepts/intermediate-fusion / concepts/late-fusion / concepts/attention-based-fusion — the four fusion strategies each paired with a deployment-reality trigger.
- patterns/declarative-cdc-over-hand-rolled-merge — declare CDC / SCD semantics (keys, sequence column, delete predicate, SCD type); runtime implements ordering, dedup, version management, idempotency. Canonical 2026-04-22 pattern: AutoCDC over hand-rolled MERGE.
- concepts/snapshot-diff-inference-cdc — third CDC ingest mode on the wiki (after log-based and CDF); the runtime infers deltas from consecutive whole-table snapshots. First-class input mode in AutoCDC.
- concepts/out-of-sequence-cdc-event-handling — sequence_by column as the declarative primitive for CDC ordering independent of arrival; canonicalises a concern separate from idempotency.
- concepts/probabilistic-data-structure / concepts/mergeable-sketch — Databricks' 2026-04-29 sketch-functions launch canonicalises mergeability as the architectural unlock that turns sketches from "a faster percentile" into a storage primitive; every sketch in the launch is mergeable across partitions, time windows, and streaming micro-batches.
- concepts/kll-quantile-sketch / concepts/theta-sketch / concepts/approximate-top-k-sketch / concepts/tuple-sketch — the four new first-class SQL / DataFrame / Structured Streaming aggregates from the 2026-04-29 launch, each replacing a specific exact-query failure mode (global sort, full ID shuffle, cluster-wide reduction, composed GROUP BY).
- concepts/decision-support-vs-audit-query — the architectural classifier the 2026-04-29 post introduces explicitly to scope where sketch primitives apply. "When to use sketches: Dashboards, trend analysis, monitoring, marketing attribution. When to stay exact: Financial auditing, compliance reporting."
- patterns/precomputed-sketch-column-in-delta-table — the intended workflow for the 2026-04-29 launch: build sketches once during ETL, store as BLOB columns in Delta tables, merge on read. Converts a weekly-trending dashboard from a billion-row scan into a 168-sketch merge.
- patterns/set-algebra-on-theta-sketches — audience overlap / incrementality / exclusive reach via union / intersection / difference on kilobyte-scale Theta sketches; microsecond set operations where exact computation requires a cluster-wide shuffle of user IDs.
- concepts/compute-storage-separation — Lakebase's Pageserver + Safekeeper vs ephemeral Postgres compute split is a canonical OLTP-shape instance (cf. Aurora DSQL, Snowflake). Second production datapoint via LangGuard (2026-04-27): burst-driven agentic workload exploits the "compute attaches with no data movement" property for scale-to-zero without data cold-start.
- concepts/agentic-workflow-governance — runtime control infrastructure for autonomous agent workflows; canonicalised via the LangGuard 2026-04-27 profile. Visibility-gap framing (autonomous-agent logic generated on the fly bypasses conventional SIEM-shape audit).
- concepts/runtime-policy-enforcement — synchronous allow/deny/modify gate before action execution; the LangGuard control primitive.
- concepts/agent-behavioral-baseline — learned baseline of agent behavior from historical trace data, used for anomaly detection; stated roadmap at LangGuard profile time.
- concepts/bursty-query-pattern — LangGuard's agentic-workload trace+enforcement traffic extends this concept to the combined write-burst + read-burst shape distinct from the OLAP read-burst framing.
- concepts/scale-to-zero — canonically articulated by LangGuard team as the QRadar-era missing capability for bursty security-telemetry workloads.
- concepts/database-branching — Lakebase's instant copy-on-write branching exploited by LangGuard for governance policy testing — first wiki instance of the primitive at policy-validation altitude.
- concepts/copy-on-write-storage-fork — Lakebase/Neon lineage as second canonical wiki instance after Aurora blue/green; governance-policy-testing as a new use case axis.
- patterns/runtime-governance-enforcement-layer — the shape LangGuard canonicalises at agentic-workflow altitude; inline gate on every agent action backed by live knowledge-graph context.
- patterns/policy-testing-via-database-branching — clone production trace data via copy-on-write branching in seconds, test new governance policies against real agent behavior in isolation, discard branch.
- concepts/agent-provisioned-database — the 2026-04-29 Lakebase launch-partner announcement for Stripe Projects makes Databricks the second launch-side provider in the agent-provisioning protocol. Lakebase/Neon becomes the canonical first-instance of this new concept — a database-tier sibling of concepts/agent-provisioned-account with sub-350 ms provisioning + scale-to-zero + copy-on-write branching as the three-pillar substrate contract.
- patterns/partner-managed-service-as-native-binding — Databricks/Neon via Stripe Projects is the third known-use instance of this pattern (after Cloudflare/PlanetScale + Fly.io/Tigris) and the first agent-as-customer instance with a payments-platform orchestrator rather than a compute-platform orchestrator.
- concepts/point-in-time-recovery — canonically disclosed at Lakebase altitude 2026-04-30 with 3.78-second end-to-end recovery for a 32-row deletion incident. On a copy-on-write-capable substrate, PITR collapses into a copy-on-write storage fork at a past timestamp; same operation as branching with a different time parameter.
- concepts/wal-record-granularity — PITR target-times snap backward to the nearest WAL record (12-second snap-back demonstrated on Lakebase POC); a structural property, not a bug, but load-bearing for time-sensitive recovery.
- concepts/mock-object-maintenance-cost — the 20-30%-of-test-code maintenance burden Thoughtworks argues cheap Lakebase branching eliminates. Test infrastructure that diverges from production + produces false confidence.
- concepts/integration-tests-against-real-database — the testing discipline cheap branching re-enables: full real-DB behaviour in tests (constraints, transactions, query-planner, lock ordering) rather than mocked surrogates.
- concepts/oauth-jwt-short-lived-credential — Lakebase's data-plane auth posture; classic Databricks PATs are rejected, short-lived OAuth JWTs minted by databricks postgres generate-database-credential are required.
- patterns/branching-is-pitr-with-time-now — architectural unification canonicalised at Lakebase altitude 2026-04-30: branching and PITR are the same primitive with different source_branch_time. Same control-plane call, same storage substrate, same compute-attach step; latency envelopes confirm (1.09 s vs 3.78 s, same order of magnitude).
- patterns/database-branch-per-test-over-mocking — CI / QA / IDE workflow pattern that replaces database-interface mocks with per-test / per-PR / per-developer database branches when branching is sub-second + free. Canonical instance: Thoughtworks Backstage POC on Lakebase.
- patterns/credential-refresh-cron-as-auth-compat-shim — pragmatic pattern for bridging short-lived-JWT data planes to applications expecting long-lived credentials; Thoughtworks POC used a 50-minute cron rewriting DATABRICKS_TOKEN in .env.
- systems/lakehouse-federation — UC's query-federation surface that exposes external systems as foreign catalogs governed by UC. First wiki canonical instance (2026-05-15): Backstage Postgres on Lakebase exposed as lakebase_bs foreign catalog; standard UC GRANTs replace Postgres native grants, removing the operational↔analytical security-paradigm split.
- systems/lakebaseops — open-source Thoughtworks-built Databricks App for Lakebase DBA automation. Three agents (Provisioning, Performance, Health) replace 51 historical DBA tickets; seven scheduled Databricks Jobs replace pg_cron; monitoring UI surfaces live pg_stat metrics + slow-query regressions + branch TTL enforcement + 9-KPI adoption dashboard; migration wizard scores ten source engines (Aurora, RDS, Cloud SQL, AlloyDB, Cosmos DB, others) with live AWS+Azure pricing. Inherits governance from UC GRANT + audit trail. Repo: github.com/suryasai87/lakebase-ops-platform.
- systems/lakebase-mcp — open-source Thoughtworks-built MCP server exposing 46 tools to MCP-capable AI agents (Claude, Copilot, GPT) for Lakebase Postgres access. Dual-layer governance: SQL-statement guard + per-tool access guard across four pre-built profiles (read_only, analyst, developer, admin) mapping onto the same UC GRANT model; "a coding assistant runs as read_only and physically cannot drop a table." Per-statement tool-tag attribution makes "which agent on which branch generated the 4 AM CPU spike?" a one-SQL query. Repo: github.com/suryasai87/lakebase-mcp.
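
The mergeable-sketch and precomputed-sketch-column entries in the list above rest on one property: a sketch built per partition or per hour can be merged later and give the same answer as a sketch built over all the data at once. The following is a rough illustration of that property using a simplified k-minimum-values (KMV) distinct-count sketch. It is a stand-in for the idea, not the Apache DataSketches Theta implementation and not the Databricks SQL functions; the class name KMVSketch, the choice of k, and the synthetic hourly user IDs are all assumptions made for the example.

```python
import hashlib
from functools import reduce
from typing import Iterable, List


class KMVSketch:
    """Simplified k-minimum-values sketch: keeps the k smallest hash values."""

    def __init__(self, k: int = 256, hashes: Iterable[float] = ()) -> None:
        self.k = k
        # Retain only the k smallest distinct hash values, in ascending order.
        self.hashes: List[float] = sorted(set(hashes))[:k]

    @staticmethod
    def _hash(item: object) -> float:
        # Map an item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    @classmethod
    def from_items(cls, items: Iterable[object], k: int = 256) -> "KMVSketch":
        return cls(k, (cls._hash(item) for item in items))

    def union(self, other: "KMVSketch") -> "KMVSketch":
        if self.k != other.k:
            raise ValueError("sketches must share the same k to be merged")
        # The k smallest hashes of a union are among the retained hashes of the parts.
        return KMVSketch(self.k, set(self.hashes) | set(other.hashes))

    def estimate(self) -> float:
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # fewer than k distinct items: exact
        return (self.k - 1) / self.hashes[-1]


# Synthetic data: 24 hourly batches of overlapping user IDs covering 0..2599.
hourly = [KMVSketch.from_items(range(h * 100, h * 100 + 300)) for h in range(24)]
merged = reduce(KMVSketch.union, hourly)         # merge-on-read over stored sketches
direct = KMVSketch.from_items(range(0, 2600))    # one pass over all raw IDs
```

Here merged and direct retain identical hash values, so a dashboard can merge a handful of small stored sketches instead of rescanning raw rows; the estimate itself remains approximate once more than k distinct items have been seen.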
Recent articles¶
- 2026-06-12 — sources/2026-06-12-databricks-enabling-evolutionary-database-development-database-branchin-part3 (Tier-3; passes scope: team-scale database branching architecture — tier topology as long-running branches, permission model with policy-enforced governance, DBA-to-platform-engineer evolution, SCM state machine with blocking gates for agent governance, TDD layer with per-role agents. Introduces systems/lakebase-app-dev-kit.)
- 2026-06-11 — sources/2026-06-11-databricks-ingesting-the-milky-way-petabyte-scale-with-zerobus-ingest (Tier-3; passes scope: deep streaming architecture with dynamic partitioning internals, zero-copy protobuf decoder (Zeroparser, ~1 GB/s/core, Rust, OSS), latency-optimized WAL, petabyte-scale benchmark at 12 GB/s sustained / 1.04T rows / 2,048 streams)
- 2026-06-10 — sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model (Tier-3; passes scope: model-serving infrastructure architecture with two-axis autoscaling internals, cold-start mitigation via warm pools, operational numbers at 300K+ QPS / 99.99% availability / 10→10K QPS in <60s, architectural comparison of request-based vs resource-based scaling)
- 2026-06-05 — sources/2026-06-05-databricks-enabling-evolutionary-database-development-database-branchin (Tier-3; passes scope: Part 2 of the Evolutionary Database Development series — CI/CD workflow architecture with GitHub Actions templates, new patterns: expand-and-contract, A/B variant prototyping, destructive testing; idempotent migration as hard requirement; DBA role as async PR reviewer)
- 2026-06-03 — sources/2026-06-03-databricks-apache-spark-real-time-mode-for-gaming (Tier-3; passes scope: streaming architecture with stateful processing, transformWithState operator internals, production numbers at 4M sessions / 500K events-min / 432ms p99, architectural comparison vs Flink and custom actor systems)
-
2026-06-01 — sources/2026-06-01-databricks-debunking-8-data-layout-myths-why-liquid-clustering-outperfo (Tier-3 Databricks Blog post; passes scope decisively despite consultative-listicle shape because the technical content is dense — transaction-log-based pruning mechanism, per-file min/max statistics powering both file skipping and metadata-only operations, OPTIMIZE engineering improvements with specific 12h→23min planning numbers, three production case studies with concrete numbers, structural critique of Z-Order rewrites, row-level vs file-level concurrency distinction, co-clustered join shuffle elimination). Debunking 8 data layout myths: why Liquid Clustering outperforms partitioning. Most architecturally dense Liquid Clustering disclosure on the wiki — each of eight defender-of-partitioning arguments paired with the verbatim reality, plus three production case studies at PB scale. Eight first-class wiki primitives canonicalised: concepts/over-partitioning (failure rate disclosed: "more than 75% of cases" on the Databricks customer base; three mistake classes — high cardinality, wrong column, over-fine granularity); concepts/file-level-data-skipping (load-bearing architectural fact: "directory-pruning does not exist on modern open table formats like Delta and Iceberg" — pruning is per-file via transaction-log statistics); concepts/metadata-only-operation (DELETE / COUNT / DISTINCT / GROUP BY computed from per-file min/max stats; ~90% faster metadata-only DELETEs, up to 27× aggregate speedups); concepts/row-level-concurrency ("two writers updating different rows no longer conflict, even if those rows live in the same file"; partition-as-write-boundary is a workaround for an older concurrency model); concepts/z-ordering (the older clustering technique with two structural problems — "poor clustering quality" + "unnecessary rewrites" — superseded by Liquid Clustering); concepts/multi-dimensional-clustering (clustering on
(date, hour, source, id) simultaneously, impossible under partitioning's cardinality limits); concepts/co-clustered-join (Private Preview shuffle elimination; ~51% faster, 87% less shuffle data on a real benchmark); concepts/low-cardinality-clustering-optimization (per-file=single-low-cardinality-value layout with high-cardinality nested sort; 35% lower clustering time, 22% faster query times benchmark). Four new patterns: patterns/clustering-keys-as-engine-input ("Liquid treats clustering keys as input that the engine uses to guide optimal file organization. Keys can be changed at any time" — the layout-as-implementation-detail thesis); patterns/incremental-clustering-on-write ("Liquid clusters incrementally, including at write time, so the layout stays optimal without unnecessary rewrites" — bounds maintenance cost to new-data volume rather than table size); patterns/in-place-partitioned-to-clustered-conversion (ALTER TABLE .. REPLACE PARTITIONED BY WITH CLUSTER BY — Private Preview, validated by Bolt's zero-downtime CDC migration); patterns/replace-using-and-replace-on-for-selective-overwrite (REPLACE USING / ON layout-agnostic, compute-agnostic — works on any layout and any compute, unlike partitioning-tied Dynamic Partition Overwrite). One new system page: systems/arctic-wolf-security-telemetry-table — production 3.8+ PB security telemetry table ingesting 1+ trillion events per day, 90-day queries 51s → 6.6s (7.7×) post-migration to Liquid Clustering on UC managed tables with Predictive Optimization.
Three production case studies with operational numbers: Arctic Wolf (3.8+ PB; 1T+ events/day; 7.7× query speedup; file count 4M → 2M; data freshness hours → minutes); Bolt (TB-scale CDC; +138% write throughput; −21% avg / −63% max read time; zero downtime alongside live ingestion via in-place Liquid Conversion); Databricks-internal (1.1 PB → 0.8 PB / −27% storage; 5.9× wall-clock speedup; 86% bytes-read reduction across 16 representative production queries after re-clustering on (date, hour, source, id) from partition-by (date, hour)). OPTIMIZE engineering disclosure: planning phase 12h → 23m on 10 PB tables (31×); execution phase 5× faster on Medium DBSQL clusters — the engineering work behind Liquid Clustering's PB-scale viability. Forward-looking: co-clustered joins (Private Preview, "~51% faster, 87% less shuffle data" on Liquid-to-Liquid joins) + in-place Liquid Conversion (Private Preview SQL surface disclosed). Existing pages extended: systems/liquid-clustering gains comprehensive Eight myths debunked + Production case studies at PB scale + OPTIMIZE engineering improvements + Forward: co-clustered joins sections; systems/delta-lake gains a twelfth face (transaction-log-based-pruning) as the load-bearing architectural fact; systems/databricks-predictive-optimization gains the PB-scale role disclosure (Arctic Wolf attribution); concepts/automatic-table-optimization gains the OPTIMIZE engineering improvements; concepts/write-amplification gains Z-Order's periodic-rewrite cost geometry as a wiki-canonical instance at the lakehouse table-layout altitude.
Caveats: consultative-listicle framing for a Tier-3 source; the 75% over-partitioning rate is unscoped (corpus / methodology not disclosed); benchmark numbers (22% / 35%) cite "a real-world data warehousing benchmark" without naming the benchmark; Arctic Wolf attribution between Liquid Clustering / Predictive Optimization / UC managed tables / Delta is not separately quantified; co-clustered joins and in-place Liquid Conversion are Private Preview without GA timelines; Z-Order critique is asymmetric (post elides cases where Z-Order remains workable); no QPS / concurrency / lock-contention numbers for the named customers. Frontmatter ingested: flipped false → true.
-
2026-05-29 — sources/2026-05-29-databricks-enabling-evolutionary-database-development-database-branching-with-lakebase (Tier-3 Databricks Blog — Part 1 of a three-part Evolutionary Database Development series; passes scope despite narrative-driven shape because Tier-3 source names a real production substrate, frames Practice #4 as the constraint that the substrate lifts, names a public open-source IDE extension, and discloses the four-validation CI flow). Enabling Evolutionary Database Development: database branching with Lakebase. First wiki canonicalisation of evolutionary-database-design as a discipline — Martin Fowler's 2003 essay → Pramod Sadalage's 2006 Refactoring Databases with 70+ named refactorings → Humble & Farley's 2010 Continuous Delivery Chapter 12 ("Managing Data") → 2026 Lakebase substrate change. Load-bearing methodology argument: Practice #4 — "everybody gets their own database instance" has stayed aspirational for twenty years because per-developer production-shaped databases cost time, money, and DBA cycles; the compensating layer that emerged (mock objects, in-memory DB substitutes like H2/SQLite, shared staging environments, DBA ticket queues) "became foundational methodology by default, not by design." Verbatim canonicalising claim: "In 2026, copy-on-write database branching arrives in Databricks Lakebase. A one-second, zero-storage-at-creation branch of a terabyte-scale production database is now an O(1) operation. The constraint that kept Practice #4 aspirational has lifted." Three load-bearing properties of the developer-DB instance canonicalised: fast (created when needed) + realistic (same Postgres engine, same governance, production-shaped data) + isolated (experiments don't interrupt anyone) — each historical compensating-layer alternative violates at least one. Re-uses Fowler's 2003 protagonist Jen + the Split Column refactoring as the worked example — "Same Jen. Same refactoring. What changed is the capability."
CI flow canonicalised verbatim: "CI does what Jen just did, but for the team: it creates its own temporary Lakebase branch, applies the migration, runs the application test suite, runs database tests against the migrated schema, validates the migration itself (applies cleanly, idempotent, reversible), and posts a schema-diff comment on the PR showing exactly which database objects changed." Four-validation bundle (applies cleanly + idempotent +
reversible + application tests) is the substrate change that absorbs the breakage-class question previously held by the DBA. DBA reframe canonicalised as the role-evolution payoff: "the DBA can review on their schedule, not Jen's... improve the solution around data integrity, indexing strategy, future extensibility or long-term maintainability, not on the protective gatekeeping that used to take all their time." Migration tools cited as platform-agnostic: Flyway, Liquibase, Alembic, Knex, Prisma — substrate change is orthogonal to the tool ecosystem. New named system: systems/lakebase-scm-extension — public open-source VS Code / Cursor IDE extension at github.com/databricks-solutions/lakebase-scm-extension that synchronises a developer's git branch with a matching Lakebase database branch and surfaces the Branch Diff Summary view. Forward references: Part 2 (Jen's New Playbook — copy-on-write internals + methodology optimisations); Part 3 (Jen's Team at Scale — 50-developer governance + agent-in-the-loop + DBA re-deployment); Companion: Plugin Walkthrough (Lakebase SCM Extension end-to-end); Lakebase App Dev Kit for agents with companion ebook. New wiki entities: 4 concepts (concepts/evolutionary-database-design, concepts/practice-4-everybody-gets-their-own-database-instance, concepts/database-development-compensating-layer, concepts/dba-as-design-collaborator); 3 patterns (patterns/per-developer-database-branch-paired-with-code-branch, patterns/ci-ephemeral-database-branch-with-schema-diff-comment, patterns/migration-script-travels-with-application-code); 1 system (systems/lakebase-scm-extension). 
Existing pages extended: systems/lakebase gains the evolutionary-database-development capability slice as the newest section under "Capabilities surfaced so far"; concepts/database-branching gains a fifth canonical use-case axis (after PlanetScale schema-change, LangGuard governance-policy-testing, Stripe Projects agent-operation, Backstage state-heavy-application IDE/CI/QA) — methodology- substrate fit; concepts/copy-on-write-storage-fork gains a fifth canonical instance with the methodology arc; patterns/database-branch-per-test-over-mocking gains a second canonical instance pairing the mock-replacement axis to the Practice #4 substrate-shift framing. Sibling-cluster cross-refs: positioned as methodology-arc complement to the Backstage POC (substrate disclosure) + the Backstage Part 2 (governance composition) — same Lakebase branching primitive, three different framings (technical POC; governance composition; methodology arc). Sibling to LangGuard (governance-policy-testing axis) and Stripe Projects (agent-operation axis) — the methodology arc unifies all three under "what becomes operational when Practice #4 is finally affordable." Sibling to PlanetScale's branch-based schema-change workflow as the earlier production-realisation of the same constraint-lift (PlanetScale 2021 vs Databricks 2026); PlanetScale puts the branching at the deploy-time layer with a deploy-request + queue + traffic-aware-throttler, the Lakebase flow puts it at the IDE / CI / per-PR layer with same-PR migration scripts. 
Caveats: narrative-driven shape; no new architectural disclosure beyond prior Lakebase posts (the methodology arc is the contribution); no SCM Extension internals; no multi-developer scaling disclosure (deferred to Part 3); no agent-substrate detail (deferred to App Dev Kit); no quantitative cost / billing detail beyond "zero storage at creation"; tool list (Flyway/Liquibase/Alembic/Knex/Prisma) is platform-agnostic by design but means engineers don't get tool-selection guidance. Frontmatter
`ingested:` flipped false → true. wiki/index.md auto-regenerated by build watcher (Distilled +1; Concepts +4; Systems +1; Patterns +3; Companies 40 unchanged; Latest additions list will prepend the new source). -
2026-05-29 — sources/2026-05-29-databricks-databricks-at-sigmod-2026 (Tier-3 Databricks Blog conference-announcement post — passes scope despite announcement-shape because each named architectural claim is real and consequential, with arXiv-level paper backing). Databricks at SIGMOD 2026. Short corporate-blog post that nevertheless discloses the first publicly named architecture of Databricks' incremental-view-maintenance engine — Enzyme — which powers the materialized-view track of Spark Declarative Pipelines (SDP). Two papers announced: SIGMOD 2026 honorable-mention "Enzyme: Incremental View Maintenance for Data Engineering" (arXiv:2603.27775; presented by Ritwik Yadav) and VLDB 2026 "A Decade of Apache Spark Structured Streaming: How We Evolved the Architecture To Meet Real-world Needs". First wiki disclosure of the SDP two-track model: "There are two ways to write incremental programs in Spark Declarative Pipelines (SDP), and customers can mix-and-match these within a pipeline." (a) Enzyme for declarative
`@dp.materialized_view`; (b) Structured Streaming for explicit stateful operators / watermarks / custom aggregations. MV-as-ETL thesis: "Materialized views (MVs) are popular for query acceleration… When creating SDP, we decided to go beyond query acceleration and apply materialized views to the extract-transform-load (ETL) use cases. Our key observation is that if MVs can be efficiently and incrementally maintained, it will significantly simplify ETL workloads which otherwise require writing complex custom code." Enzyme's four novel claims over prior industrial IVM: (1) full MV-grammar coverage — joins +
window functions + aggregations + combinations, all incrementally maintained in one engine; (2) non-deterministic functions —
`current_date()` and AI functions handled correctly under incremental maintenance, where most prior industrial IVM rejects them or recomputes in full; (3) multi-language MVs — Python + SQL, with change detection on Python MV definitions as a named open problem solved; (4) cost-model-driven incrementalisation — runtime choice between partition-level vs row-level updates, selective intermediate-result caching, cost model fed by plan information + prior execution statistics. Performance disclosure: relative speedup chart vs "another competing industry solution (name anonymized to CV-IVM due to licensing restrictions)"; absolute numbers / workload axes / ablations deferred to the paper. Bangalore named as "a large Databricks R&D hub" and SIGMOD 2026 host city. New systems (1): systems/enzyme-ivm (also disambiguates against the unrelated Airbnb React testing utility of the same name). New concepts (4): concepts/incremental-view-maintenance (parent technique, full page); concepts/materialized-view (foundational concept, full page); concepts/non-deterministic-mv-maintenance (the hardest IVM challenge, full page); concepts/multi-language-materialized-view (Python + SQL MVs, full page). New patterns (1): patterns/cost-model-driven-incrementalization-strategy (sibling to Spark AQE's plan-rewriting shape, applied to IVM rather than query execution). Extended pages (3): SDP gains a two-track-architecture section + new "Seen in" entry (now the third source, the canonical architecture-level reference); Structured Streaming gains VLDB 2026 forward-reference + new "Seen in" entry (first wiki source naming Structured Streaming's academic publication); Apache Spark gains an academic-publication face noting the two-paper SIGMOD/VLDB pair as the first wiki disclosure that modern Databricks Spark contributions are published as named academic artefacts at top systems venues. 
Sibling-cluster cross-refs: complements prior SDP-naming sources (multimodal, AutoCDC) by revealing the engine inside the decorator — those sources used `@dp.materialized_view` as a black box; this source names the IVM engine and its capabilities. Sibling to Octopus Energy MHHS on the trust-the-optimiser axis: AQE for query execution there, Enzyme cost-model for IVM strategy here. Caveats: announcement-shape, not architecture deep-dive; no absolute numbers (the only chart is relative speedup vs anonymised CV-IVM); mechanism for non-deterministic-function correctness not disclosed; Python-MV change-detection technique not disclosed; cost-model feature set not enumerated; relationship to Catalyst not described; workload-class boundaries (≥3-table joins, recursive CTEs, foreign vs managed Iceberg) not characterised; CV-IVM identity withheld under licensing. Frontmatter: raw `ingested: false → true`. Return contract: `ingested: wiki/sources/2026-05-29-databricks-databricks-at-sigmod-2026.md`. -
2026-05-28 — sources/2026-05-28-databricks-advancing-apache-iceberg-on-databricks-iceberg-v3-ga-open-sharing-and-unified-governance (Tier-3 Databricks Blog feature-roundup post — passes scope on shape-canonicalisation grounds despite the marketing-roundup framing because each named primitive is a real architectural element with consequences for the OTF ecosystem). Advancing Apache Iceberg on Databricks: Iceberg v3 GA, Open Sharing, and Unified Governance. Coordinated set of Iceberg capability releases that reposition Unity Catalog as a fully Iceberg- native catalog. Format-level GA: Iceberg v3 reaches GA on Databricks across managed / foreign / UniForm-enabled tables — three primitives ship together: deletion vectors (file-level row- delete representation; merge-on-read applied to deletes), row tracking (stable per-row identity supporting more efficient incremental processing), VARIANT type (standard representation for semi-structured data). All three already existed on the Delta side; v3 brings parity, with cross-format compatibility called out verbatim. Catalog-side primitives: Managed Iceberg (GA) — UC creates / reads / writes / governs Iceberg tables directly with Predictive Optimization + Liquid Clustering applying. Foreign Iceberg (GA) + Credential Vending for Foreign Iceberg (GA) — UC governs Iceberg tables managed in eight named external catalogs (AWS Glue, Snowflake Horizon, Hive Metastore, Apache Polaris, Salesforce Data Cloud, Google Cloud Lakehouse, Palantir, Workday) while leaving data in place; mints short-lived scoped credentials. External Sharing to Iceberg clients (GA) — Delta Sharing now emits Iceberg REST endpoints; recipients on Snowflake / Trino / Flink / Spark consume shared data via Iceberg-compatible clients without ingestion or copies. Plus External Sharing of Foreign Iceberg tables (Public Preview). 
Cross-engine ABAC (Beta) — UC ABAC policies evaluate during server-side scan planning via the Iceberg REST Catalog Scan Planning API (Iceberg 1.11). The catalog returns a filtered scan plan; the engine reads only authorised data. Compatible engines: Spark / DuckDB / any engine implementing the Iceberg-1.11 scan-planning client. Wiki's first canonical instance of scan-planning-as-policy-enforcement-point — extends UC ABAC beyond the Databricks-compute boundary. Iceberg-compatible materialized views (Gated Public Preview) — managed MVs exposed downstream as native Iceberg tables; syntax
`CREATE MATERIALIZED VIEW my_mv USING ICEBERG`. Forward-looking: Iceberg v4 + Delta 5.0 alignment on a shared "adaptive metadata tree" metadata structure (concepts/format-co-evolution-iceberg-delta) — wiki's first canonical disclosure of explicit OTF-format-convergence direction. New systems (2): systems/iceberg-v3 (canonical v3 milestone page); systems/iceberg-rest-catalog-scan-api (the Iceberg-1.11 scan-planning client surface). New concepts (6): concepts/deletion-vector, concepts/row-tracking, concepts/variant-type, concepts/cross-engine-abac, concepts/foreign-iceberg-table, concepts/format-co-evolution-iceberg-delta. New patterns (1): patterns/scan-planning-as-policy-enforcement-point. Extended (7 existing pages): systems/apache-iceberg (gains Iceberg v3 GA section + catalog-side-primitives section + Seen-in entry — 15th Iceberg face); systems/unity-catalog (gains 6th face — fully-Iceberg-native catalog with five concurrent surface-area expansions); systems/delta-sharing (gains bi-format recipient surface section + Seen-in entry — recipients on Iceberg clients now consume the protocol); systems/delta-lake (gains 11th face — format-co-evolution-with-Iceberg, with verbatim Delta 5.0 disclosure and cross-format-compat for v3 features); systems/unity-catalog-abac (gains cross-engine-ABAC Beta Seen-in entry — first wiki disclosure of UC ABAC moving beyond Databricks compute); concepts/attribute-based-access-control (third canonical instance — cross-engine axis); concepts/credential-vending (gains foreign-Iceberg face — cross-catalog credential vending). Tier-3 marketing-roundup framing acknowledged throughout; no quantitative numbers anywhere in the source; mechanism depth on any single primitive deferred to spec / docs. Architectural significance is consolidation of named entities under one catalog-vendor surface area, not deep mechanism disclosure. -
2026-05-27 — sources/2026-05-27-databricks-reliable-llm-inference-at-scale (Tier-3 Databricks inference platform team — Marius Seritan, Cyrielle Simeone, Andy Zhang, Yu Zhang, Nick Lanham — passes scope on distributed-systems-internals / production-LLM-serving- architecture / multi-tenant-capacity-management grounds). Reliable LLM Inference at Scale. Production architecture behind Databricks' multi-tenant LLM-serving platform processing 125T+ tokens/month of frontier-model traffic. Six structural contributions: (1) Axon named publicly for the first time — the LLM data-plane router built on Dicer — sitting between rate-limiting and the inference runtime. (2) Model units as the LLM-request-cost abstraction: `cost ≈ α·input + β·output
+ γ·...` with β > α (decode > prefill), coefficients per (model, hardware) via auto-benchmarking, prefix-caching + multi-modality modifiers. The unit-of-account that lets Databricks offer VM-equivalent capacity guarantees instead of best-effort capacity. (3) Cost-based load balancing via patterns/cost-based-load-balancing-llm — Dicer-keyed-on-MUs replaces P2C-with-active-requests at LLM scale (verbatim retire-P2C datum: "LLM latencies tend to be high, server counts are lower than scaled out CPU systems, and the cost of misrouting is severe"). (4) Cost-based autoscaling via patterns/model-units-utilization-autoscaling — MU utilisation ratio averaged across pods drives a model-agnostic scaling infrastructure. >80% GPU savings vs static-peak provisioning on bursty workloads. (5) Stateful sessions via patterns/stateful-llm-session-routing — workload requests pin to a Dicer-assigned pod subset for KV prefix-cache locality + bounded blast radius (two purposes, one primitive). (6) Runtime reliability via two named mechanisms: silent-hang detection through prioritised black-box health checks (highest scheduling priority, <5-min detect→kill→recover cycle, false liveness-probe failures several/week → zero); and the multimodal CPU bottleneck fixes — Torchvision over PIL (10× preprocessing speedup) + OMP_NUM_THREADS fix (avoid container CPU throttling from thread oversubscription, e.g. 192 host vCPUs → 12 container vCPUs). Combined: >3× RPS jump on same hardware. Customers named: Superhuman, YipitData, Fox Sports. Hosted models: frontier OS (Kimi, Qwen) + proprietary (OpenAI, Gemini, Claude). New systems (1): systems/databricks-axon. New concepts (7): concepts/model-units, concepts/model-unit-utilization-ratio, concepts/silent-hang-llm-server, concepts/multimodal-cpu-bottleneck, concepts/omp-num-threads-container-misconfiguration, concepts/multi-tenant-llm-capacity-allocation, concepts/non-uniform-llm-request-cost. 
New patterns (5): patterns/cost-based-load-balancing-llm, patterns/model-units-utilization-autoscaling, patterns/prioritized-black-box-health-check, patterns/torchvision-over-pil-image-processing, patterns/stateful-llm-session-routing. Extended: systems/dicer gains a third canonical Databricks face (LLM-router-substrate); systems/databricks-model-serving gains an LLM-specific architecture section structurally distinct from the EDS+P2C+request-concurrency Superhuman face; concepts/power-of-two-choices gains its first canonical retire-for-LLM-serving datum; concepts/request-concurrency-as-autoscaling-signal gains its next-evolution-to-MUs cross-reference. The platform's reliability story is now documented at platform-internals depth across two layers: the platform/runtime layer from the Superhuman post (2026-05-08) and the LLM-router/multi-tenant-capacity layer from this post.
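The model-unit accounting and the two cost-based mechanisms above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the coefficient values, the `CostCoefficients` type, the least-outstanding-MU pod choice (a stand-in for the Dicer-based routing the source names, whose algorithm is not disclosed), and the proportional replica rule. The elided γ term and the prefix-caching and multi-modality modifiers are left out because the source does not spell them out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CostCoefficients:
    """Hypothetical per-(model, hardware) coefficients; the source derives them by auto-benchmarking."""
    alpha: float  # cost per input (prefill) token
    beta: float   # cost per output (decode) token; beta > alpha per the source


def model_units(input_tokens: int, output_tokens: int, c: CostCoefficients) -> float:
    """cost ≈ α·input + β·output (remaining terms of the source formula omitted)."""
    return c.alpha * input_tokens + c.beta * output_tokens


def pick_pod(outstanding_mu: dict[str, float]) -> str:
    """Assumed routing rule: send the request to the pod with the least outstanding MUs."""
    return min(outstanding_mu, key=outstanding_mu.get)


def mean_utilisation(used_mu: list[float], capacity_mu: list[float]) -> float:
    """MU utilisation ratio averaged across pods."""
    ratios = [used / cap for used, cap in zip(used_mu, capacity_mu)]
    return sum(ratios) / len(ratios)


def desired_replicas(total_used_mu: float, per_pod_capacity_mu: float, target: float) -> int:
    """Assumed proportional rule: enough pods to keep utilisation at or below `target`."""
    return max(1, math.ceil(total_used_mu / (per_pod_capacity_mu * target)))


coeffs = CostCoefficients(alpha=1.0, beta=4.0)  # illustrative values only
request_cost = model_units(input_tokens=1000, output_tokens=200, c=coeffs)
chosen = pick_pod({"pod-a": 500.0, "pod-b": 120.0})
ratio = mean_utilisation([60.0, 90.0], [100.0, 100.0])
replicas = desired_replicas(total_used_mu=300.0, per_pod_capacity_mu=100.0, target=0.5)
```

Because both routing and scaling read the same unit, a long-decode request and a short-prompt request are compared on cost rather than on request count, which is the point of replacing active-request counting at LLM scale.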
-
2026-05-27 — sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures (Tier-3 Databricks Engineering reliability roadmap on Lakebase / Neon — passes scope on distributed-systems-internals / production-AZ-outage / serverless-Postgres-SLO grounds). Authors: Jasraj Dange, Hans Norheim, Stas Kelvich, John Spray. Reframes serverless-Postgres reliability for the agentic-workload era — "agents create 4× as many databases as humans do"; "starting tens of millions of databases every day". Six pillars: (1) stateless Postgres compute + zone-redundant storage default-on for all tiers — eliminates the hot-standby tax + 10s- of-minutes WAL-replay crash recovery (verbatim); (2) "control plane is the new data plane" — split a hot-path data-plane controller off the management plane; empirical signal: 90% of compute sessions for auto-suspending databases in Neon are <10 min; (3) Critical- path dependency minimisation via bare- metal pool + own vertical-autoscaling virtualisation layer + own zone-resilient storage — collapses the 5-link cloud- provider control-plane chain to one already-completed dependency; (4) Cell-based architecture with regional cell composition — canonical production-AZ- outage instance on 2026-05-08 us-east-1 thermal-event: "the cell-based architecture reduced the impact by roughly an order of magnitude" — ~13% of databases in the region affected (1/8 = 8 cells, 7 failed-over correctly + 1 imperfectly); (5) Failure simulation + injection with failpoints + SQLancer / SQLsmith correctness validators, escalating to whole-AZ network-partition drills with 30-second-or-better per-database outage target; (6) Per-database availability attainment as the SLO measurement substrate (vs fleet-aggregate); two-bar reporting (99.95% / 99.99%); disclosed 2026 H1 attainment table; five SLI menu including the serverless-specific database startup time. 
New systems (2): systems/neon (first dedicated page), systems/sqlsmith (first wiki disclosure). New concepts (7): concepts/control-plane-as-the-new-data-plane (the central reframe), concepts/zone-redundant-storage (the storage-tier property), concepts/critical-path-dependency-minimization (the cloud-provider-control-plane-bypass discipline), concepts/database-availability-attainment (per-database SLO shape), concepts/database-startup-time-sli (the serverless-specific SLI), concepts/whole-az-network-partition-simulation (the next-level drill), concepts/failpoint (the in-code injection primitive). New patterns (4): patterns/preallocated-bare-metal-pool-with-virtualization (the cloud-provider-control-plane-bypass primitive), patterns/separate-data-plane-controller-for-hot-path (the control-plane decomposition pattern), patterns/per-database-availability-attainment (the SLO measurement pattern), patterns/whole-az-network-partition-drill (the chaos drill). Extended (10+ existing pages): systems/lakebase (new reliability-roadmap section + frontmatter), systems/pageserver-safekeeper (new zone-redundant-storage disclosure), systems/sqlancer (first canonical explicitly-adopted instance), concepts/blast-radius (~13% / ~order-of-magnitude quantified production-AZ-outage instance), concepts/cell-based-architecture (canonical serverless-database regional-composition instance), concepts/control-plane-data-plane-separation (inversion-corner-case under agentic workloads), concepts/static-stability (fourth canonical instance at cloud-provider-control-plane-bypass altitude), concepts/stateless-compute (Postgres stateless-compute reliability framing), concepts/chaos-engineering (Lakebase three-altitude regime), concepts/availability-zone-failure-drill (sibling whole-AZ-partition shape), patterns/cell-based-architecture-for-blast-radius-reduction, patterns/continuous-fault-injection-in-production. 
Caveats: data-plane controller and whole-AZ partition drill are "in flight" not landed; cell-count for us-east-1 implicit not stated; vertically-autoscaling virtualisation layer linked but not detailed; April-2026 attainment dip (99.96 → 99.93; 99.81 → 99.75) unexplained.
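To make the per-database versus fleet-aggregate distinction in pillar (6) concrete, here is a small Python sketch. It assumes that "attainment" means the share of databases whose own availability meets a bar, which is a reading of the source rather than its stated definition; the eight-database fleet, the 30-day window, and the one-hour outage are invented for illustration.

```python
WINDOW_SECONDS = 30 * 24 * 3600  # hypothetical 30-day measurement window


def availability(downtime_seconds: int, window_seconds: int = WINDOW_SECONDS) -> float:
    """Availability of a single database over the window."""
    return 1.0 - downtime_seconds / window_seconds


def attainment(downtimes: list[int], bar: float) -> float:
    """Share of databases whose individual availability meets the bar (assumed definition)."""
    met = sum(1 for d in downtimes if availability(d) >= bar)
    return met / len(downtimes)


def fleet_aggregate(downtimes: list[int]) -> float:
    """Single fleet-wide availability number: total uptime over total time."""
    return 1.0 - sum(downtimes) / (len(downtimes) * WINDOW_SECONDS)


# Invented fleet: seven databases with no downtime, one with a one-hour outage.
downtimes = [0, 0, 0, 0, 0, 0, 0, 3600]

per_db_9995 = attainment(downtimes, bar=0.9995)  # one database in eight misses the bar
per_db_9999 = attainment(downtimes, bar=0.9999)
fleet = fleet_aggregate(downtimes)               # still above 99.95% in aggregate
```

In this toy fleet the aggregate figure clears the 99.95% bar even though one database in eight does not, which is the masking effect that per-database measurement avoids.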
-
2026-05-27 — sources/2026-05-27-databricks-bi-serving-pointers-maximizing-for-performance-and-tco (Tier-3 Databricks Engineering BI-serving walkthrough — passes scope on the substantive architectural-framings ground despite the vendor-walkthrough format). Frames the four-layer BI serving stack on the Databricks Lakehouse (physical storage → semantic layer → automatic materialization → DBSQL warehouse / caching tier) with the thesis that "each layer compounds the performance gains of the layer below it". Six new wiki pages:
- systems/databricks-metric-views (first wiki disclosure as a distinct named primitive) — UC-resident headless-BI semantic layer; define metrics ONCE, every consumer (AI/BI Dashboards / Genie / SQL notebooks / third-party BI tools) resolves the same `MEASURE()` definition; semantic metadata (`display_name` / `comment` / `synonyms`) is the AI-grounding contract for Genie.
- systems/databricks-predictive-optimization (first wiki disclosure as a dedicated system page) — auto-`OPTIMIZE` / `VACUUM` / stats collection inline-during-Photon-writes plus existing-table back-fill; 22% average performance improvement in observed workloads; new `CLUSTER BY AUTO` extension to Liquid Clustering.
- systems/databricks-sql-warehouses (first wiki disclosure as a distinct system) — serverless auto-scaling; two-tier cache hierarchy (disk cache + Query Result Cache) for repetitive BI workloads; reflexive `system.billing.usage` / `system.query.history` observability surface.
- systems/databricks-ai-bi-dashboards (first wiki disclosure) — first-party Databricks dashboard surface; one of four named consumers of Metric Views.
- concepts/headless-bi-semantic-layer (first wiki canonicalisation) — "define metric ONCE, every consumer resolves the same"; AI-grounding via semantic metadata sibling to layered grounded context from Cloudflare Skipper.
- concepts/metric-view-materialization (first wiki canonicalisation) — automatic pre-aggregation + incremental refresh + intelligent query rewriting + transparent routing; collapses three coupled artifacts (aggregate tables + refresh pipelines + BI-tool query updates) into one governed primitive.
Three new concepts: concepts/automatic-table-optimization (the substrate-owned-OPTIMIZE / VACUUM / stats shape, beyond Databricks-specific framing); concepts/dbsql-caching-tiers (the two-tier disk-cache + QRC hierarchy); concepts/optimizer-statistics-as-skipping-substrate (the reframing that stats are not just plan-quality input but the substrate that makes data skipping possible).
Four new patterns: patterns/governed-metric-as-headless-bi-substrate (define metrics in the catalog, every consumer resolves the same); patterns/auto-materialized-aggregation-via-semantic-layer (enable materialization on the metric, no separate aggregate tables / refresh pipelines / BI-tool queries to maintain); patterns/managed-table-as-default-storage-layer (use UC managed tables across all medallion layers, not just Gold); patterns/query-rewrite-to-pre-aggregated-materialization (the optimiser-side counterpart pattern that makes transparent routing work).
Existing pages extended:
- systems/liquid-clustering gains the CLUSTER BY AUTO
Predictive-Optimization-driven automatic key-selection
disclosure + the "replaces static partitioning AND manual
Z-ORDER, redefinable without rewriting existing data"
framing.
- systems/uc-managed-tables gains the "foundation of
everything in the BI serving stack" framing + the
cross-medallion-layer recommendation.
- systems/unity-catalog gains a twelfth canonical face
— BI-serving / semantic-layer substrate hosting Metric
Views.
- systems/databricks-genie gains a Metric-Views consumer
role with semantic-metadata-as-prompt-engineering grounding.
- concepts/star-schema gains a Gold-tier BI-serving
canonicalisation alongside its prior LLM-pipeline-state
canonicalisation; names the platform-resident dimensional-
modelling primitives (PK / FK with RELY hint, identity
columns, CHECK / NOT NULL) and the Silver-Data-Vault →
Gold-star-schema layering recommendation.
Operational disclosures: the only quantitative figure is the 22% average performance improvement from Predictive Optimization in observed workloads (cited via a separate Databricks post). The article is otherwise architectural — no concurrency / scaling envelope, no per-tenant numbers, no multi-tenant isolation depth, no materialization freshness contract under high ingest, no query-rewriter coverage envelope. Open standard provenance: SPARK-54119 (Apache Spark Metric Views OSS implementation) + UC OSS support coming. Closing thesis: "Each layer compounds the performance gains of the layer below it." The compounding shape — physical-layer optimisation → fewer rows scanned at the materialization layer → fewer rows aggregated at the semantic layer → faster consumer queries — is the architectural payoff over per-tool BI semantic layers + hand- built aggregate tables + refresh pipelines.
Sibling to Cloudflare Town Lake / Skipper (both arrive at the same architectural thesis from opposite directions — Cloudflare's R2 Data Catalog composes managed-Iceberg-on-R2 + DataHub + Trino + Skipper at the data-platform layer; Databricks composes UC managed tables + Metric Views + DBSQL warehouses + Genie at the BI-serving layer; both expose semantic metadata as the AI-grounding substrate). Sibling to sources/2026-05-14-databricks-expanded-interoperability-with-unity-catalog-open-apis (both name UC managed tables as the foundation; the May-14 post is open-engine-access-focused, the May-27 post is BI-serving-stack-focused).
- 2026-05-23 — sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction (Tier-3 Databricks customer-success post on the Octopus Energy margin data pipeline rebuild for the UK regulatory transition to Market-wide Half-Hourly Settlement — passes scope on production-architecture-internals + scaling-trade-offs grounds despite the customer-success framing. Volume-driver: 2 meter reads/customer/month → 48 reads/customer/day = 48× data-point increase for 8M+ customers, projecting +$1M/yr in unsustainable compute under the legacy monolithic-monthly-grain pipeline. The architectural diagnosis is the article's central concept — grain misalignment: "The legacy pipeline had been built around a single grain: monthly. […] Running all three through a single monolithic pipeline meant processing the entire dataset on every run, regardless of what had actually changed." The rebuild splits the monolith into three grain-aligned streams (Settlement HH for industry settlement / Half-Hourly for smart-tariff revenue (EVs, heat pumps, ToU) / Monthly for standard tariff) on a unified multi-terabyte multi-grain source-of-truth layer — orchestrated by a "Job of Jobs" parent-child pattern preserving each stream's independent tuning profile ("what works as a Spark optimisation for Settlement is not necessarily right for NHH"). 
The single highest-leverage optimisation: Delta CDF-based incremental processing of the upstream consumption layer — rows/run dropped 25 B → 300 M (98.8% reduction), freshness weekly → daily. Layered Spark/Delta optimisations: broadcast joins for reference tables under 500 MB (eliminating shuffle on multi-key joins with date ranges), liquid clustering on filter/join columns ("avoids the small-file problem, higher memory consumption, and I/O overhead that come from over-partitioning"), lineage simplification + early column / row pruning, and the counter-instinct fourth lever — removing custom optimisation code in favour of Spark AQE ("In several cases, Spark's Adaptive Query Execution (AQE) outperformed hand-tuned logic. The team removed custom optimisation code and let AQE do its job"). Databricks Serverless named as the development-velocity enabler — "The testing and development process could not have been done without serverless. Using the serverless UI helped us to identify bottlenecks and make easy comparisons between different runs" — zero cluster startup + side-by-side run comparison made the three-month delivery window viable. Final geometry: $0.48 per settlement date — 50× below the projected MHHS cost ($23.63), 2× below the legacy ($0.71) despite processing 48× more data points. ~$1M annualised cost avoidance excludes the upstream-incremental savings, which are additional. Three engineers, three months. Generalisation made explicit: "Any time a system moves from monthly to daily, daily to real-time, or aggregate to transactional, the same dynamics apply" — four transferable takeaways: grain misalignment is the hidden cost driver, incremental processing transforms pipeline economics, remove before you add, trust the optimiser. New systems (3): Octopus Margin Data Pipeline, Spark AQE (promoted from recurring-tag mention to dedicated page), Liquid Clustering (promoted from recurring-tag mention to dedicated page). 
New concepts (3): concepts/grain-misalignment, concepts/data-pipeline-grain, concepts/remove-before-add-optimization. New patterns (4): patterns/grain-aligned-stream-split, patterns/cdf-incremental-replacing-full-rescan, patterns/broadcast-join-for-small-reference-tables, patterns/job-of-jobs-orchestration. New company: Octopus Energy — UK retail energy supplier, 8M+ customers; first wiki disclosure. Extended: concepts/delta-change-data-feed gains its multi-terabyte upstream-substrate face (distinct from the Bronze→Silver promotion face from the Claroty source); systems/delta-lake, systems/apache-spark, systems/databricks-serverless-compute each gain a new "seen in" face. Saad Ali, Lead of the Margin Data Team at Octopus Energy: "You can't just throw more compute at a problem like this. You have to rebuild and rethink your logic from the ground up.").
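The CDF-based incremental step in this entry can be pictured with a plain-Python stand-in. This is not Spark or Delta code: the row shape, the `customer` and `kwh` field names, and the running-total aggregate are assumptions for illustration. Only the `_change_type` values (`insert`, `update_preimage`, `update_postimage`, `delete`) follow Delta Change Data Feed conventions.

```python
from collections import defaultdict


def full_rescan(rows: list[dict]) -> dict[str, float]:
    """Monolithic approach: recompute every total from the whole table on each run."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["kwh"]
    return dict(totals)


def apply_changes(totals: dict[str, float], changes: list[dict]) -> dict[str, float]:
    """Incremental approach: touch only the rows that appear in the change feed."""
    updated = defaultdict(float, totals)
    for change in changes:
        kind = change["_change_type"]
        if kind in ("insert", "update_postimage"):
            updated[change["customer"]] += change["kwh"]  # new value enters the total
        elif kind in ("delete", "update_preimage"):
            updated[change["customer"]] -= change["kwh"]  # old value leaves the total
    return dict(updated)


# Yesterday's table and the totals computed from it.
table_v1 = [{"customer": "a", "kwh": 1.0}, {"customer": "b", "kwh": 2.0}]
totals_v1 = full_rescan(table_v1)

# Today's change feed: one new read for "a", one corrected read for "b".
changes = [
    {"customer": "a", "kwh": 0.5, "_change_type": "insert"},
    {"customer": "b", "kwh": 2.0, "_change_type": "update_preimage"},
    {"customer": "b", "kwh": 2.5, "_change_type": "update_postimage"},
]
totals_v2 = apply_changes(totals_v1, changes)

# Today's full table, used to check the incremental result against a full recomputation.
table_v2 = [
    {"customer": "a", "kwh": 1.0},
    {"customer": "a", "kwh": 0.5},
    {"customer": "b", "kwh": 2.5},
]
```

The work done by `apply_changes` scales with the size of the change feed rather than the size of the table, which is the effect behind the 25 B → 300 M rows-per-run figure quoted above.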
- 2026-05-22 — sources/2026-05-22-databricks-observability-any-agent-anywhere-otel-unity-catalog
(Tier-3 Databricks blog GA announcement of OTel-format trace
ingestion direct to Unity Catalog Delta
tables — the agent-side OTel companion to the 2026-05-20
full-payload Inference Tables
substrate. Three new systems canonicalised:
Zerobus Ingest (managed serverless
OTLP/gRPC + REST receiver — "With a 'single-sink' architecture,
Zerobus Ingest simplifies observability by streaming data
directly to the lakehouse. Existing OTLP-compatible collectors
can point directly to this endpoint via gRPC, entirely bypassing
intermediate message buses like Kafka"),
UC OTel Trace Tables (six MLflow-derived Delta views: `<prefix>_otel_spans`, `_otel_logs`, `_otel_metrics`, `_otel_annotations`, `_trace_unified`, `_trace_metadata` — auto-liquid-clustered, MLflow per-experiment trace cap removed, unbounded storage), and MLflow OTel Tracing (the framework-side instrumentation surface — `mlflow.<lib>.autolog()` + `@mlflow.trace` decorator combined; provisions the UC tables from Python; agent-runs-anywhere portability "In fact the support assistant agent example that was used for this blog is deployed locally"). Three new concepts: concepts/single-sink-telemetry-architecture (the no-broker shape — managed receiver direct to Delta), concepts/instrumentation-storage-decoupling (OTel as protocol-portable boundary — "using the OTel standard to separate instrumentation from storage"), and concepts/production-traces-as-evaluation-substrate (durable prod traces become MLflow eval-dataset bootstrap — "these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to purely synthetic test cases" — same judges run continuously on live traces). Three new patterns: patterns/managed-otel-ingestion-direct-to-lakehouse (the Zerobus shape canonicalised), patterns/bootstrap-eval-dataset-from-production-traces (SQL-warehouse-driven dataset materialisation), and patterns/component-level-latency-from-otel-spans (per-tool P50/P99 dashboards over `_otel_spans`, finer than trace-level latency that native dashboards default to — "That tells us whether the LLM, a Genie tool call, or another step is the bottleneck"). Operational disclosures: 200 QPS starting ingest throughput (account-team escalation for higher), storage limit none, MLflow per-experiment trace cap removed, auto liquid-clustering post latest product update. Three SaaS-vs-lakehouse asymmetries argued verbatim: "retention economics" (object storage cheaper than SaaS), "the PII deadlock" (no third-party data egress — UC column masking + row filtering apply automatically because traces are governed Delta tables), and "analytics, not just telemetry" (joinable with business data). 
Customer scale points named: Experian (Eva + Latte agents — "hundreds of thousands of traces"), Superhuman/Grammarly ("hundreds of thousands of traces per day" — explicitly replacing a custom point solution: "that maintenance burden was a real pain point for our teams"), SmartSheet (two production agents in "three-day co-build", "tens of thousands of evaluations"), The Standard (insurance underwriting + claims agents). Reference agent: LangGraph + Databricks-hosted Claude Sonnet 4.6 + Genie tool over MCP, deployed locally. Cross-links extended: patterns/telemetry-to-lakehouse gains its third citation — agent-trace specialisation distinct from the 2026-04-17 metrics+traces face and the 2026-05-20 full-payload face; systems/mlflow gains a sixth+ face (OTel-tracing-direct-to- UC, with the prod-traces-as-eval-substrate flow); systems/opentelemetry gains the "protocol-portable boundary between agent instrumentation and lakehouse storage" face; concepts/lakehouse-native-observability extends from the 2026-05-05 metric-time-series face to span/log/metric agent- trace face; systems/inference-tables gains a sibling- substrate cross-reference distinguishing model-call-payload granularity (Inference Tables) from agent-execution-span granularity (UC OTel Trace Tables) — both UC-resident, both governed under one catalog.) -
2026-05-22 — sources/2026-05-22-databricks-how-world-bank-group-uses-databricks-to-eradicate-poverty-through-shared-knowledge (Tier-3 Databricks customer-success post on World Bank Group's Knowledge 360 + Data 360 unified-platform build. Architectural primitives: per-domain Genie instances each pinned to a metrics layer + RAG agent over UC Volumes + Vector Search + agentic router with three named classifiers (intent / domain / decomposer)
+
decoupled visualisation agent + AI Gateway as control plane. Two new architectural disclosures: (1) Genie's default LLM-only output is nondeterministic enough to be unfit for financial / operational reporting (Suresh Kaudi: "In the structured content, you need an answer. What is my bank balance? I don't want to see a different number every time") — drives the metrics-layer-for-deterministic-Genie-answers retrofit. (2) Each Genie is per-metrics-layer / per-domain, so cross- domain queries break single-Genie — drives the intent-domain- decomposer agentic-router fan-out shape. Operational scale: 3M document downloads / month through the AI-powered layer, half from low- and middle-income countries; external-feedback prototype built and deployed in ~2.5 days. Borderline-include Tier-3 ingest — mechanism-light throughout (classifier model choice, decomposition strategy, metrics-layer implementation substrate not disclosed) but the architectural shape and failure modes are specific enough to canonicalise. New patterns (2): patterns/intent-domain-decomposer-agentic-router (the three-classifier router with set-output-and-fan-out generalisation of multi-agent supervisor routing); patterns/metrics-layer-for-deterministic-genie-answers (the pin-Genie-to-metrics-layer retrofit for same-question-same-answer contracts). New company: companies/world-bank-group (canonical wiki home for the WBG knowledge platform; complements the existing Tier-3 Databricks-customer corpus). Cross-links added: sixth canonical face of Genie on the wiki; first Genie deployment fronted by an intent-domain-decomposer router. Composes with patterns/multi-agent-supervisor-routing (Virtue Foundation's alternative-selection sibling pattern) and patterns/transform-upstream-to-collapse-measures (Trinity Industries' upstream measure-consolidation complement). Confirms the 2026-05-20 Governing AI agents at scale scope-generalisation thesis (AI Gateway gates non-coding-agent populations too).)
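The intent / domain / decomposer router described in this entry can be sketched as follows. This is a minimal illustration, not the World Bank Group implementation: the source does not disclose the classifier models or the decomposition strategy, so keyword matching stands in for all three classifiers, and every name here (`DOMAIN_KEYWORDS`, `classify_intent`, `classify_domains`, `decompose`, `route`) is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical domain vocabulary; keyword matching is an assumption that
# stands in for the undisclosed classifier models.
DOMAIN_KEYWORDS = {
    "finance": {"loan", "balance", "disbursement"},
    "health": {"vaccination", "mortality", "clinic"},
}


@dataclass(frozen=True)
class SubQuery:
    domain: str
    text: str


def _words(question: str) -> set[str]:
    return set(question.lower().replace("?", "").split())


def classify_intent(question: str) -> str:
    """Stub intent classifier: structured metrics question vs document lookup."""
    metric_words = {"how", "many", "total", "average", "balance"}
    return "structured" if _words(question) & metric_words else "document"


def classify_domains(question: str) -> set[str]:
    """Stub domain classifier. Returns a set so cross-domain questions fan out."""
    words = _words(question)
    return {d for d, kws in DOMAIN_KEYWORDS.items() if words & kws}


def decompose(question: str, domains: set[str]) -> list[SubQuery]:
    """Stub decomposer: one sub-query per matched domain, in sorted order."""
    return [SubQuery(d, f"[{d}] {question}") for d in sorted(domains)]


def route(question: str, handlers: dict) -> dict[str, str]:
    """Run the three classifiers, then fan out to one handler per domain."""
    if classify_intent(question) == "document":
        return {"rag": handlers["rag"](question)}
    subs = decompose(question, classify_domains(question))
    return {s.domain: handlers[s.domain](s.text) for s in subs}


# Each handler stands in for a per-domain Genie instance or the RAG agent.
handlers = {
    "finance": lambda q: f"finance-genie({q})",
    "health": lambda q: f"health-genie({q})",
    "rag": lambda q: f"rag-agent({q})",
}
answers = route("How many clinic projects have a loan disbursement?", handlers)
```

The set-valued domain classifier is what separates this shape from single-target supervisor routing: a question touching two domains produces two sub-queries, one per domain-pinned handler.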
-
2026-05-22 — sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models (Tier-3 Databricks blog post; short GA announcement of implicit prompt caching for open-weights models served on the Foundation Model APIs). Architectural news in three load-bearing design choices: (1) implicit — "customers do not need to configure anything, our system has built to automatically run the prompt caching and reuse to improve throughput"; (2) volatile-only, tenant-isolated, never persisted — the safety envelope that lets default-on caching ship on multi-tenant infrastructure without an encryption-at-rest threat model; (3) inherited platform-wide — Agent Bricks, Genie, AI Functions all inherit caching at no integration cost. Disclosed numbers from the GPT-OSS production batch-inference rollout: +2.5× per- replica input-token throughput, 3× P50 latency reduction at a 30% cache hit ratio. Catalog covered: GPT-OSS 20B + 120B, Gemma 3 12B, Llama 3.1 8B (including PEFT-served fine-tuned variants), Llama 3.3 70B. New system (1): Databricks FMAPI Prompt Caching — the named platform feature. New concepts (2): concepts/implicit-prompt-caching (the design choice that caching is platform-decided, zero-configuration, and contrasts with explicit caching APIs from Anthropic / OpenAI / Google); concepts/volatile-only-prompt-cache-isolation (the multi-tenant security shape composing tenant isolation + RAM-only residency + no persistence). New pattern (1): patterns/implicit-prompt-cache-as-platform-default (the default-on substrate-layer rollout pattern, contrasted with explicit-API integration). Cross-link added: companion to concepts/session-affinity-prompt-caching (Cloudflare's contrasting client-signal-driven design) and concepts/kv-cache (the underlying primitive being reused).
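A minimal sketch of the tenant-isolated, memory-only prefix reuse this entry describes. It is an illustration under stated assumptions, not the FMAPI implementation: token lists stand in for KV-cache blocks, the cache stores only prefix lengths, and the class name, method names, and `block_size` parameter are all hypothetical.

```python
import hashlib


class VolatilePrefixCache:
    """In-memory, per-tenant prefix cache. Nothing is ever written to disk."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        # (tenant, prefix hash) -> number of tokens covered by that prefix
        self._store: dict[tuple[str, str], int] = {}

    def _key(self, tenant: str, tokens: list[str]) -> tuple[str, str]:
        digest = hashlib.sha256("\x1f".join(tokens).encode()).hexdigest()
        # The tenant is part of the key, so one tenant can never hit
        # another tenant's entries even for an identical prompt.
        return (tenant, digest)

    def insert(self, tenant: str, tokens: list[str]) -> None:
        """Record every block-aligned prefix of the prompt for this tenant."""
        for n in range(self.block_size, len(tokens) + 1, self.block_size):
            self._store[self._key(tenant, tokens[:n])] = n

    def cached_prefix_len(self, tenant: str, tokens: list[str]) -> int:
        """Return how many leading tokens are already cached for this tenant."""
        n = (len(tokens) // self.block_size) * self.block_size
        while n > 0:  # try the longest block-aligned prefix first
            if self._key(tenant, tokens[:n]) in self._store:
                return n
            n -= self.block_size
        return 0


cache = VolatilePrefixCache(block_size=4)
system = "you are a helpful support assistant for acme".split()  # 8 tokens
cache.insert("tenant-a", system + ["where", "is", "my", "order"])

# Same tenant, same system prompt, different question: the shared prefix is reused.
reused = cache.cached_prefix_len("tenant-a", system + ["reset", "my", "password", "please"])
# Different tenant, identical prompt: no reuse.
isolated = cache.cached_prefix_len("tenant-b", system + ["where", "is", "my", "order"])
```

Callers never pass a cache hint, which is the "implicit" part: the lookup runs on every request and the caller cannot tell whether a prefix was reused except through latency.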
-
2026-05-20 — sources/2026-05-20-databricks-marketing-campaigns-with-lakebase (Tier-3 Databricks blog post; integration tutorial for SAP Engagement Cloud at Deichmann + architecture pitch for Lakebase as the canonical bursty-OLTP backend for marketing campaigns). Borderline-include: ~70% click-through tutorial, ~30% architecture, but the architecture content surfaces three genuinely-new wiki canonicalisations. New systems (2): Lakebase Synced Tables (managed Delta → Postgres materialisation with three sync modes — snapshot / triggered / continuous — and the load-bearing operational rule "when more than 10% of the data is updated, we recommend snapshot mode, which delivers 10x better performance than triggered mode"); Lakehouse Sync (the Postgres → Delta direction — "a native, continuous CDC-based pipeline from Lakebase Postgres to Unity Catalog Delta tables" — closing the bidirectional governed-data path between operational and analytical tiers without hand-maintained CDC pipelines). New concept (1): Lakebase Local File Cache (LFC) — first wiki disclosure of LFC as Lakebase's compute-VM-local cache of Pageserver pages, alongside the two Lakebase-specific Postgres query statistics
PREFETCH (prefetch requests issued/hit/wasted) and FILECACHE (LFC hits/misses) as the load-bearing observability layer for diagnosing storage-compute-boundary performance issues. New patterns (2): patterns/snapshot-sync-mode-for-batch-rebuild (the 10% / 10× rule of thumb generalised — choose snapshot mode when the per-cycle delta exceeds the implementation-specific crossover threshold, where bulk-copy efficiency overtakes per-row diff/merge cost; counterintuitive because snapshot rewrites the entire table, but for high-delta workloads the bulk-copy path wins by an order of magnitude on the disclosed Lakebase implementation); patterns/native-postgres-roles-for-non-databricks-aware-partners (OAuth's hourly token rotation incompatibility with partner systems like SAP Engagement Cloud forces a fall-back to native Postgres password roles with operator-managed rotation; explicit security-discipline-instead-of-mechanism tradeoff). Lakebase page extension: seventh canonical face for Lakebase added to systems/lakebase — the bidirectional governed-data plane between operational and analytical tiers, with LFC as the compute-side observability layer for the storage-compute boundary. Concrete sizing disclosure: bursty marketing-campaign workload at Deichmann uses scale-to-0 → 16 CU (~32 GB RAM) on Lakebase Autoscaling, with the architectural justification that "Lakebase autoscaling speed and reactivity eliminate the risk of resource underutilization" — sub-second scale-down makes generous max-cap sizing safe. Marketing-campaign customer segments positioned as the canonical bursty workload applied to OLTP rather than to observability databases (the prior canonical context). Standard Postgres tuning surface (pg_stat_statements, work_mem 256 MB on larger compute, autovacuum_vacuum_scale_factor for high-churn tables) confirmed unchanged. TLS chain: Let's Encrypt; partner systems must trust ISRG Root X1. -
2026-05-20 — sources/2026-05-20-databricks-governing-ai-agents-at-scale-with-unity-catalog (Tier-3 Databricks blog vision/positioning post extending the 2026-04-17 Unity AI Gateway launch from coding-agent scope to org-wide agent governance across every department — dev / analytics / sales-ops / support / marketing / finance). Borderline-include: vision-heavy with no scale numbers, no internals, but the four-pillar framing and five named architectural extensions (Service Policies, Inference Tables, Lakewatch, Guardrails, Budgets) are individually citable. Canonical four-pillar framing for agent governance ((1) Delegated access — three-layer permissions/Service-Policies/Guardrails composition; (2) Data-centric AI governance — "AI governance is data governance"; (3) Cost intelligence — usage-tracking + Budgets; (4) Open and interoperable — governance-travels-with-resources). New systems (5): Service Policies (UC functions attached to registered MCPs evaluating tool calls before execution; ternary
allow/deny/consent; fail-closed on deny; canonical instance of patterns/policy-as-uc-function-attached-to-mcp); Inference Tables (full payload of every model call — "the exact prompt sent, the exact response returned, token counts and latency" — written to UC-managed Delta tables, customer-controlled retention; canonical instance of patterns/inference-payload-table-for-audit); Lakewatch (Databricks' agentic SIEM built on the security lakehouse — first wiki disclosure; "Attackers are using agents. Defenders should too."); Unity AI Gateway Guardrails (inline content scanning of every model call — inputs for PII + jailbreak, outputs for hallucinations + sensitive content; fail-closed on every request; canonical instance of concepts/inline-llm-content-guardrail); Unity AI Gateway Budgets (per-user / per-group monthly spend thresholds with alerts; hard enforcement on roadmap). New concepts (4): concepts/four-pillars-of-agent-governance (the canonical four-pillar framing distinct from the 2026-04-17 three-pillar shape — adds delegated-access and open-interoperable as load- bearing additions); concepts/data-centric-ai-governance ("an agent's behavior is almost entirely determined by the data it has access to" → AI governance and data governance must be one system; the data-classification → tag → ABAC pipeline applies to agent traffic without AI-specific configuration); concepts/inline-llm-content-guardrail (canonical concept distinct from CI/quality-gate-based concepts/ai-agent-guardrails; bidirectional, inline, per-request, fail-closed scanning of LLM I/O); concepts/governance-travels-with-resources (Pillar 4 principle: "governance becomes a property of your platform rather than something you rebuild for each new framework or model" — same UC + AI Gateway across LangGraph / CrewAI / OpenAI SDK / Anthropic SDK / AutoGen / LlamaIndex; same gateway across Databricks-hosted / Azure OpenAI / AWS Bedrock / Anthropic). 
New patterns (3): patterns/three-layer-agent-control (the load-bearing composition for Pillar 1 — permissions key on identity, Service Policies key on tool-name+args+identity, Guardrails key on payload content; each layer's decision input is strictly weaker than what the next layer needs); patterns/inference-payload-table-for-audit (full request/response capture in lakehouse-resident tables — breaks the conventional completeness-vs-cost tradeoff that APM-style logging imposes); patterns/policy-as-uc-function-attached-to-mcp (catalog-managed policy code attached to the resource (MCP server), not the agent — so framework-agnostic; UC functions inherit UC's versioning + audit + ownership lifecycle). Extends 7 existing pages: systems/unity-catalog (eleventh face — AI-asset-governance substrate for "LLMs, MCP servers, skills, and agents"; first explicit end-to-end OBO disclosure — "identity flows… from the user who asks the question to the specific table row the agent retrieves" — with dual-identity audit logging); systems/unity-ai-gateway (org-wide-agent generalisation + five new architectural surfaces: Service Policies, Guardrails, Inference Tables, Budgets, Lakewatch substrate); systems/model-context-protocol (MCP servers now registered in UC as securables, governed with Service Policies + OBO); concepts/centralized-ai-governance (three-pillar → four-pillar comparison table); concepts/governed-agent-data-access (Databricks instance of Gallego's two-axis framing, with three-layer + four-pillar framings as more-granular siblings); concepts/coding-agent-sprawl (generalised to org-wide agent sprawl across every department; talent-flight as a third risk axis); patterns/on-behalf-of-agent-authorization (Databricks generalisation + "specific table row" granularity disclosure + dual-identity audit logging requirement); patterns/central-proxy-choke-point (generalised from coding-agent scope to org-wide agent scope; canonicalises the three-layer-agent-control composition the 
choke-point hosts); patterns/telemetry-to-lakehouse (full-payload specialisation via Inference Tables — distinct from but co-existing with the metrics+traces variant). - 2026-05-20 — sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries (Tier-3 Databricks-for-Good co-marketing post — Virtue Foundation's VF Match platform connects medical volunteers to opportunities in 72 low / low-middle income countries via a production-grade Foundational Data Refresh (FDR) pipeline on Databricks). Borderline-include: ~40% Databricks-for-Good framing in the bookend paragraphs but the Building the Foundation / Entity Resolution at Scale / VF Agent sections (~60%) name a specific architectural shape worth canonicalising — a multi-step LLM extraction pipeline over 25M+ web pages orchestrated by Lakeflow Jobs across 15+ interdependent tasks, with star-schema state + status-based checkpointing + a configurable extraction registry; an Entity Resolution stage built on the open-source Splink probabilistic record-linkage framework with a quantified curse-of-the-last-reducer observation (one Spark partition running 30 minutes vs 52-second median — ~35× ratio) reduced 15× to ~2 minutes by enabling Photon; and a prototype VF Agent multi-agent architecture in LangGraph routing user queries to Vector Search or Genie sub-agents via a Multi-Agent Supervisor. 
New systems (6): systems/vf-match (Virtue Foundation's volunteer-matching marketplace, the user-facing system); systems/vf-agent (prototype natural-language-query layer with four-sub-agent composition on LangGraph); systems/splink (UK-Ministry-of- Justice-origin Apache-2.0 probabilistic record-linkage framework implementing Fellegi-Sunter + EM-driven match-weight estimation, SQL-pluggable backend across Spark / DuckDB / others); first dedicated wiki page for systems/photon (Databricks' C++-native vectorised query engine, previously only mentioned in passing across multiple ingests); systems/langgraph (LangChain ecosystem's graph-based agent-orchestration framework — first wiki disclosure); systems/overture-maps + systems/bright-data (the two complementary data sources feeding FDR — Meta+Microsoft open-source geospatial authority
+ commercial real-time web scraping). New concepts (6): concepts/multi-step-llm-extraction (the break-the-task-into-targeted-steps discipline at LLM-invocation altitude; "dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task"); concepts/status-based-llm-pipeline-checkpointing (per-record state-column primitive enabling 25M+-page pipeline resumability without re-paying LLM cost); concepts/star-schema (the fact-and-dimension data-warehousing schema as state model for LLM pipelines — first wiki canonicalisation of star-schema as the substrate for multi-step LLM extraction state); concepts/probabilistic-record-linkage (Fellegi-Sunter formal-statistical formulation Splink implements); concepts/vectorized-query-engine (the engine class Photon / DuckDB / Velox / ClickHouse all implement, with batch-of-rows SIMD column-major loops); concepts/curse-of-the-last-reducer (the canonical name for straggler-partition-dominates-wall-clock at the reduce stage of batch jobs). New patterns (2): patterns/multi-step-llm-extraction-pipeline (named pattern composing the multi-step + status-checkpointing + registry + star-schema sub-properties; first canonical wiki instance of the production-grade LLM-extraction pipeline shape distinct from the document-extraction sibling patterns/two-pass-classify-then-deep-extract); patterns/multi-agent-supervisor-routing (one supervisor classifies query intent + complexity and routes to one of N alternative-answer-shape sub-agents; distinct from Claroty's collaborative role-decomposition which has agents collaborate on one task — supervisor-routing has agents as alternatives selected per query).
Updated (8 existing pages): concepts/entity-resolution gains Splink as the canonical open-source classical-ER framework (first open-source-ER instance on the wiki; complementary to Claroty's custom-built hybrid stack); concepts/partition-skew-data-skew gains the Spark / Photon batch-altitude instance with the 30 min → 2 min remediation (15× improvement quantification via vectorisation rather than partition redistribution); systems/lakeflow-jobs gains the third canonical face (FDR 15+-task multi-step LLM-extraction-pipeline orchestrator alongside MapAid groundwater + Claroty CSAF — the shape is converging across three independent customers); systems/databricks-genie gains the fifth canonical face (Genie-Agent-as-sub-agent) — Genie used not as a destination chatroom but as an internal subroutine in a larger query-orchestration graph, distinct from BI-replacement / data-agent-internals / embedded-NL-query / migration-handoff faces; systems/mosaic-ai-vector-search gains the Vector-Search-Agent-in-multi-agent-supervisor face; systems/apache-spark gains the LLM-extraction-pipeline + ER face (25M+-page extraction + Splink ER, with the curse-of-the-last-reducer observation pinning Photon's value at non-OLAP altitude — first wiki-quantified instance); companies/databricks tags + Recent articles extended. Operational numbers cited: 72 countries; 50,000+ patients delivered care to date; 25M+ web pages processed through GPT models; 15+ interdependent Lakeflow Jobs tasks; 30 min worst- case partition vs 52 s median (~35× ratio) before Photon; ~2 min worst-case partition after Photon; 15× improvement; 4 named sub-agents in VF Agent prototype. Architectural composition summary: the same Databricks-stack primitives that compose into MapAid groundwater (multi-step LLM extraction on scanned documents) and Claroty CPS (hybrid ER on industrial asset identity) compose into VF Match (multi-step LLM extraction on web pages + Splink ER) — the **multi-step-LLM-extraction
+
entity-resolution + multi-agent-query-layer** trifecta is the recurring architectural shape on Databricks for AI-powered data catalog construction. Caveats: vendor co-marketing post; VF Agent is a prototype; per-step accuracy / cost / latency numbers not disclosed; GPT model versions / per-step prompts not disclosed; only the Photon comparison is quantified.
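The status-based checkpointing named in this entry can be sketched as below. This is a minimal illustration, assuming a per-record `status` column that holds the last completed step; the step names, `run_pipeline`, and the simulated failure are hypothetical, since the source names the technique but not the FDR pipeline's actual steps.

```python
from typing import Callable, Optional

# Hypothetical step order; the real pipeline's steps are not disclosed.
STEPS = ["classify", "extract", "normalise"]


def run_pipeline(records: list[dict], step_fns: dict[str, Callable[[dict], None]]) -> int:
    """Advance each record through its remaining steps.

    Returns the number of step invocations attempted, which is a proxy
    for LLM calls paid for in this run.
    """
    attempts = 0
    for rec in records:
        status: Optional[str] = rec["status"]  # last completed step, or None
        done = STEPS.index(status) + 1 if status else 0
        for step in STEPS[done:]:
            attempts += 1
            try:
                step_fns[step](rec)
            except Exception:
                break  # status stays at the last completed step; retried next run
            rec["status"] = step  # checkpoint after each successful step
    return attempts


flaky = {"fail_once": True}


def extract(rec: dict) -> None:
    # Simulate one transient failure on record 2 during the first run.
    if rec["id"] == 2 and flaky["fail_once"]:
        flaky["fail_once"] = False
        raise RuntimeError("simulated LLM timeout")
    rec["fields"] = f"fields-for-{rec['id']}"


step_fns = {
    "classify": lambda r: r.update(kind="clinic"),
    "extract": extract,
    "normalise": lambda r: r.update(clean=True),
}
records = [{"id": i, "status": None} for i in (1, 2, 3)]

first_run = run_pipeline(records, step_fns)   # record 2 stops after "classify"
second_run = run_pipeline(records, step_fns)  # only record 2's remaining steps run
```

The second run touches only the record that failed, and only from the step it failed at, which is how a resumed run avoids re-paying for completed LLM calls.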
-
2026-05-19 — sources/2026-05-19-databricks-deutsche-borse-zeppelin-to-databricks-notebook-migration (Tier-3 Databricks Blog, customer-co-authored with Deutsche Börse). Borderline-include launch-post-with-architecture: ~30% architecture density passes the AGENTS.md "Borderline cases — include, don't skip: Product launches THAT ALSO contain deep architecture sections" test, with named-primitive disclosure (paragraph-to-cell + interpreter-prefix mapping +
.ipynb JSON reformat as the deterministic stage; Genie + context-encoded prompt as the LLM stage; explicit negative space — "the converter does not rewrite SQL logic, Python logic, visualizations, widgets, Oracle and HDFS references, scheduling logic or business-specific custom code") and a concrete reusable architectural thesis: "separate structure from logic, apply the right tool to each." Forcing function: Cloudera Zeppelin EOL 2027. Created (9 new pages): 1 source + 1 company (companies/deutsche-borse) + 2 systems (systems/apache-zeppelin, systems/deutsche-borse-zeppelin-converter) + 3 concepts (concepts/notebook-format-migration, concepts/heterogeneous-code-migration, concepts/context-encoded-llm-prompt) + 2 patterns (patterns/structural-deterministic-logical-llm-split, patterns/context-encoded-prompt-handoff). Extended (2 systems): systems/databricks-apps gains a fourth canonical face — customer-built migration-tool substrate — distinct from clinical-ops decision-support / Claroty HITL-UI / DBA-automation; the App is itself a migration utility, not a destination workload, and runs inside the destination platform's workspace. systems/databricks-genie gains a fourth canonical face — migration-handoff: consumer of a context-encoded prompt emitted by a deterministic operator-side tool — distinct from BI-replacement / data-agent-internals / embedded-NL-query; pins Genie's effectiveness to upstream context-engineering discipline for a third time on the wiki (alongside Trinity measure-consolidation as load-bearing precondition and the 2026-05-08 rich semantic enterprise context as substrate). Operational results: hours-to-minutes per notebook (manual: hours each → hybrid: 15–20 min each), 2,000-user migration scope, business-user-self-service workflow with no per-notebook engineering team.
Lessons-learned section also pins a notable counter-cyclical signal: a first-attempt agentic architecture was explicitly rejected in favour of a simple linear two-stage pipeline (Stage 1 → handoff → Stage 2), on the rationale that "a more complex agentic architecture added overhead without solving the core problem". Frontend stack disclosure: shadcn UI in production, evolved from a Streamlit prototype. -
2026-05-15 — sources/2026-05-15-databricks-backstage-with-lakebase-part-2 (Tier-3 Databricks Blog, Part 2 of the Thoughtworks Backstage-with-Lakebase series — Governance). Borderline-include: ~70% architecture density passes the AGENTS.md test, with named-primitive disclosure across Lakehouse Federation (operational Postgres exposed as foreign catalog
lakebase_bs in UC), system.access.audit (every Lakebase control-plane action), UC system billing tables (cost attribution by (project_id, branch_id, endpoint_id) with worked numbers 31.6130 DBU prod / 0.0107 DBU transient test branch), branch-propagated masking policies (UC attribute-level masks inherit at branch creation), and two open-source Thoughtworks tools deployed as Databricks Apps: LakebaseOps (three-agent platform — Provisioning / Performance / Health — replacing 51 historical DBA tickets, 7 scheduled Databricks Jobs replacing pg_cron, 9-KPI adoption dashboard, ten-engine migration wizard with live AWS+Azure pricing) and Lakebase MCP (46-tool MCP server with dual-layer governance — SQL-statement guard + per-tool access guard across four profiles read_only / analyst / developer / admin mapping onto UC GRANT — plus per-statement tool-tag attribution). Load-bearing claim: "Because Lakebase is natively embedded inside Databricks, Unity Catalog extends directly over the operational Postgres database." The compliance side-channel (CloudTrail + pgaudit + CloudWatch) collapses into one SQL query against UC system tables. Created (10 new pages): 1 source + 3 systems (systems/lakebaseops, systems/lakebase-mcp, systems/lakehouse-federation) + 3 concepts (concepts/branch-level-cost-attribution, concepts/branch-level-governance-propagation, concepts/operational-analytical-governance-unification) + 3 patterns (patterns/foreign-catalog-federation-for-operational-db-governance, patterns/dual-layer-governance-sql-and-tool-guards, patterns/tool-tagged-query-attribution). Extended (5 pages): systems/lakebase (8th face — governance-substrate-unified operational DB), systems/unity-catalog (10th face — operational-DB governance via Lakehouse Federation), systems/backstage (2nd face — governance-substrate-unified IDP), systems/databricks-apps (3rd face — DBA-automation + AI-agent-DB-access deployment substrate), concepts/database-branching (governance-composition leg of Part 2).
Eighth Lakebase face on the wiki. Cross-source continuity: direct sequel to Part 1 (Deployment Cycles) which canonicalised branching as developer-cycle primitive; Part 2 makes governance inseparable from branching. Companion to the 2026-05-13 UC ABAC GA — same UC ABAC primitives apply to the federated operational DB. Part 3 (FinOps) is forthcoming and explicitly previews "taking the infrastructure ownership data inside Backstage and joining it directly to cloud billing data in a single SQL query." -
2026-05-14 — sources/2026-05-14-databricks-expanded-interoperability-with-unity-catalog-open-apis (Tier-3 Databricks Blog launch post — Expanded interoperability with Unity Catalog Open APIs). Borderline-include: launch framing but ~40% architecture density passes the AGENTS.md borderline test, with named-substrate disclosure across catalog-managed commits, credential vending, and Delta Kernel. Two coordinated milestones: External Access to Managed Tables in Beta (Apache Spark / Apache Flink / DuckDB can create, read, write, and stream to/from UC managed Delta tables) and Credential Vending GA for tables / Public Preview for Volumes (M2M OAuth replaces PATs; engine-side auto-refresh closes the long-running-pipeline gap). Three new system pages (systems/uc-managed-tables — the open-API-accessible managed Delta table primitive with Predictive Optimization + Liquid Clustering + catalog commits; systems/uc-credential-vending — the credential-vending API with M2M OAuth + auto-refresh; systems/delta-kernel — the open-source Java + Rust library abstracting the Delta protocol so engines focus on UC integration; plus first wiki disclosure of systems/duckdb as a UC-integrated external engine). Four new concept pages (concepts/credential-vending — first wiki canonicalisation of the catalog-mediated short-lived-scoped-credential primitive; concepts/catalog-managed-commits — first wiki canonicalisation of the central commit-coordinator substrate that prevents log corruption + provides audit + enables multi-table transactions; concepts/external-engine-write-to-managed-table — first wiki canonicalisation of the architectural shape that resolves the managed-table-benefits + compute-engine-choice trade; concepts/m2m-oauth-vs-pat — first wiki canonicalisation of the auth-substrate comparison naming the three structural failure modes of PATs that M2M OAuth dissolves). 
Three new pattern pages (patterns/credential-vending-for-external-engine-access — auth-side deployment pattern with the two-layer short-lived- credential shape; patterns/catalog-managed-commits-for-external-write-safety — commit-side deployment pattern naming serialized commits + audit + multi-table transaction substrate as the three property guarantees; patterns/connector-library-as-protocol-abstraction — first wiki canonicalisation of the library-shape pattern with Delta Kernel as canonical instance, naming the two structural failure modes that protocol-abstraction libraries eliminate (drift across implementations + per-engine reimplementation cost)). Updated five existing pages: systems/unity-catalog gains a ninth canonical face (Open API + external-engine-write hub) with full architectural-primitives disclosure; systems/delta-lake gains a ninth canonical face (external- engine-managed-write substrate) — managed Delta tables now writeable by Spark / Flink / DuckDB via Delta Kernel under UC catalog commits; systems/apache-spark + systems/apache-flink gain UC-Managed-Table-writer faces; systems/unity-catalog-volumes composes via Volume Credential Vending (Public Preview). Forward-roadmap composition with ABAC for external reads: the post explicitly names ABAC for external reads as developed-but-not-yet-GA, which when shipped will compose the 2026-05-13 ABAC GA primitives with the 2026-05-14 external-access primitives — fine-grained row + column-level governance enforced uniformly across first-party and external read paths. PepsiCo testimonial (Sudipta Das, Director of Enterprise Data Operations) frames the customer payoff. Activation contract: preview-portal enrollment + metastore-level toggle + schema-level
EXTERNAL_USE_SCHEMA grant + Delta-Spark 4.2 / UC 0.4.1 version pinning. -
2026-05-13 — sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library (Tier-3 Databricks Blog co-marketing post — Databricks GenAI MVP customer story on Claroty's AI-Powered CPS Library, the asset-identity layer behind Claroty's xDome CPS-protection platform). Borderline-include: heavy GenAI-MVP framing (~70%) but the Under the Hood / Data Engineering at Scale / Multi-Agent Intelligence / Innovation through Databricks Capabilities sections (~30%) name a specific architectural shape worth canonicalising — a hybrid Entity Resolution pipeline combining classical ER with an orchestrated multi-agent system (NLP / Reasoning / Human-in-the-loop), all on the Medallion Architecture over Delta Lake with Delta Change Data Feed driving a dynamic mapping registry, plus a real production observation about vector-search endpoints lacking scale-to-zero for bursty workloads. One new system page (systems/claroty-cps-library — Claroty's xDome asset-identity layer at 17 million+ asset catalog scale; CPS-ID positioned as "the new industry standard for cyber-physical system identity"). Three new concept pages (concepts/entity-resolution — first wiki canonicalisation of ER as an architectural problem class with the noisy-real- world-data-into-single-source-of-truth shape; concepts/delta-change-data-feed — first wiki canonicalisation of Delta CDF as the layer-transition trigger driving Bronze → Silver promotion via dynamic mapping-registry application; concepts/vector-search-no-scale-to-zero — first wiki canonicalisation of the explicit production-cost observation that hosted vector-search endpoints currently lack scale-to- zero, structurally analogous to concepts/gpu-scale-to-zero-cold-start but at the index altitude). 
Two new pattern pages (patterns/hybrid-classical-er-plus-genai — the central architectural shape combining battle-tested classic ER with GenAI cognitive parsing as complementary halves; patterns/orchestrated-multi-agent-entity-resolution — the three-role decomposition NLP / Reasoning / HITL with the feedback loop closing into model retraining). Five existing systems extended: systems/delta-lake gains a Delta-CDF
+ schema-evolution + time-travel audit-chain face;
systems/lakebase gains the transactional ER store face
(sixth-or-later wiki face — Postgres constraints load-bearing
for asset-mapping data integrity); systems/databricks-apps
gains the HITL UI for Entity Resolution face composed
with Lakebase as state store; systems/databricks-model-serving
gains the substrate for hosting domain-specific embedding
models as custom endpoints face (third wiki face — beyond
the 200K-QPS Superhuman LLM-serving face); systems/mlflow
gains the continuous-production-monitoring substrate against
concept drift via LLM-as-a-Judge face; systems/lakeflow-jobs
gains the CSAF security-advisory ETL with
ai_query + step-by-step Delta tables face; systems/databricks-ai-functions gains the second canonical instance (Claroty CSAF pipeline alongside MapAid groundwater pipeline). Three existing concept pages extended: concepts/medallion-architecture gains a third canonical instance (Bronze raw → mapping-registry-canonicalised silver schema for ER); concepts/llm-as-judge gains a third face (production-monitoring against concept drift in CPS data with conservative pass/fail/unknown ternary explicit against missing ground-truth); concepts/schema-evolution gains the audit-chain-enabler-with-time-travel face. One existing system extended: systems/unity-catalog gains an eighth wiki face — Entity-Resolution catalog governance, the audit-chain anchor connecting raw evidence → mapping-registry version → canonical CPS-ID → vulnerability attribution. Operational numbers cited: 17M+ assets in the global catalog; 88% of CPS assets do not transmit an exact product code; 76% transmit codes that differ from vendor records; 25% improvement in vulnerability identification accuracy; 56% of analysed devices receive new or updated security recommendations for previously-invisible outdated firmware; worked example "Rockwell Automation 1769-L36ERMS/B → Compact GuardLogix 5370 → CVE-2020-6998". Architectural composition summary: the same Lakebase + UC + Apps + Model Serving + MLflow primitives that compose into the clinical-operations decision-support shape (FPW / 2026-05-13 clinical-ops source) compose differently into the Entity Resolution catalog shape here — same primitives, different role-assignment (Lakebase is ER asset store not app shortlist store; Apps is SME HITL UI not clinician decision-support UI; Model Serving hosts custom medical/OT embedding endpoints not LLMs at 200K QPS). The recurring shape: a small set of Databricks platform primitives composing into very different vertical workflows by re-assigning their roles.
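The change-feed-driven Bronze → Silver promotion described in this entry can be sketched as follows. This is a minimal in-memory illustration, not Claroty's pipeline: plain dicts stand in for Delta Change Data Feed rows (`_change_type` and `_commit_version` are the standard CDF metadata columns), while the registry contents, the `raw_code` / `asset_id` column names, and `promote_to_silver` are hypothetical. The single registry entry reuses the worked example quoted above.

```python
# Hypothetical mapping registry: raw device string -> canonical product name.
MAPPING_REGISTRY = {
    "1769-L36ERMS/B": "Compact GuardLogix 5370",
}


def promote_to_silver(
    change_rows: list[dict], registry: dict[str, str]
) -> tuple[list[dict], list[dict]]:
    """Split change-feed rows into canonicalised Silver rows and rows needing review."""
    silver: list[dict] = []
    review: list[dict] = []
    for row in change_rows:
        # Only the new state of a row matters for promotion; pre-images
        # and deletes are skipped in this sketch.
        if row["_change_type"] not in ("insert", "update_postimage"):
            continue
        canonical = registry.get(row["raw_code"].strip().upper())
        out = {
            "asset_id": row["asset_id"],
            "raw_code": row["raw_code"],
            "canonical": canonical,
            "commit_version": row["_commit_version"],
        }
        # Unmapped codes would go to the GenAI / human-in-the-loop path.
        (silver if canonical else review).append(out)
    return silver, review


bronze_changes = [
    {"asset_id": "a1", "raw_code": " 1769-l36erms/b ", "_change_type": "insert", "_commit_version": 7},
    {"asset_id": "a2", "raw_code": "OLD-CODE", "_change_type": "update_preimage", "_commit_version": 8},
    {"asset_id": "a2", "raw_code": "UNKNOWN-X", "_change_type": "update_postimage", "_commit_version": 8},
]
silver_rows, review_rows = promote_to_silver(bronze_changes, MAPPING_REGISTRY)
```

Carrying `commit_version` onto each Silver row is one way to keep the raw evidence → mapping-registry version → canonical identity chain queryable afterwards.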
- 2026-05-13 — sources/2026-05-13-databricks-clinical-operations-intelligence-belongs-on-the-lakehouse (Tier-3 Databricks Blog — open-source release of the Site Feasibility Workbench as a Databricks App for clinical-trial site selection, framed as a reference implementation of "clinical operations intelligence when the application, the models, and the data live on the same platform."). Borderline-include: ~30% architecture density (the Architecture Argument + Auditability Argument sections name specific platform primitives and articulate a concrete architectural thesis), ~70% clinical-trial industry context. Two new system pages (systems/databricks-apps — first wiki disclosure of Apps as a deployment model; systems/site-feasibility-workbench — open-source reference implementation, FastAPI + React, ~30 min deployment) plus two new concept pages (concepts/single-platform-application-architecture — the unified-platform thesis that eliminates four integration layers (sync pipeline + credential surface + RBAC translation + semantic harmonisation); concepts/governed-shap-attribution-table — per-prediction SHAP attributions stored as governed UC Delta tables, versioned in MLflow, lineaged through UC, queryable in SQL) plus two new pattern pages (patterns/in-workspace-app-as-decision-support — the deployment shape; patterns/shap-attribution-as-governed-delta-table — the regulated-ML audit pattern). Architectural thesis: "Databricks Apps, Lakebase, and AI/BI Genie eliminate each of those layers — not by abstracting them away but by making them unnecessary." The single-platform shape is a four-primitive composition: Apps (web tier inside the workspace, service-principal auth, SQL Statement API path to UC, REST API to Genie, all internal connections); Lakebase (operational app state, scale-to-zero, workspace-credentialed); UC (governance + lineage + RBAC the app inherits for free, plus the substrate for governed SHAP attributions); Genie (embedded NL-query layer). Sixth Lakebase face on the wiki (after CMK / LangGuard agentic-OLTP / Stripe-Projects agent-provisioning / Backstage state-heavy-app / FPW image-generation-pushdown): app-tier-state-store-without-its-own-credential-surface. Seventh UC face: data-plane half of the single-platform application architecture. New AI/BI Genie face: embedded NL-query layer composed via internal REST API, not a separate Genie-room product. New MLflow face: the versioning leg of the regulated-ML SHAP-attribution-Delta-table audit substrate (anchors per-prediction attributions to the exact model version, addressing 21 CFR Part 11 + ICH E6(R3) + FDA GMLP audit-chain integrity). New Delta Lake face: ML-audit substrate for governed per-prediction SHAP attributions with the four load-bearing properties (ACID + schema evolution + SQL queryability + time-travel). Architectural reframe of explainability as fairness control: per the source, "sponsors can audit recommendations for systematic under-weighting of community sites, minority-serving institutions, or first-time investigators — turning explainability into a fairness control" — the substrate enabler is the queryable-attribution population, not the per-prediction explainer service. Reference implementation deploys "into an existing Databricks workspace with Unity Catalog in approximately 30 minutes of technical deployment time." Forward roadmap (named in post): three additional Databricks Apps (Patient Cohort and Recruitment, Enrollment Velocity Optimizer, Risk-Based Monitoring and Compliance) on the same shape — "All four deploy as Databricks Apps. All four query Unity Catalog directly. None make external API calls."
- 2026-05-13 — sources/2026-05-13-databricks-abac-row-filtering-and-column-masking-policies-governed-tags
(Tier-3 Databricks Blog — GA announcement for three Unity
Catalog governance primitives co-designed into one
organize → detect → protect pipeline). Three new system pages
(UC ABAC, UC
Governed Tags, UC
Data Classification) plus four new concept pages
(concepts/governed-tag, concepts/agentic-data-classification,
concepts/separation-of-duties-data-governance,
concepts/session-identity-evaluation) and two new pattern pages
(patterns/tag-driven-attribute-based-access-control,
patterns/single-variant-udf-for-multi-type-masking)
canonicalise the GA primitives. Sixth wiki face for Unity
Catalog: not just where governance is recorded but where it
is expressed, evaluated, and enforced. Architectural shift is
from per-object configuration of row filters / column masks
("repetitive and prone to inconsistency") to declarative
tag-driven policy evaluation — one ABAC policy referencing `pii:ssn` covers every column carrying that tag in scope, including columns added or tagged after policy authoring. Three GA scaling enhancements: (1) policy limits grew 10× across every scope (10K+ per metastore, 100+ per catalog/schema); (2) session identity evaluation for views and functions evaluates against the user running the query, not the view creator (closes view-as-bypass failure mode); (3) single VARIANT UDF can mask `INT` / `DOUBLE` / `DECIMAL` / `STRUCT` columns at once via type-erasure (patterns/single-variant-udf-for-multi-type-masking). Built-in data classifiers cover GDPR / HIPAA / GLBA / DPDPA / PCI plus UK / Germany / Australia / Brazil regional packs (India + Canada coming May 2026). Custom classifiers in Beta learn detection patterns from already-tagged columns + Unity Catalog metadata, with human-in-the-loop false-positive exclusion feeding back to improve precision. Separation-of-duties (concepts/separation-of-duties-data-governance) emerges from the three-permission split across the primitives: governance teams hold tag-taxonomy `MANAGE` / `CREATE`, stewards hold tag `APPLY`, data producers hold table `OWNER` — three roles operating on three permission axes without cross-team blocking.
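The shift from per-object masks to tag-driven policy can be illustrated with a toy model. This is a minimal sketch, not Unity Catalog's API: `Column`, `MaskPolicy`, `covered_columns`, `read_value`, and the group names are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    table: str
    name: str
    tags: set = field(default_factory=set)


@dataclass
class MaskPolicy:
    tag: str            # governed tag the policy keys on, e.g. "pii:ssn"
    exempt_groups: set  # groups allowed to see the raw value


def covered_columns(policy, catalog):
    """Columns in scope are found by tag at evaluation time, not listed in the policy."""
    return [c for c in catalog if policy.tag in c.tags]


def read_value(policy, column, value, user_groups):
    """Mask the value unless the querying user belongs to an exempt group."""
    if policy.tag in column.tags and not (user_groups & policy.exempt_groups):
        return "***"
    return value


catalog = [
    Column("customers", "ssn", {"pii:ssn"}),
    Column("customers", "city"),
]
policy = MaskPolicy(tag="pii:ssn", exempt_groups={"compliance"})

# A column tagged after the policy was authored is covered with no policy change.
catalog.append(Column("employees", "national_id", {"pii:ssn"}))
covered = [(c.table, c.name) for c in covered_columns(policy, catalog)]
```

The point of the sketch is that the policy never names a column: coverage follows the tag, which is why one policy keeps applying as new columns are tagged.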
Architectural composition with prior wiki framings: (a) UC ABAC is ABAC at the table-storage governance altitude, distinct from the prior-canonical Convera + Cedar API-authorization altitude; (b) UC Governed Tags is the data-warehouse-catalog altitude of concepts/data-classification-tagging, distinct from Figma FigTag's application-schema altitude and Meta Policy Zones' runtime-IFC-annotation altitude; (c) the organize → detect → protect pipeline mechanically fills in the "data classification with governed tags + row/column-level controls" governance contract that the multimodal-healthcare ingest named for patterns/governed-delta-tables-per-modality; (d) broadens the principal surface Genie operates over (the GA post explicitly cites Genie as a driver — agents make per-table per-user wiring even less tractable, motivating tag-driven policy). Borderline Tier-3 (GA announcement, ~70% architectural / ~30% PR) — included because the governance-architecture content (organize → detect → protect with no handoff, ABAC over governed tags, separation of duties, session identity, VARIANT UDF type-erasure) is substantive even though implementation depth is light. Customer testimonials: Atlassian (Gerald Nakhle, operational-overhead reduction), Udemy (Rajit Saha, "Fewer policies, lower costs, surgical precision"), Superhuman (Nan Wu, on agentic Data Classification "replaces manual overhead with automated, high-quality results").
- 2026-05-11 — sources/2026-05-11-databricks-unlocking-the-archives
(Tier-3 Databricks Engineering / Databricks-for-Good post —
first wiki disclosure of AI Functions (`ai_query`) used as the universal inference primitive across three pipeline stages, plus first instance of Databricks Asset Bundles and Foundation Model API on the wiki). MapAid + SUDAAK groundwater archive: ~700 scanned PDFs / 5,570 pages of decades-old, mixed English/Arabic Sudanese geological surveys turned into a structured catalog + 299 well/borehole records via a multimodal classification + judge + extraction pipeline. Three architectural primitives disclosed: (1) Visual-First Document Extraction — page-as-image to multimodal `ai_query`, OCR replaced by visual understanding (concepts/multimodal-document-understanding + patterns/visual-first-document-extraction). (2) Two-Pass Classify-then-Deep-Extract with Intelligent Sampling — sample title pages / intros / conclusions in pass 1 for >70% volume reduction; full-page OCR + entity-anchored linking + JSON extraction only on the ~50% water-flagged subset in pass 2 (patterns/two-pass-classify-then-deep-extract). (3) LLM-as-Judge as a First-Class Pipeline Stage — every classification scored on accuracy / completeness / consistency rubric with categorical rating + written justification (audit trail), sub-threshold rows routed to manual review (patterns/llm-judge-as-inline-pipeline-stage). Also canonicalises schema-constrained LLM output, [[patterns/sql-native-multimodal-llm-inference|SQL-native multimodal inference]], and Asset Bundle single-command deployment. First-run numbers: 654 docs / 5,570 pages in <3 hours, 95% rated excellent/good by inline judge, 299 structured records extracted. Borderline Tier-3 (Databricks-for-Good customer story) — included because the architectural primitives generalise. Eighth Databricks-platform face on the wiki.
-
2026-05-08 — sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie (Tier-3 Databricks Engineering post — first mechanism-level disclosure of Databricks Genie's internal architecture beyond the product-name level, building on the 2026-04-29 Trinity Industries adoption case). Three named architectural advances: (1) Specialised Knowledge Search — Genie "uses the existing data assets such as workspace tables, notebooks, dashboards, documents, and files to derive a rich semantic enterprise context and then uses this context to construct a search index. It uses multiple search indices in parallel together with rich metadata signals." Disclosed result: "up to 40% improvement on table discovery benchmarks" (Figure 4). Canonicalised as concepts/specialized-knowledge-search + patterns/semantic-context-grounded-search-index over the rich semantic enterprise context substrate. (2) Parallel Thinking — sample multiple agent trajectories over the same query, aggregate findings; "significant accuracy improvement" (Figure 5) on GPT-5.4 + Opus-4.6 baselines. Canonicalised as concepts/parallel-thinking-trajectory-sampling + patterns/parallel-trajectory-sampling-and-aggregation, positioned as the structural compensation for the verifiable-test gap unique to data agents. (3) Multi-LLM — different LLMs per sub-agent (planning / search / code-gen / judges) with GEPA-optimised prompts per (LLM, sub-task) pair; combined effect "significantly reduce costs and latency" simultaneously with the accuracy gain. Canonicalised as concepts/multi-llm-sub-agent-routing + patterns/llm-per-subagent-with-optimized-prompts. Four-phase trajectory disclosed via worked example (CFO question about contradictory revenue dashboards): (1) parallel multi-agent data discovery, (2) data investigation (SQL extraction + comparative + root-cause), (3) self-correction loop / reconciliation (concepts/agent-self-correction-loop), (4) verification.
Canonicalised as patterns/four-phase-data-agent-trajectory. Headline operational result: Genie 32% → over 90% accuracy vs "a leading coding agent" (name not disclosed) on Databricks' internal benchmark of real-world data-analysis tasks — claimed simultaneously on accuracy + cost + latency. First wiki naming of the data-agent vs coding-agent distinction (concepts/data-agent-unique-challenges) with three structural challenges: scale of data discovery, source-of-truth disambiguation (concepts/source-of-truth-disambiguation), and the verifiable-test gap. The post also makes the load-bearing dependency on upstream governance discipline mechanically precise — Genie derives its semantic context from existing workspace assets, so the prior Trinity-Industries empirical disclosure (effectiveness depends on prior measure-consolidation work) is now an architectural property: if the data layer hasn't disambiguated, Genie's specialised search has nothing rich to ground on. Architectural-canonicalisation contributions: (1) first mechanism-level Genie internals disclosure beyond Trinity adoption framing; (2) first wiki naming of data agent as a class structurally distinct from coding agent, with three uniquely-data-agent challenges; (3) first wiki canonicalisation of parallel thinking via trajectory sampling + aggregation as a named agent-design technique compensating for missing oracles; (4) first wiki canonicalisation of Multi-LLM sub-agent routing as a per-sub-agent assignment shape (vs prior model-cascade / routing-layer framings); (5) first wiki canonicalisation of GEPA (systems/gepa-prompt-optimizer) as a referenced production prompt-optimisation tool; (6) first wiki canonicalisation of the four-phase data-agent trajectory shape distinct from coding-agent write-test-iterate loops; (7) first wiki canonicalisation of the self-correction loop as a load-bearing intra-trajectory mechanism; (8) first wiki canonicalisation of the "agent-architecture choices recover all three of 
accuracy + cost + latency simultaneously" Pareto move (counter to the typical assumption that sampling trades cost for accuracy). Caveats: specific (LLM, sub-agent) assignments not disclosed; trajectory count + aggregation strategy not disclosed; self-correction mechanism not disclosed; internal-benchmark composition not disclosed; "leading coding agent" baseline name not disclosed; no latency / QPS / cost numbers for Genie endpoints. Sibling to the prior 2026-04-29 Trinity Industries adoption ingest (Genie deployed empirically) and the 2026-04-29 Stripe Projects + Databricks launch (Lakebase as substrate for agents) and the 2026-05-05 / 2026-05-06 / 2026-05-07 / 2026-05-08 Databricks platform-engineering arc (Pantheon-Hydra observability + Spark-Connect serverless OLAP + Lakebase OLTP + Model-Serving managed inference). Together the five-article 2026-05 Databricks engineering window now spans observability + OLAP + OLTP + managed-inference + data-agent altitudes in nine days.
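Parallel thinking as described in this entry (sample several trajectories, then aggregate) can be sketched in a few lines. The source does not disclose the trajectory count or the aggregation strategy, so majority voting here is an assumption, and `parallel_think`, `stub_agent`, and the revenue figures are hypothetical stand-ins.

```python
from collections import Counter
from typing import Callable, Tuple


def parallel_think(run_trajectory: Callable[[str, int], str],
                   query: str, n: int) -> Tuple[str, float]:
    """Run n independent trajectories and return the most common answer
    together with the fraction of trajectories that agreed with it."""
    answers = [run_trajectory(query, seed) for seed in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


def stub_agent(query: str, seed: int) -> str:
    # Hypothetical stand-in for one agent run; seeds divisible by 3 go wrong.
    return "Q3 revenue: 4.1M" if seed % 3 else "Q3 revenue: 3.9M"


answer, agreement = parallel_think(stub_agent, "What was Q3 revenue?", 6)
```

The agreement ratio is one way to stand in for the missing oracle: with no test suite to run, a low ratio is a cheap signal that the answer deserves a verification pass.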
-
2026-05-08 — sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together (Tier-3 Databricks Engineering post — first canonical wiki disclosure of Databricks Model Serving internals at the platform-engineering altitude; Databricks' fourth architectural-retrospective in nine days, extending the 2026-05 Databricks engineering ingest window across the OLTP (Lakebase) + OLAP (Serverless Compute) + observability (Pantheon/Hydra) tiers and now into managed external inference). Joint engineering retrospective with Superhuman documenting the migration of the grammar-correction LLM (40M+ daily users, peak 200,000+ QPS, sub-1s p99, 4-9's reliability, ~50/50-token request shape) off a self-managed vLLM-on-L40S DIY stack onto Databricks Model Serving on H100. Two-layer co-engineering: Platform layer — EDS-driven P2C load balancer (replacing default Kubernetes round-robin which "degrades at higher QPS"),
`request_concurrency`-based asymmetric autoscaler (aggressive scale-up, conservative scale-down for anti-flapping), block-device lazy-loading container image with 4MB sectors cutting pod start from "several minutes to a few seconds". Runtime layer — FP8 quantisation (single largest win, up to +30% per-pod QPS) with attention (Q/K/V/output) + MLP projections on FP8 path, KV-cache quantisation explicitly disabled ("weight quantization was where the throughput wins came from and KV-cache quantization introduced its own quality tradeoffs that weren't worth pursuing for this workload"); per-channel FP8 scaling beating off-the-shelf per-tensor scaling at matched throughput; hybrid-precision toggle so attention quantisation can be flipped on/off via flag without architectural change; multiprocessing RPC server (+20%) addressing the CPU-bound regime small fast LLMs hit on H100; single-call C++ tensor manipulation in the CUDA-graph decode step + an async CPU-GPU scheduler overlapping batch-N post-processing with batch-N+1 forward pass. Net per-pod throughput: 750 → 1,200 QPS on H100 (+60%).
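The source names a P2C (power-of-two-choices) load balancer but gives no code. The following is a minimal sketch of the general technique only, assuming in-flight request count as the load signal; `pick_backend` and the pod names are hypothetical, and the EDS endpoint-discovery side is not modelled.

```python
import random
from typing import Dict


def pick_backend(in_flight: Dict[str, int], rng: random.Random) -> str:
    """Power of two choices: sample two distinct backends at random and
    send the request to the one with fewer in-flight requests."""
    a, b = rng.sample(sorted(in_flight), 2)
    return a if in_flight[a] <= in_flight[b] else b


rng = random.Random(0)
in_flight = {"pod-a": 0, "pod-b": 0, "pod-c": 0, "pod-d": 0}

# Toy run: requests never complete, so the counters only grow.
for _ in range(1000):
    in_flight[pick_backend(in_flight, rng)] += 1

total = sum(in_flight.values())
```

Unlike round-robin, the choice reacts to current load: a pod stuck on slow requests loses most pairwise comparisons and receives less new traffic until it drains.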
Architectural-canonicalisation contributions: (1) first wiki disclosure of Databricks Model Serving's internals beyond the product-name level; (2) first wiki disclosure of the CPU-bound regime for small fast LLMs with a named workload (Superhuman grammar correction); (3) first wiki canonicalisation of the managed-serving-without-giving-up-control division of responsibilities (customer owns model + quantisation + quality bar; platform owns runtime + LB + autoscaler + image substrate); (4) first wiki disclosure of KV-cache-quantisation-explicitly-off as the production-LLM-serving selective-FP8 boundary; (5) first wiki canonicalisation of the autoscaler-altitude anti-flapping primitive (asymmetric scale-up/scale-down to prevent the latency-spike flapping cycle); (6) first wiki canonicalisation of container-image cold start as a fourth serverless cold-start regime distinct from CPU-runtime-init / V8-isolate / GPU-weights-load. Joint shadow testing was the joint-engineering primitive used to tune autoscaler thresholds and validate quality on Superhuman's internal eval harness with zero quality regression. vLLM stayed in the toolchain post-migration as the prequantisation library (Superhuman's ML team prequantised the FP8 checkpoint "using vLLM's online quantization library"). Caveats noted: 200K-QPS / sub-1s-p99 / 4-9's claims are about the Superhuman endpoint specifically (not Databricks Model Serving in general); per-pod 1,200 QPS is at the 50/50-token shape on H100; quality validation is on Superhuman's internal eval harness (not a public benchmark); the L40S baseline was not separately re-tested with the new optimisations so some of the throughput improvement is enabled by H100's Transformer-Engine FP8 path.
Sibling to the prior 2026-05-05 Pantheon/Hydra, 2026-05-06 Spark Connect / Serverless Compute, and 2026-05-07 Lakebase FPW-elimination ingests, completing a four-article Databricks platform-engineering arc covering observability + OLAP-compute + OLTP + managed-inference altitudes.
-
2026-05-07 — sources/2026-05-07-databricks-how-lakebase-architecture-delivers-5x-faster-postgres-writes (Tier-3 Databricks Engineering post — fifth canonical Lakebase ingest + first mechanism-level disclosure of the pageserver's internals beyond the name-level framing that prior Lakebase sources established; Databricks' third architectural-retrospective in three days, extending the 2026-05 Databricks engineering ingest window across both the OLTP (Lakebase) and OLAP (Serverless Compute) altitudes). Thesis: classical Postgres's Full Page Write (FPW) primitive — which copies an entire 8 KB page into WAL the first time it's modified after each checkpoint so recovery can tolerate torn pages — is structurally redundant when compute is stateless and streams WAL to a Paxos-based safekeeper quorum (first wiki-explicit disclosure of the safekeeper's durability primitive). Verbatim: "Because there is no local-disk page to tear, the failure mode FPW was designed to prevent simply does not exist." But FPW had an incidental read-path role: its periodic images bounded delta-chain replay on the pageserver. Databricks solved this by pushing image generation down to the storage tier — the pageserver generates images when a page has accumulated more delta records than a configured threshold without an intervening image; compute sends only compact deltas. Three benefits named: network efficiency (94% WAL reduction), scalability (image generation shared across multiple pageservers in the background), optimal reads (per-page-change-rate cadence, not checkpoint-scoped). Quantified on HammerDB TPROC-C: 4 vCPU +20%, 16 vCPU 2.8×, 32 vCPU 4.5×+ (95,686 → 439,300 NOPM); WAL/transaction 58 KB → <4 KB (94% reduction); the pre-change flat 16v→32v scaling (95,832 → 95,686) canonicalises "compute resources were not being used because FPW was the bottleneck". Production customer (56 vCPU): WAL rate 30 MB/s → 1 MB/s (30× reduction). Read-path dividend: p99 −30% to −50%, p50 ~−30%. 
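The image-generation rule stated above (materialise an image once a page has accumulated more deltas than a threshold) bounds how many deltas a read ever has to replay. A toy sketch under stated assumptions: `PageHistory` is a hypothetical name, deltas are applied by byte concatenation purely for illustration, and the real pageserver does this work in the background rather than inline on append.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageHistory:
    image: bytes = b""                                 # last materialised page image
    deltas: List[bytes] = field(default_factory=list)  # deltas since that image

    def materialize(self) -> bytes:
        """Read path: replay every delta on top of the last image."""
        page = self.image
        for delta in self.deltas:
            page += delta  # toy "apply"; real WAL redo is page-format aware
        return page

    def append_delta(self, delta: bytes, threshold: int) -> None:
        """Write path: store the compact delta; fold the chain into a new
        image once more than `threshold` deltas have accumulated."""
        self.deltas.append(delta)
        if len(self.deltas) > threshold:
            self.image = self.materialize()
            self.deltas.clear()


page = PageHistory()
written = [bytes([i]) for i in range(10)]
longest_chain = 0
for delta in written:
    page.append_delta(delta, threshold=3)
    longest_chain = max(longest_chain, len(page.deltas))
```

Because the fold is triggered per page by its own change rate, hot pages get fresh images often and cold pages generate none, which is the "per-page-change-rate cadence, not checkpoint-scoped" property the entry cites.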
Synced Tables ingestion 17k → 62k rows/sec (3.6×). Rollout: "since late March" → globally active 2026-05-07 (~6-week window) via the existing Postgres `XLOG_FPW_CHANGE` WAL record mechanism — canonicalised as the patterns/live-wal-protocol-switch-via-xlog-fpw-change pattern (in-log feature flag using a pre-existing control record that both compute and storage already parse; zero customer restarts). New canonicalisations (8 new wiki pages): 4 concepts (concepts/postgres-full-page-write, concepts/torn-page, concepts/postgres-checkpoint, concepts/delta-chain-replay) + 2 patterns (patterns/image-generation-pushdown-to-storage, patterns/live-wal-protocol-switch-via-xlog-fpw-change) + 1 system (systems/hammerdb) + the source page. Extended (5 existing pages): systems/lakebase (new axis section + Seen-in entry), systems/pageserver-safekeeper (image-generation responsibility + Paxos-quorum framing + `XLOG_FPW_CHANGE` rollout mechanism), systems/postgresql (first wiki quantification of 15×-WAL-inflation FPW ceiling + `XLOG_FPW_CHANGE` as live-rollout vehicle), concepts/compute-storage-separation (fifth axis — enabler for structural elimination of durability primitives, not just relocation), concepts/wal-record-granularity (WAL also carries control records that enable atomic protocol-switch rollouts). This ingest completes Lakebase's five-axis canonicalisation: CMK encryption (2026-04-20) + bursty workloads (2026-04-27) + agent provisioning (2026-04-29) + PITR operations (2026-04-30) + performance engineering via storage-work-offload (2026-05-07). Companion post: "Zero-downtime patching: Lakebase Part 1 — prewarming" (earlier 2026-05 Databricks post on cache prewarming); common thread is "move heavy-lifting tasks away from your transactions and into our scalable background storage stack."
- 2026-05-06 — sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance (Tier-3 Databricks Engineering post — the second architectural-retrospective in two days extending the 2026-05 Databricks engineering ingest window).
Thesis quote: "Stability becomes a system property rather than a user responsibility, enabled by architectures that isolate workloads, intelligently place them, and dynamically adapt resources." Three systems compose into Databricks Serverless Compute: (1) Spark Connect — the gRPC client-server rearchitecture of Spark's driver model, framed as "the most significant architectural transformation in Spark's history". User application code no longer co-executes with the driver; queries travel as serialised logical plans. Unit of execution shifts from processes to queries. Canonical instance of concepts/client-server-decoupling + patterns/grpc-decoupled-driver-client. Enables Databricks' disclosed 25+ major Spark runtime upgrades per year with 99.998% success rate across >2 billion workloads (cited from SIGMOD/PODS '25 Breese et al. paper "Blink Twice: Automatic Workload Pinning and Regression Detection for Versionless Apache Spark using Retries"). (2) Serverless Gateway — workload-aware router combining three real-time signals (query size from logical plan + cluster utilisation + interactive-vs-batch latency profile) with continuous re-evaluation as conditions shift. Canonicalises patterns/multi-signal-workload-aware-gateway-routing and resolves the concepts/utilization-vs-predictability-tradeoff at the pool layer. (3) Serverless Autoscaler — two-axis adaptive autoscaler scaling "horizontally and vertically" (concepts/vertical-and-horizontal-autoscaling); OOM-aware VM-restart primitive (concepts/adaptive-oom-recovery + patterns/oom-aware-vm-restart-autoscaling) detects task OOM and restarts the task on a larger VM without job failure. Customer outcomes: CKDelta 20 min vs 4–5 hr (12–15× speedup), Unilever 2–5× faster + 25% cost reduction, HP 32% cloud savings + 36% runtime reduction. Paired with the 2026-05-05 Pantheon/Hydra ingest, this forms the 2026-05 Databricks architecture double-ingest breaking the late-April / early-May Tier-3 Databricks marketing skip streak (23 new wiki pages touched: 1 source + 4 systems + 6 concepts + 3 patterns + extensions to apache-spark + databricks + companies/databricks).
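The OOM-aware restart primitive in this entry (detect a task OOM, rerun the task on a larger VM, keep the job alive) can be sketched as a retry ladder. This is an illustrative sketch only: `TaskOOM`, `run_with_oom_retry`, `demo_task`, and the VM sizes are hypothetical, and the source does not disclose how the real autoscaler chooses the next size.

```python
class TaskOOM(Exception):
    """Raised by a task that ran out of memory on its current VM."""


def run_with_oom_retry(task, vm_sizes_gb):
    """Try the task on progressively larger VMs; only an OOM triggers a retry."""
    for size in vm_sizes_gb:
        try:
            return task(size), size
        except TaskOOM:
            continue  # restart the same task on the next larger VM
    raise TaskOOM("task exceeded the largest available VM size")


def demo_task(memory_gb):
    # Hypothetical task that needs at least 24 GB to finish.
    if memory_gb < 24:
        raise TaskOOM(f"needed more than {memory_gb} GB")
    return "done"


result, size_used = run_with_oom_retry(demo_task, [8, 16, 32, 64])
```

Catching only the OOM signal matters: other failures still surface to the job, so the retry ladder does not mask genuine bugs.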
- 2026-05-05 — sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring (Tier-3 Databricks Engineering post — the canonical architectural-retrospective that broke the Databricks 2026-05 skip streak). Scale datum: ~70 cloud regions / 3 major clouds / 160+ Pantheon instances / 5B active in-memory timeseries / 10T samples/day / 20B unaggregated timeseries in Hydra. Three coupled architectural responses to an order-of-magnitude growth in monitoring load: (1) Pantheon — a Thanos fork scaled to hyperscale with two Receive groups on distinct memory-retention tiers (2h / 30m), three isolated StatefulSets per group preserving quorum with stronger operational isolation, at-least-once block uploads (2 of 3 StatefulSets), and a purpose-built 3-controller control plane (Rollout Operator / Hashring Controller / Autoscaling + Self-Healing Controller) running "dozens of automations per week". (2) A Telegraf + Dicer aggregation shield sustaining >1 GB/s per region across thousands of rules, absorbing a 2-5× incident surge so Pantheon only saw 20%. Canonical instance of patterns/sticky-routing-for-aggregator-state (Kafka rejected for cost + latency). (3) Hydra — a lakehouse-native platform for raw high-cardinality troubleshooting data: Spark Structured Streaming + Auto Loader → Delta Lake with ~5 min freshness and 50× cheaper storage than Thanos, queryable from Grafana via a PromQL-to-SQL translation layer (canonical patterns/promql-to-sql-over-delta-tables) — so engineers' dashboards keep working unchanged while the substrate shifts fundamentally. Three canonical concepts promoted: concepts/metric-cardinality (primary TSDB scaling factor), concepts/tsdb-scaling-bottleneck (scale-ups as daily events), concepts/serverless-workload-churn-cardinality (tens of millions of VMs daily drives label churn).
Migration outcomes: "millions of dollars in annual cloud costs" saved, "~5× reduction in monitoring infrastructure downtime," many sources of manual toil eliminated. Unified metric semantics across Pantheon + Hydra — "engineers should not need to understand our ingestion architecture" — canonicalises the patterns/dual-tier-observability-tsdb-plus-lakehouse pattern at hyperscale. First canonical wiki instance of CUJs applied to observability-platform design.
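Sticky routing for aggregator state, as named in this entry, means every sample of a given series reaches the same aggregator so partial aggregates never need merging across nodes. The sketch below uses plain hash-modulo assignment as an assumption; the source does not describe how Dicer assigns keys, and `aggregator_for`, the aggregator names, and the sample data are hypothetical.

```python
import hashlib
from collections import defaultdict


def aggregator_for(series_key: str, aggregators: list) -> str:
    """Deterministically map a series key to one aggregator."""
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    return aggregators[int.from_bytes(digest[:8], "big") % len(aggregators)]


aggregators = ["agg-0", "agg-1", "agg-2"]  # hypothetical aggregator names
samples = [
    ("http_requests{svc=a}", 3),
    ("http_requests{svc=b}", 5),
    ("http_requests{svc=a}", 4),
    ("http_requests{svc=b}", 1),
]

# Each aggregator keeps running sums only for the series routed to it.
state = defaultdict(lambda: defaultdict(int))
for key, value in samples:
    state[aggregator_for(key, aggregators)][key] += value

# For every series, list the aggregators that hold any state for it.
owners = {key: [a for a in state if key in state[a]] for key, _ in samples}
```

Hash-modulo reshuffles most keys when the aggregator set changes; a production sharder would need a more stable assignment, which is outside what this sketch shows.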
- 2026-04-30 — sources/2026-04-30-databricks-backstage-with-lakebase
(Tier-3 Thoughtworks guest-post on Databricks Blog, Part
1 of a three-part series — Part 2: Governance, Part 3:
FinOps — forthcoming). Fourth canonical wiki source on
Lakebase after CMK (2026-04-20),
LangGuard (2026-04-27), and Stripe Projects (2026-04-29).
Thoughtworks runs a proof-of-concept ripping
Backstage (Spotify's state-heavy
Internal Developer Portal, endorsed on Thoughtworks'
Technology
Radar) off its standard Postgres database and pointing
it at Lakebase. Central thesis: when branching + PITR
become effectively free, two separate engineering
practices collapse into the same primitive
("branching is just PITR with source_branch_time =
now"), rearranging the developer cycle enough to
deprecate 20-30% of test code (mock objects). Load-bearing
operational disclosures: (a) Wire-protocol-Postgres
compatibility — "Because it speaks wire-protocol
Postgres, Backstage doesn't know or care that it
isn't talking to RDS"; Backstage's Knex migrations ran
cleanly, only `PgSearchEngine` had to be swapped for Backstage's default in-memory search. (b) Auth was the friction point — Lakebase rejects classic Databricks Personal Access Tokens and expects an OAuth JWT minted by `databricks postgres generate-database-credential`. Thoughtworks wrapped the command in a 50-minute cron rewriting `DATABRICKS_TOKEN` in `.env` — canonical patterns/credential-refresh-cron-as-auth-compat-shim. (c) First wiki disclosure of Lakebase branching throughput at MB-scale dataset granularity: a 63 MB Backstage catalog branch lands in 1.09 seconds data plane (control-plane ack was instant). Prior Lakebase sources disclosed only "seconds" (LangGuard) or "sub-350 ms" cold Postgres provisioning (Stripe Projects); this is the first wiki measurement separating control-plane ack from data-plane clone. (d) First wiki disclosure of Point-in-Time Recovery at Lakebase altitude: wipe of `final_entities` (32 rows → 0), then recovery branch from a pre-wipe timestamp, end-to-end in 3.78 seconds. Production still at zero during recovery (branches fully isolated). Canonical concepts/point-in-time-recovery. (e) WAL-record granularity disclosed: requested 22:56:02Z, got 22:55:50Z (12 seconds earlier) — PITR snaps backward to the nearest WAL record. Canonical concepts/wal-record-granularity as an "important caveat for time-sensitive recovery workflows." (f) Architectural unification: patterns/branching-is-pitr-with-time-now — same control-plane call, same storage substrate, same compute-attach step; only the time parameter differs. Latency envelopes confirm (1.09 s vs 3.78 s). (g) Branch API gotcha: request body must nest everything inside a `spec` object + explicitly specify `ttl`, `expire_time`, or `no_expiry` — branches are short-lived by default.
(h) Developer-cycle transformation thesis: paired with the numbers, the POC argues cheap branching deprecates 20-30% of test code (mock objects — "not test coverage, that's test infrastructure") and shifts schema-migration validation from staging-deploy to development. See concepts/mock-object-maintenance-cost + concepts/integration-tests-against-real-database + patterns/database-branch-per-test-over-mocking. The load-bearing claim is the cost-benefit flip: "When branching a production-equivalent database costs nothing, mocking becomes the expensive choice." Introduces two new systems (systems/backstage, systems/databricks-postgres-cli), one MVP system (systems/thoughtworks-technology-radar), three new concepts (concepts/point-in-time-recovery, concepts/wal-record-granularity, concepts/mock-object-maintenance-cost), two additional concepts (concepts/oauth-jwt-short-lived-credential, concepts/integration-tests-against-real-database), and three new patterns (patterns/branching-is-pitr-with-time-now, patterns/database-branch-per-test-over-mocking, patterns/credential-refresh-cron-as-auth-compat-shim). Extends systems/lakebase, concepts/database-branching, concepts/copy-on-write-storage-fork, concepts/compute-storage-separation. Caveats: Tier-3 single-vendor POC with Thoughtworks as guest author; 1.09-s and 3.78-s numbers are single-shot in a development environment, not production-scale benchmarks; 63 MB is a developer-IDP-scale dataset (scaling behaviour at GB/TB not disclosed); 12-second WAL snap-back is a function of the POC's write cadence, not a general guarantee; 50-minute cron refresh is a POC hack, not a production pattern; 20-30% mock-code claim is attributed to "multiple partner teams" without methodology; Part 1 of 3 — governance + FinOps content forthcoming.
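The WAL snap-back in disclosure (e) and the branching-as-PITR unification in (f) fit in one small function. A minimal sketch, assuming recovery resolves to the latest WAL record at or before the requested time: only the 22:56:02Z request and the 22:55:50Z result come from the source, while `snap_to_wal_record`, the calendar date, and the other record timestamps are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def snap_to_wal_record(wal_times, requested):
    """Return the latest WAL record time that is <= the requested time."""
    i = bisect_right(wal_times, requested)
    if i == 0:
        raise ValueError("requested time precedes the first WAL record")
    return wal_times[i - 1]


def at(hour, minute, second):
    # The calendar date is arbitrary; only the clock times matter here.
    return datetime(2026, 1, 1, hour, minute, second, tzinfo=timezone.utc)


# Sorted WAL record times; 22:55:50 is the record the source reports landing on.
wal_times = [at(22, 55, 31), at(22, 55, 50), at(22, 56, 10)]

requested = at(22, 56, 2)
restored = snap_to_wal_record(wal_times, requested)
snap_back_seconds = (requested - restored).total_seconds()

# Branching uses the same call with "now" as the time parameter.
branch_point = snap_to_wal_record(wal_times, at(23, 0, 0))
```

The snap-back size depends entirely on how recently the database wrote a WAL record, which is why the entry's caveat treats the 12 seconds as a property of the POC's write cadence.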
Cross-source continuity: fourth Lakebase source on the wiki; together the four sources now canonicalise Lakebase's compute-storage-separated architecture at four distinct altitudes — encryption (CMK), bursty-workload fit (LangGuard), agent-lifecycle substrate (Stripe Projects), developer-cycle transformation (Backstage). Tier-3 on-scope rationale: architecture + mechanism + numbers density >60% — explicit dataset-size + wall-clock numbers (63 MB, 1.09 s, 3.78 s, 12 s snap-back, 50-min cron), explicit mechanism disclosure (copy-on-write pointer semantics, WAL-record snap-back, branching-≡-PITR unification, OAuth-JWT vs PAT auth posture, `spec`-nested API with explicit lifetime), explicit integration trade-off discussion (mock deprecation, schema-migration-validation shift). Passes the 20% borderline-case threshold decisively. Not a product-launch post; it's a workflow-transformation retrospective by a consultancy.
- 2026-04-29 — sources/2026-04-29-databricks-and-stripe-projects-infrastructure-built-for-agents
(Tier-3 short joint launch post co-bylined by Brad Van
Vugt and Guillaume Rivals announcing Databricks as a
launch partner for Stripe
Projects — the agent-first CLI Stripe launched
2026-04-30. The Databricks side of the integration
exposes Neon Postgres databases (under the
Lakebase architecture name) as
agent-provisionable resources through the Stripe Projects
catalog, making Databricks the second launch-side
provider in the
agent-provisioning protocol after Cloudflare. Disclosed
operational datum: <350 ms for an agent-driven
production-ready Neon Postgres via the CLI — the first
wiki operational number for the protocol's per-request
latency envelope at the database-resource tier. Collapses
the prior Lakebase-vs-Neon distinction by naming both
interchangeably ("Lakebase architecture, developed by
Neon" + "bringing Neon databases seamlessly"). The
post articulates the three architectural pillars that
make agent-driven OLTP viable on this substrate:
serverless scale-to-zero, instant
database branching via
zero-copy cloning
for "safely test code, run migrations, or experiment
with new prompts against live data states", and
Postgres compatibility as an agent-ergonomic property
("agents understand Postgres better than any other
OLTP database"). Third canonical cross-source
confirmation of concepts/compute-storage-separation
as Lakebase's load-bearing property — new axis:
per-request compute lifecycle at agent-initiated cadence
("agents can create, build, and tear down OLTP
databases in seconds"). Introduces
concepts/agent-provisioned-database as a new concept
(database-tier sibling of
concepts/agent-provisioned-account) and adds Lakebase
as third known-use of
patterns/partner-managed-service-as-native-binding
(after Cloudflare/PlanetScale + Fly.io/Tigris) — first
agent-as-customer instance of that pattern, with a
payments-platform-orchestrator rather than a
compute-platform-orchestrator. Co-announces but does not
deep-dive the separate Stripe Data Pipeline × Databricks
Marketplace zero-ETL integration (deferred to sibling
post "Stripe data now available in Databricks"). Short
post, high marketing density; Tier-3 ingest-threshold
passed on (a) architectural density of three-pillar
paragraph >20%, (b) new operational datum <350 ms, (c)
first wiki record of second launch-partner in the
protocol, (d) new
agent-provisioned-database concept. Caveats: spend-cap / rate-limit / fraud-heuristic / orphan-cleanup policies not disclosed; <350 ms is typical-case single-shot, not concurrent-burst; Lakebase ↔ Neon naming collapsed; no internal-mechanism disclosure on any of the three architectural pillars beyond verbatim quotes. Introduces concepts/agent-provisioned-database. Extends systems/lakebase, systems/stripe-projects, concepts/scale-to-zero, concepts/compute-storage-separation, concepts/database-branching, concepts/copy-on-write-storage-fork, patterns/agent-provisioning-protocol, patterns/partner-managed-service-as-native-binding. Cross-source continuity: companion-launch-partner to sources/2026-04-30-cloudflare-agents-can-now-create-cloudflare-accounts-buy-domains-and-deploy (one-day earlier Cloudflare + Stripe launch of the protocol); third Lakebase source on the wiki after sources/2026-04-20-databricks-take-control-customer-managed-keys-for-lakebase-postgres|2026-04-20 CMK and sources/2026-04-27-databricks-inside-one-of-the-first-production-deployments-of-lakebase-langguard|2026-04-27 LangGuard; collectively these three sources canonicalise Lakebase's compute-storage-separated architecture at three distinct altitudes — encryption (CMK), bursty-workload fit (LangGuard), agent-lifecycle substrate (this post).)
- 2026-04-29 — sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics (Tier-3 Databricks product-engineering post announcing four new Apache DataSketches-backed sketch function families in Databricks SQL / DataFrame / Structured Streaming: KLL for percentiles, Theta for distinct-count with set algebra, Approximate top-K for heavy hitters, Tuple for distinct + metric aggregation. Canonicalises decision-support vs audit query as the architectural classifier that determines whether it's safe to accept 1–2% relative error in exchange for orders-of-magnitude compute reduction. Framing: "If knowing '~4.7M unique users ±1%' leads to the same decision as '4,712,389 unique users,' the approximate answer at a fraction of the cost is strictly better." The pattern contribution is sketch as Delta BLOB column — build once at ETL, merge on read in milliseconds — extending the pre-existing sketch-as-BLOB pattern (PlanetScale Insights) to the Delta Lake substrate and exposing it at the SQL aggregate level. Named concrete use cases: latency monitoring dashboards ("168 precomputed sketches" for a weekly trending view), audience-overlap marketing measurement via Theta union/intersection/difference (Super Bowl ad ∩ Instagram campaign), live clickstream leaderboards via approximate-top-K streaming micro-batch merges, unique-customer-plus-revenue composed aggregations via Tuple sketches. The post is explicit about the boundary: "When to stay exact: Financial auditing, compliance reporting, or any use case where regulatory or business requirements demand precise values." Critically positions mergeability — the associative merge operator — as the architectural unlock, turning sketches from "a faster percentile" into a storage primitive that converts dashboards from scans into reduces. Community contribution: Christopher Boumalhab (cboumalh on GitHub) implemented the Theta and Tuple sketch function families in upstream Apache Spark. Agent-surface continuity: Genie Code can recommend the right sketch family — Databricks' ongoing pattern of threading agentic surfaces through every new capability.
Introduces systems/apache-datasketches, concepts/kll-quantile-sketch, concepts/theta-sketch, concepts/approximate-top-k-sketch, concepts/tuple-sketch, concepts/mergeable-sketch, concepts/decision-support-vs-audit-query, patterns/precomputed-sketch-column-in-delta-table, patterns/set-algebra-on-theta-sketches. Extends concepts/probabilistic-data-structure, concepts/ddsketch-error-bounded-percentile, concepts/sketch-as-mysql-binary-column, concepts/local-global-aggregation-decomposition. Fifth Databricks-platform face on the wiki (after Redpanda/Iceberg analytics, Santander integration, Zalando Spark workload, multimodal-substrate, DICER auto-sharder/MLflow/Lakebase, and AutoCDC declarative CDC) and canonicalises the Databricks-as-approximate-analytics-primitive-vendor framing. Vendor post; benchmark claims self-reported ("1000× speedup", "orders of magnitude less compute") and not independently verified.)
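The mergeability claim in the entry above (build once at ETL, merge on read) can be illustrated with a K-Minimum-Values sketch, a simplified relative of the Theta sketch. This is a minimal sketch under stated assumptions: the class `KmvSketch` and its methods are hypothetical names written for this page, not the Apache DataSketches API and not the Databricks SQL functions.

```python
import hashlib


class KmvSketch:
    """Toy K-Minimum-Values distinct-count sketch (hypothetical, for illustration)."""

    def __init__(self, k=256, hashes=None):
        self.k = k
        self.hashes = set(hashes or [])  # the k smallest hash values seen so far

    @staticmethod
    def _hash(item):
        # Map each item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        self.hashes.add(self._hash(item))
        if len(self.hashes) > self.k:
            self.hashes.remove(max(self.hashes))  # keep only the k smallest

    def merge(self, other):
        # Union: the k smallest hashes across both sketches. The operation is
        # associative and commutative, which is what lets precomputed sketches
        # be combined at read time in any order.
        k = min(self.k, other.k)
        return KmvSketch(k, sorted(self.hashes | other.hashes)[:k])

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # never overflowed, so the count is exact
        return (self.k - 1) / max(self.hashes)  # standard KMV estimator


def build(items, k=256):
    sketch = KmvSketch(k)
    for item in items:
        sketch.add(item)
    return sketch


# Two overlapping "hourly" sketches merged at read time.
hour_1 = build(range(0, 3000))
hour_2 = build(range(2000, 5000))
combined = hour_1.merge(hour_2)
print(round(combined.estimate()))  # approximate count of the 5,000 distinct ids
```

Merging the two hourly sketches yields exactly the sketch that a single pass over all the data would have produced, so a weekly view only has to merge its stored hourly sketches rather than rescan raw rows.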
- 2026-04-27 — sources/2026-04-27-databricks-inside-one-of-the-first-production-deployments-of-lakebase-langguard (Tier-3 Databricks case study profiling LangGuard — one of the first startups building its production governance engine on systems/lakebase. Introduces systems/langguard as a runtime enforcement layer for enterprise agentic workflows and GRAIL as its patent-pending live-knowledge-graph governance fabric. Canonicalises the three-property Lakebase fit for bursty agentic workloads: (1) serverless autoscaling + concepts/scale-to-zero between bursts — agent workflows are dormant for hours, then hundreds of trace writes + enforcement reads in seconds; (2) millisecond reads via compute-local cache for hot indexed lookups against GRAIL context + policy tables keeps governance off the agent's critical path; (3) instant copy-on-write database branching for testing new governance policies against real production trace data in seconds without risking the live environment — the first canonical wiki instance of DB branching at governance-policy-validation altitude (distinct from schema-change-testing and dev-sandbox uses). LangGuard's team previously built IBM QRadar (SIEM at petabytes/day); the QRadar-era lesson cited verbatim: "database architecture is destiny" for bursty security-telemetry workloads; coupled-compute/storage Postgres forced provisioning for peak and paying for idle around the clock. Stated roadmap: train behavioral baselines on historical GRAIL trace data via MLflow to move from reactive runtime enforcement to predictive governance; co-location of operational trace data with the analytical platform removes the ETL barrier that normally separates the two. Enterprise workflow scope cited: "tens of coordinated agents, hundreds of tool invocations, multiple foundation models, and policies managed across fifteen or more enterprise Systems of Record" — ServiceNow, IAM/IDP, Salesforce, Workday, Wiz, CrowdStrike, TalkDesk, MCP Gateways, API Gateways. 
Introduces systems/langguard, systems/grail-data-fabric, concepts/agentic-workflow-governance, concepts/runtime-policy-enforcement, concepts/agent-behavioral-baseline, patterns/runtime-governance-enforcement-layer, patterns/policy-testing-via-database-branching; extends systems/lakebase, concepts/compute-storage-separation, concepts/database-branching, concepts/copy-on-write-storage-fork, concepts/scale-to-zero, concepts/bursty-query-pattern. Tier-3 Databricks — borderline product-case-study post (joint vendor narrative with LangGuard), ingested because architectural content is substantive (~60% of body: bursty-workload shape, compute/storage-separation payoff for scale-up with no data movement, copy-on-write branching for policy testing, QRadar lineage as empirical prior) even though ~40% is product positioning. No concrete latency/throughput numbers disclosed — GRAIL schema + query patterns undisclosed; predictive governance is roadmap, not shipped.)
- 2026-04-22 — sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines
(Tier-3 Databricks product-engineering post pitching
AutoCDC inside
Lakeflow SDP
as the declarative replacement for hand-rolled
MERGE logic in CDC / SCD pipelines. Names the four structural pain sources of hand-rolled CDC (out-of-order updates, duplicate events, delete application, idempotency across retries) and maps each to a declarative API parameter (keys, sequence_by, apply_as_deletes, stored_as_scd_type). Canonicalises sequence_by as the load-bearing primitive for out-of-sequence CDC event handling — a concern separate from idempotency that hand-rolled pipelines routinely get wrong (arrival-order-as-logical-order assumption, dedup without ordering, window-function dedup without retry-safety). Three input modes supported: native Change Data Feed (CDF), CDF with SCD Type 2 history, and snapshot-diff inference (the third canonical CDC ingest shape on this wiki after log-based capture and CDF). Code footprint disclosure: 6–10 lines AutoCDC vs 40–200+ lines hand-rolled MERGE; Fortune 500 Aerospace & Defense adopter quote: "4 lines of code could replace what I was doing in 1,500 lines of code before." Databricks Runtime performance improvements since Nov 2025: 71% better perf-per-dollar on SCD Type 1, 96% on SCD Type 2 — propagated universally to all AutoCDC pipelines because the declarative API lets Runtime-level optimisations apply across the fleet. Named regulated-vertical adopters: Navy Federal Credit Union (billions of events/day real-time event processing), Block (pipeline dev time: days → hours), Valora Group (Swiss foodvenience retail master-data CDC). Rhetorical framing: Databricks argues LLM codegen does not solve CDC correctness ("LLMs can generate code, but they don't understand your data") and positions Genie Code as the AI codegen client that produces AutoCDC declarations rather than raw MERGE — LLM correctness envelope becomes equal to the declarative API's bounded envelope. Reinforces MERGE over INSERT OVERWRITE as the runtime primitive — AutoCDC displaces hand-authoring, not MERGE itself.
Introduces systems/databricks-autocdc, systems/databricks-genie-code, concepts/snapshot-diff-inference-cdc, concepts/out-of-sequence-cdc-event-handling, patterns/declarative-cdc-over-hand-rolled-merge. First wiki source on declarative CDC as a distinct pattern axis separate from the CDC-ingest-mode taxonomy (log-based vs CDF vs snapshot-diff). Tier-3 Databricks — ingested because architectural framing is substantive (≥50% of body is declarative-vs-hand-rolled tradeoff + API-parameter semantics + before/after code + perf / code-reduction numbers + named adopters); trips no skip signals.)
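The four pain sources named in the entry above (out-of-order updates, duplicates, deletes, retry idempotency) can be sketched as one sequence-guarded upsert. This is a minimal illustration under stated assumptions: `apply_changes` and `live_rows` are hypothetical helpers written for this page, and only the argument names `key` and `sequence_by` echo the parameters the post describes. It is not the AutoCDC API.

```python
def apply_changes(target, events, key, sequence_by, is_delete):
    """Apply CDC events to an SCD Type 1 style target.

    target maps key -> (sequence, row_or_None). A None row is a tombstone,
    kept so that a late-arriving older update cannot resurrect a deleted key.
    """
    for event in events:
        k, seq = event[key], event[sequence_by]
        current = target.get(k)
        if current is not None and current[0] >= seq:
            continue  # stale (out-of-order) or duplicate event: ignore it
        target[k] = (seq, None if is_delete(event) else event)
    return target


def live_rows(target):
    """Return only the keys that are not tombstoned."""
    return {k: row for k, (_, row) in target.items() if row is not None}


events = [
    {"id": 1, "seq": 2, "op": "UPDATE", "name": "b"},
    {"id": 1, "seq": 1, "op": "INSERT", "name": "a"},  # arrives late
    {"id": 2, "seq": 1, "op": "INSERT", "name": "x"},
    {"id": 2, "seq": 3, "op": "DELETE"},
    {"id": 2, "seq": 2, "op": "UPDATE", "name": "y"},  # arrives after the delete
    {"id": 1, "seq": 2, "op": "UPDATE", "name": "b"},  # duplicate delivery
]

state = apply_changes({}, events, key="id", sequence_by="seq",
                      is_delete=lambda e: e["op"] == "DELETE")
print(live_rows(state))  # {1: {'id': 1, 'seq': 2, 'op': 'UPDATE', 'name': 'b'}}
```

Because every write is guarded by the sequence comparison, replaying the same batch after a retry leaves the state unchanged, which is the idempotency property that arrival-order logic does not provide.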
- 2026-04-22 — sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai
(Tier-3 Databricks healthcare-vertical post that is
architecturally non-trivial: names the
specialty-store-per-modality failure mode (FHIR store + omics
store + imaging store + vector store — duplicated governance,
brittle cross-store joins) and positions the
lakehouse-as-multimodal-substrate as its remedy. Every
modality (genomics, imaging
features, clinical-notes entities, wearables aggregates)
lands in governed Delta tables under
one Unity Catalog governance
surface; modality-specific tooling —
Glow (VCF / BGEN / PLINK → Delta),
Mosaic AI Vector Search
over imaging-derived feature embeddings,
Lakeflow SDP
(@dp.table + @dp.materialized_view for wearables streaming) — sits above the substrate rather than beside it. Canonicalises the four-fusion-strategy taxonomy (concepts/early-fusion / concepts/intermediate-fusion / concepts/late-fusion / concepts/attention-based-fusion) paired with deployment-reality triggers — "match fusion to your deployment reality: modality availability patterns, dimensionality balance, and temporal dynamics" — and names the missing-modality problem ("missingness isn't an edge case — it's the default") with three production responses (modality masking during training, sparse / modality-aware attention, transfer learning). Reproducibility story pinned to Delta time travel + CI/CD + MLflow experiment tracking. Introduces systems/databricks-glow, systems/lakeflow-spark-declarative-pipelines, systems/mosaic-ai-vector-search, patterns/governed-delta-tables-per-modality, patterns/fusion-strategy-selection-by-deployment-reality, concepts/early-fusion, concepts/intermediate-fusion, concepts/late-fusion, concepts/attention-based-fusion, concepts/missing-modality-problem, concepts/modality-masking-during-training. No production metrics disclosed (vendor post), no code snippets. Extraction scoped to the architectural content; healthcare-vertical specifics — tumor boards, trial matching, 28 CFR Part 202 tagging — preserved only where they illustrate a sysdesign point. First wiki ingest naming Lakeflow SDP, Glow, Mosaic AI Vector Search.)
- 2026-04-22 — sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization
(Databricks + UPenn research prototype applying a frontier
LLM agent as an offline
join-order tuner for the Databricks query engine. Names
the three-component optimizer decomposition (cardinality
estimator + cost model + search) and frames the LLM as
offline-only — too slow for the optimizer hot path, but
the perfect fit for the historically-human DBA tuning loop
(concepts/offline-query-tuning-loop). Architecture:
single tool
execute_plan(candidate) returning runtime + subplan sizes; rollout budget (50 prototype, 15 eval); grammar-constrained structured output admitting only valid join reorderings (patterns/structured-output-grammar-for-valid-plans); best-of-N selection (concepts/anytime-optimization-algorithm / patterns/rollout-budget-anytime-plan-search). Evaluation on JOB (113 queries, 10× IMDb): 1.288× geomean / 41% P90 — beating perfect cardinality estimates (because measurement beats estimation), smaller LLMs, and the classical BayesQO Bayesian-optimization baseline (Postgres-tuned asymmetry noted). Canonical win: JOB query 5b's LIKE-predicate failure case (concepts/like-predicate-cardinality-estimation-failure). Introduces systems/databricks-join-order-agent, systems/join-order-benchmark-job, systems/bayesqo, concepts/join-order-optimization, concepts/cardinality-estimation, concepts/llm-agent-as-query-optimizer, concepts/anytime-optimization-algorithm, concepts/offline-query-tuning-loop, concepts/like-predicate-cardinality-estimation-failure, concepts/exploration-exploitation-tradeoff-in-agent-search, patterns/llm-agent-offline-query-plan-tuner, patterns/structured-output-grammar-for-valid-plans, patterns/rollout-budget-anytime-plan-search. Tier-3 Databricks — borderline research/ML post, but content is substantively about database-engine optimizer architecture (join ordering, cardinality estimation, cost models, plan search) rather than ML methodology. Ingested for the architectural framing. Research prototype, not a shipping feature; cost-per-tuned-query not disclosed.)
- 2026-04-17 — sources/2026-04-17-databricks-governing-coding-agent-sprawl-with-unity-ai-gateway (Launch post for Unity AI Gateway coding-agent support. Names coding-agent sprawl — engineers routinely mix Cursor + Codex + Claude Code + Gemini CLI + others in parallel — as the forcing function.
Three-pillar answer: (1) centralised security + audit via Unity Catalog + MLflow tracing + single-SSO across all coding tools + Databricks-managed MCP servers; (2) single bill + cost controls via Foundation Model API first-party inference + BYO external capacity + per-developer (not per-tool) budgets — patterns/unified-billing-across-providers; (3) full observability via OpenTelemetry → Unity-Catalog-managed Delta tables joinable with HR/PR-velocity data — patterns/telemetry-to-lakehouse. Launch-day clients: Cursor, Codex CLI, Gemini CLI; Claude Code via MLflow 3 tracing. Structural mirror of the Cloudflare internal AI engineering stack shape with different substrates. Extends the wiki's existing AI-gateway provider-abstraction pattern along two new axes: coding-tool clients as first-class, and MCP-server governance as a peer concern to LLM-call governance. Introduces systems/unity-ai-gateway, systems/databricks-foundation-model-api, systems/cursor, systems/claude-code, systems/codex-cli, systems/gemini-cli, concepts/coding-agent-sprawl, concepts/centralized-ai-governance, patterns/central-proxy-choke-point, patterns/telemetry-to-lakehouse, patterns/unified-billing-across-providers. Tier-3 Databricks — ingested for framing + integration architecture, not internals; product-announcement post, gateway routing/fallback/rate-limiter mechanics not disclosed.)
- 2026-04-20 — sources/2026-04-20-databricks-take-control-customer-managed-keys-for-lakebase-postgres (Customer-Managed Keys rollout for systems/lakebase — Databricks' serverless Neon-descended Postgres. Three-level concepts/envelope-encryption hierarchy CMK → KEK → DEK; CMK held in customer's cloud KMS (systems/aws-kms / systems/azure-key-vault / systems/google-cloud-kms); concepts/cryptographic-shredding on revocation across both persistent (Pageserver+Safekeeper) and ephemeral (Postgres compute VM) layers; patterns/per-boot-ephemeral-key pattern for VM-local state; seamless key rotation as a property of the envelope hierarchy; Account↔Workspace delegation for separation of duties; auditability in customer KMS tenancy. Enterprise tier. Introduces systems/lakebase, systems/pageserver-safekeeper, systems/aws-kms, systems/azure-key-vault, systems/google-cloud-kms, concepts/envelope-encryption, concepts/cmk-customer-managed-keys, concepts/cryptographic-shredding, patterns/per-boot-ephemeral-key.)
- 2026-04-20 — sources/2026-04-20-databricks-mercedes-benz-cross-cloud-data-mesh
(Mercedes-Benz cross-cloud data mesh on Unity Catalog + Delta
Sharing + Delta Deep Clone; AWS Iceberg-on-Glue producer ↔ Azure
Delta-on-ADLS consumers; hybrid "live share vs. incremental replica"
tier; 66 % egress cost reduction on first 10 data products, ~93 %
projected annual at 50 use cases; weekly-load → every-second-day
freshness; DDX self-service orchestrator, DABs + Azure DevOps
deploys, Sync-Job-bytes → producer chargeback,
VACUUM for GDPR delete propagation on replicas; introduces systems/delta-sharing, systems/delta-lake, systems/mercedes-benz-data-mesh, concepts/data-mesh, concepts/egress-cost, concepts/hub-and-spoke-governance, concepts/cross-cloud-architecture, patterns/cross-cloud-replica-cache, patterns/chargeback-cost-attribution.)
- 2026-01-13 — sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder (open-sourcing Dicer, Databricks' auto-sharder; dynamic slice-range sharding with hot-key isolation/replication, eventually-consistent Assignments, state transfer across reshards; positioned vs. prior art Slicer / Centrifuge / Shard Manager; three production case studies — Unity Catalog 90–95% hit rate, SQL Query Orchestration zero-downtime scaling, Softstore 85% hit rate across rolling restarts via state transfer; use cases include LLM KV cache / LoRA-adapter GPU placement, batch aggregation, soft leader selection, rendezvous coordination.)
- 2025-12-03 — sources/2025-12-03-databricks-ai-agent-debug-databases (internal AI agent platform Storex for DB debugging across thousands of instances / 3 clouds / hundreds of regions / 8 regulatory domains; central-first sharded foundation; DSPy-inspired tools-as-functions framework; snapshot-replay validation with judge LLM; specialized per-domain agents; hackathon → platform journey; claimed up to 90% investigation-time reduction, <5 min new-hire ramp-up.)
- 2025-10-01 — sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing (proxyless client-side L7 LB + custom xDS EDS; P2C + zone-affinity with spillover; rejected Istio / headless services; 20% pod-count reduction; surfaced cold-start problem that long-lived L4 LB had hidden.)
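The "P2C + zone-affinity with spillover" selection named in the 2025-10-01 entry can be sketched as follows. This is a minimal illustration under stated assumptions: the `Endpoint` type, the `pick_endpoint` function, and the `spill_threshold` rule are hypothetical, since the entry does not record how Databricks decides when to spill across zones.

```python
import random
from dataclasses import dataclass


@dataclass
class Endpoint:
    address: str
    zone: str
    in_flight: int = 0  # outstanding requests, tracked by the client itself


def pick_endpoint(endpoints, client_zone, spill_threshold=10, rng=random):
    """Power-of-two-choices with a same-zone preference.

    Assumption: spill to all zones when there is no local endpoint or when
    even the least-loaded local endpoint has reached spill_threshold.
    """
    if not endpoints:
        raise ValueError("no endpoints available")
    local = [e for e in endpoints if e.zone == client_zone]
    if local and min(e.in_flight for e in local) < spill_threshold:
        candidates = local      # zone affinity: stay in the client's zone
    else:
        candidates = endpoints  # spillover: consider every zone
    if len(candidates) == 1:
        return candidates[0]
    # P2C: sample two distinct candidates and keep the less loaded one.
    first, second = rng.sample(candidates, 2)
    return first if first.in_flight <= second.in_flight else second


fleet = [
    Endpoint("10.0.0.1:443", "us-west-2a", in_flight=1),
    Endpoint("10.0.0.2:443", "us-west-2a", in_flight=5),
    Endpoint("10.0.1.1:443", "us-west-2b", in_flight=0),
]
print(pick_endpoint(fleet, "us-west-2a").address)  # 10.0.0.1:443
```

Sampling two candidates instead of scanning for the global minimum keeps each pick cheap, and it avoids every client piling onto the same least-loaded pod when their load views are slightly stale.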
Ingest posture¶
Tier-3 filter applies: by default, skip product PR, acquisition news,
and pure ML-methodology posts. Ingest when the article covers
distributed-systems internals, scaling trade-offs, Kubernetes / network
infrastructure, production incidents, storage/streaming design, or
data-platform internals (Photon, Delta Lake, Unity Catalog — when
architecturally substantive). Several 2025 posts have already been
reviewed and logged as off-topic in log.md (TAO LLM-tuning, Neon
acquisition PR, Data Intelligence for Marketing launch).