SYSTEM Cited by 9 sources
Databricks
Databricks is a managed lakehouse and data + AI platform that combines Delta Lake storage, Spark compute, SQL, ML, and governance (Unity Catalog) into one multi-cloud substrate.
Stub page. Databricks' own engineering blog is covered under companies/databricks and drills into substantially deeper internals (DICER auto-sharder, intelligent K8s load balancing, Lakebase Postgres CMK, Mercedes-Benz cross-cloud data mesh, MLflow 3 judges, Storex AI-agent database debugger, etc.). This page exists because Databricks is named in the Santander Catalyst post as an integration target.
Seen in
- sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance — Serverless-Spark compute-substrate framing. Seventh Databricks-platform face on the wiki (after Redpanda/Iceberg analytics + Santander integration + Zalando Spark workload + multimodal-substrate + query-optimizer research + declarative CDC + approximate-analytics primitives). First wiki instance of Databricks Serverless Compute as a composed product (systems/databricks-serverless-compute) built on three new wiki systems: Spark Connect (gRPC client-server rearchitecture of Spark's driver), Serverless Gateway (three-signal workload-aware routing), and Serverless Autoscaler (two-axis adaptive autoscaling with OOM-aware VM restart). Canonicalises six new concepts (concepts/stability-as-system-property, concepts/query-size-from-logical-plan, concepts/adaptive-oom-recovery, concepts/vertical-and-horizontal-autoscaling, concepts/utilization-vs-predictability-tradeoff, concepts/client-server-decoupling) and three new patterns (patterns/grpc-decoupled-driver-client, patterns/multi-signal-workload-aware-gateway-routing, patterns/oom-aware-vm-restart-autoscaling). Production numbers (25+ Spark runtime upgrades/yr at 99.998% success across 2B+ workloads; CKDelta 12–15× speedup; Unilever 2–5× faster + 25% cost reduction; HP 32% savings + 36% runtime reduction) cite the SIGMOD/PODS '25 companion paper "Blink Twice: Automatic Workload Pinning and Regression Detection for Versionless Apache Spark using Retries". Paired with the 2026-05-05 Pantheon/Hydra ingest, this forms the 2026-05 Databricks architecture double-ingest breaking the prior late-April / early-May Tier-3 Databricks marketing skip streak.
- sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Approximate-analytics-primitive-vendor framing. Databricks SQL / DataFrame / Structured Streaming expose four new Apache DataSketches-backed aggregates: KLL (percentiles), Theta (distinct + set algebra), approx top-K (heavy hitters), Tuple (distinct + metric aggregation). Intended workflow: sketch as Delta BLOB column — build once during ETL, merge on read in milliseconds. Canonicalises decision-support vs audit as the architectural gate on approximate analytics, and mergeability as the storage-primitive-making property. Community contribution: Christopher Boumalhab implemented Theta + Tuple sketch functions in upstream Apache Spark. Fifth Databricks-platform face on the wiki (Redpanda/Iceberg analytics + Santander integration + Zalando Spark workload + multimodal-substrate + approximate-analytics-primitives).
- sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai — Multimodal-substrate framing. Databricks pitches the lakehouse as the unifier across modalities (genomics, imaging features, clinical-notes entities, wearables streams) under one Unity Catalog governance surface; modality-specific tooling (Glow, Mosaic AI Vector Search, Lakeflow SDP) is layered above. First wiki ingest naming Glow / Mosaic AI Vector Search / Lakeflow SDP. Fourth Databricks-platform face on the wiki (Redpanda/Iceberg analytics + Santander integration + Zalando Spark workload + multimodal-substrate).
- sources/2026-01-06-redpanda-build-a-real-time-lakehouse-architecture-with-redpanda-and-databricks — Joint-vendor framing as analytics + governance target of Redpanda Iceberg Topics streams. Tech-talk recap with Jason Reed (Databricks, formerly Netflix data team) as the architectural voice on Iceberg's origin. Databricks' role in the real-time lakehouse architecture is the Unity-Catalog-governed analytics / AI compute layer; streaming data lands in Iceberg via Redpanda's broker-native integration and becomes queryable by "any Iceberg-compatible engine connected to Unity Catalog". Three-system labour division verbatim: "Redpanda delivers real-time performance and reliability at scale. Iceberg provides an open, transactional table format optimized for analytics. Unity Catalog adds governance, optimization, federation, and lifecycle management across the entire system." First wiki ingest with a Databricks speaker cited on architecture; Reed's "The data shows up already structured, already governed, and already queryable" canonicalises the Databricks-side framing of Iceberg topics + broker-native catalog registration as the zero-ETL mechanism.
- sources/2026-02-26-aws-santander-catalyst-platform-engineering — Santander's modern data platform workload on Catalyst includes "built-in integration with Databricks" alongside data lakes, automated ETL, centralized data catalog, and segregated experimentation environments; the net effect is ~3,000 monthly data-experimentation provisioning tickets eliminated.
- — Zalando ML Platform uses Databricks as the Spark step inside systems/zflow workflows. Two distinct roles in the Payments risk-scoring pipeline: (1) training-data preprocessing step (Databricks cluster runs heavy feature derivation on historical data); (2) model performance report step (Databricks job generates a PDF of PR-AUC / ROC / custom metrics after evaluation). First wiki instance of Databricks as one of the interchangeable compute substrates inside an internally-authored ML workflow library built on AWS Step Functions.
- sources/2022-04-18-zalando-zalandos-machine-learning-platform — Databricks as Zalando's big-data experimentation substrate, distinct from the workload-class role above. In addition to being a zflow step target, Databricks is the second of three experimentation substrates Zalando offers ML practitioners — alongside Datalab (prototyping, quick feedback) and the GPU HPC cluster (CV / large-model training). "While Datalab is well suited for prototyping and getting quick feedback, it's not always enough, especially when big data is involved. Apache Spark is much better suited for that purpose, and Zalando users can access it via Databricks." Canonical first wiki instance of Databricks in the experimentation-substrate role (not just a production pipeline step).
- sources/2025-06-29-zalando-building-a-dynamic-inventory-optimisation-system-a-deep-dive — Databricks as the pre-processing tier of Zalando's ZEOS inventory-optimisation pipelines. Both the demand forecaster and the replenishment recommender run PySpark + Spark-SQL feature pre-processing on transient Databricks Job clusters writing to Delta Lake ("data pre-processing layer"). First wiki instance of Databricks + Delta Lake as the horizontal tier of a two-tier feature-engineering pipeline (the vertical tier is SageMaker Processing Job); see concepts/data-preprocessing-vs-data-transformation-split and patterns/pyspark-preprocessing-to-python-transformation-split. Also canonicalises patterns/transient-databricks-cluster-per-run — "every run triggers dedicated Databricks Job clusters and SageMaker processing/training jobs. This ensures robust and independent runs and resources, i.e. a failure of one execution in the Databricks job cluster does not impact a parallel execution."
- — Databricks managed Delta Sharing + Unity Catalog as Zalando Partner Tech's external-partner data-sharing platform. Sixth Databricks-platform face on the wiki (after Redpanda/Iceberg analytics + Santander integration + Zalando Spark workload + multimodal-substrate + query-optimizer research + declarative CDC). First wiki instance of Databricks' managed Delta Sharing service — prior Delta Sharing coverage (sources/2026-04-20-databricks-mercedes-benz-cross-cloud-data-mesh) was internal cross-cloud data mesh; this is external-partner B2B data sharing. Scale: 200+ datasets / up to 200TB / >€5B GMV / thousands of partners. Load-bearing Databricks-role framing: Zalando explicitly chose managed Databricks Delta Sharing over self-hosting the open-source reference server for the "operational excellence" of a Unity-Catalog-governed, audit-logged, credential-managed service. Quote: "the managed service offered something invaluable: the operational excellence we needed for a production system serving critical partner relationships" + "operational excellence often trumps technical purity". Canonical instance of patterns/managed-services-over-custom-ml-platform generalised beyond ML. Databricks components used: Unity Catalog (governance plane — catalogued datasets + Shares + Recipients), Delta Sharing protocol (open wire format), recipient tokens (Databricks-issued activation URLs + credential profile files), Delta Lake (underlying table format). Roadmap item: Databricks OIDC federation for recipient authentication, which would remove provider-issued intermediate tokens. Introduces systems/zalando-partner-data-sharing-platform as the new wiki system deployed on top of this Databricks stack.
- (Many more via companies/databricks.)
- sources/2026-04-22-databricks-are-llm-agents-good-at-join-order-optimization — Databricks' own query engine as the execution substrate for an LLM-agent join-order optimizer experiment. Unlike the other Seen-in entries where Databricks is an integration target or workload substrate, here Databricks is the engine being optimized: the UPenn-collaboration agent's single tool (execute_plan) runs against Databricks and returns runtime + subplan sizes equivalent to EXPLAIN EXTENDED. First wiki instance of Databricks' query planner being explicitly named as a target of external optimization research. Evaluation: 113-query JOB benchmark on 10×-scaled IMDb, 1.288× geomean workload speedup, 41% P90 latency drop — outperforming perfect cardinality estimates, smaller LLMs, and the prior-art BayesQO Bayesian-optimization baseline. Canonical win: query 5b (5-way join with LIKE predicates on VHS release notes) where Databricks' default optimizer misjudges selectivity due to concepts/like-predicate-cardinality-estimation-failure. Introduces systems/databricks-join-order-agent (prototype), systems/join-order-benchmark-job (benchmark), and systems/bayesqo (baseline) as new wiki systems, plus the full LLM-agent-as-query-optimizer / offline-tuning-loop / [[patterns/llm-agent-offline-query-plan-tuner]] concept/pattern stack.
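The "build once during ETL, merge on read" workflow in the sketch-functions entry above rests on mergeability: a sketch of a union can be computed from the sketches of its parts. The following is a minimal sketch of that property using a K-minimum-values (KMV) distinct-count estimator, a simple relative of Theta sketches. It is not the Apache DataSketches implementation or any Databricks function; the class name `KMVSketch` and its methods are hypothetical and exist only for illustration.

```python
import hashlib


class KMVSketch:
    """Illustrative K-minimum-values distinct-count sketch (hypothetical, not DataSketches)."""

    def __init__(self, k=1024, hashes=()):
        self.k = k
        # Retain only the k smallest distinct 64-bit hash values.
        self.mins = sorted(set(hashes))[:k]

    @staticmethod
    def _hash(item):
        # Deterministic 64-bit hash derived from SHA-256 of the item's string form.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    @classmethod
    def from_items(cls, items, k=1024):
        # "Build once": summarize a partition during ETL.
        return cls(k, (cls._hash(x) for x in items))

    def merge(self, other):
        # "Merge on read": the k smallest hashes of the union are always
        # contained in the two retained lists, so no raw data is needed.
        return KMVSketch(min(self.k, other.k), self.mins + other.mins)

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # under capacity, the count is exact
        # Standard KMV estimator: (k - 1) / (k-th smallest hash scaled to [0, 1)).
        return (self.k - 1) / (self.mins[-1] / 2**64)


# Two overlapping partitions, sketched independently and merged afterwards.
part_a = KMVSketch.from_items(range(0, 10_000))
part_b = KMVSketch.from_items(range(5_000, 15_000))
merged = part_a.merge(part_b)
print(round(merged.estimate()))  # approximate distinct count of the union
```

Merging the two partition sketches yields exactly the same retained hashes as sketching the union directly, which is why a sketch can be stored as a column and combined at query time without rescanning the underlying rows.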
Related
- companies/databricks — Databricks' own engineering blog as a primary source of architectural content
- systems/unity-catalog, systems/delta-lake, systems/mlflow, systems/lakebase — named Databricks systems across other ingested sources
- sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines — Declarative-CDC framing: Databricks' authoring-layer positioning for CDC and SCD pipelines. Fifth Databricks-platform face on the wiki (after Redpanda/Iceberg analytics + Santander integration + Zalando Spark workload + multimodal-substrate + query-optimizer research). Introduces AutoCDC as the declarative API surface hosted on Lakeflow SDP runtime, and Genie Code as the AI-codegen client that produces AutoCDC declarations rather than hand-rolled MERGE. Runtime improvements since Nov 2025: 71% / 96% perf-per-dollar gains on SCD Type 1 / Type 2, propagated universally via the declarative API. Named production adopters (Navy Federal Credit Union, Block, Valora Group) span regulated verticals. Code footprint: 6–10 lines AutoCDC vs 40–200+ lines hand-rolled, with one adopter reporting "1,500 lines → 4 lines". Canonicalises patterns/declarative-cdc-over-hand-rolled-merge as the pattern name, and introduces concepts/snapshot-diff-inference-cdc + concepts/out-of-sequence-cdc-event-handling as CDC sub-concepts. Tier-3 Databricks post ingested for substantive architectural content (declarative vs hand-rolled tradeoff, concrete API parameters, operational numbers, regulated-vertical adopter disclosures) — fails no skip signals.
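To make the out-of-sequence handling in the declarative-CDC entry above concrete, here is a minimal sketch of the logic a hand-rolled SCD Type 1 apply has to get right, written in plain Python. It is not the AutoCDC API and does not reflect how Lakeflow SDP implements it; `ChangeEvent`, `apply_scd1`, `current_rows`, and the `seq` ordering column are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    key: str
    seq: int                # hypothetical ordering column (e.g. a source commit sequence)
    value: Optional[dict]   # None models a delete


def apply_scd1(target, events):
    """Apply events to `target` (key -> (seq, value)), keeping only the latest state per key."""
    for ev in events:
        current = target.get(ev.key)
        if current is not None and current[0] >= ev.seq:
            continue  # late-arriving, older event: ignore it
        # Deletes are stored as tombstones so a stale update cannot resurrect the row.
        target[ev.key] = (ev.seq, ev.value)
    return target


def current_rows(target):
    """Visible rows: every key whose latest event was not a delete."""
    return {key: value for key, (_, value) in target.items() if value is not None}


events = [
    ChangeEvent("a", 2, {"x": 2}),
    ChangeEvent("a", 1, {"x": 1}),   # arrives after seq 2, must be ignored
    ChangeEvent("b", 1, {"y": 1}),
    ChangeEvent("b", 3, None),       # delete
    ChangeEvent("b", 2, {"y": 2}),   # arrives after the delete, must be ignored
]
print(current_rows(apply_scd1({}, events)))
```

Because every decision compares sequence numbers rather than arrival order, applying the same events in any order produces the same visible rows. That sequencing rule is the kind of detail a declarative API can own on the author's behalf.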