Databricks — The Rosetta Stone of CPS: Inside Claroty's AI-Powered Library
Databricks Blog co-marketing post (2026-05-13) describing how Claroty's AI-Powered CPS Library — the asset-identity layer for Claroty's xDome CPS-protection platform — is built on Databricks. Tier-3 vendor-blog source. Borderline-include scope decision: heavy GenAI MVP co-marketing framing (~70%), but the Under the Hood / Data Engineering at Scale / Multi-Agent Intelligence / Innovation through Databricks Capabilities sections (~30%) name a specific architectural shape worth canonicalising — a hybrid Entity Resolution pipeline that combines classical ER with an orchestrated multi-agent system (NLP / Reasoning / Human-in-the-loop), all plumbed onto the Medallion Architecture over Delta Lake with Delta Change Data Feed driving a dynamic mapping registry, and a real production observation about vector-search endpoints lacking scale-to-zero for bursty workloads.
One-paragraph summary
Cyber-Physical Systems (CPS) — "the machinery that powers our factories, hospitals, and critical infrastructures" — suffer from an industry-scale identity crisis: per Claroty's Team82 research, 88% of CPS assets do not transmit an exact product code, and 76% use product codes that differ from the vendor's official records. The asset's "digital birth certificate" is missing. Without it, vulnerability management is manual detective work — a string like Rockwell Automation's 1769-L36ERMS/B has to be cross-referenced by hand to the Compact GuardLogix 5370 commercial name, then to CISA / NVD advisories, then to a CPE entry that may or may not match the exact sub-type and firmware version. Claroty's AI-Powered CPS Library automates this as an Entity Resolution problem at catalog scale (17 million+ assets). The architecture is deliberately hybrid: "battle-tested, classic ER methods with the cognitive power of Generative AI." Bronze raw payloads land in append-only Delta tables; a promotion pipeline reads Delta Change Data Feed and applies a mapping registry to canonicalise into a governed schema, with Delta schema evolution + time travel preserving an "unbreakable chain of custody" back to the original raw artifact and the specific mapping version that classified it. The agent layer is an orchestrated multi-agent system of three roles: NLP Agents parse mixed-format protocol strings and obscure software markers; Reasoning Agents apply confidence scoring and statistical tests to weigh evidence; Human-in-the-loop flags low-confidence mappings for SME review with the corrections fed back into training. Domain-specific medical embedding models deploy as custom Model Serving endpoints (generic embeddings were "insufficient for the level of precision we require"); a Knowledge Assistant + Information Extraction agent ingest proprietary documentation; MLflow + LLM-as-a-Judge provides continuous evaluation against concept drift in production. 
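The Bronze-to-canonical promotion step described above can be sketched in plain Python. This is a minimal simulation under stated assumptions, not Claroty's pipeline and not the Delta Lake API: the change rows are hand-written dicts that borrow the `_change_type` and `_commit_version` column names from Delta Change Data Feed, while `MAPPING_REGISTRY`, `CanonicalRecord`, `promote`, the raw field names, and the firmware value are hypothetical. It illustrates how stamping each promoted record with its source commit and mapping version gives the kind of traceability the post describes.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical mapping registry: version -> {canonical_field: raw_field}.
MAPPING_REGISTRY: dict[int, dict[str, str]] = {
    1: {"vendor": "mfr", "model": "prod_code"},
    2: {"vendor": "mfr", "model": "prod_code", "firmware": "fw"},
}


@dataclass(frozen=True)
class CanonicalRecord:
    fields: dict[str, Any]
    source_commit_version: int  # Bronze commit the raw evidence came from
    mapping_version: int        # registry version that classified it


def promote(change_rows: list[dict[str, Any]], mapping_version: int) -> list[CanonicalRecord]:
    """Apply one registry version to the inserted rows of a change feed."""
    mapping = MAPPING_REGISTRY[mapping_version]
    promoted = []
    for row in change_rows:
        # Bronze is append-only, so only inserts are promoted; other
        # change types are skipped in this sketch.
        if row["_change_type"] != "insert":
            continue
        payload = row["payload"]
        fields = {canon: payload.get(raw) for canon, raw in mapping.items()}
        promoted.append(
            CanonicalRecord(fields, row["_commit_version"], mapping_version)
        )
    return promoted


# Hand-written change rows; the payload values are illustrative only.
changes = [
    {"_change_type": "insert", "_commit_version": 7,
     "payload": {"mfr": "Rockwell Automation",
                 "prod_code": "1769-L36ERMS/B", "fw": "33.011"}},
    {"_change_type": "delete", "_commit_version": 8, "payload": {}},
]

records = promote(changes, mapping_version=2)
for record in records:
    print(record)
```

Re-running `promote` with a different `mapping_version` over the same change rows yields records that differ only in their classification and version stamp, which is what makes a past classification reproducible.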
Lakebase (managed Postgres on the lakehouse) holds transactional asset mappings for low-latency queries with strict constraints. Databricks Apps (React/Streamlit + Lakebase) hosts the SME human-in-the-loop UI. Reported outcomes: "25% improvement in vulnerability identification accuracy" and "56% of analyzed devices received new or updated security recommendations for outdated firmware that were previously invisible." Operational friction surfaced explicitly: vector search endpoints currently lack a scale-to-zero model, "a nuance particularly relevant for the bursty, event-driven nature of industrial security data," requiring specific architectural patterns to maintain ROI during idle periods.
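The "strict constraints" point can be made concrete with a small sketch. The post does not publish Lakebase's schema, so everything here is an assumption: the table and column names, the `CPS-0001` identifier format, and the `confidence` column are hypothetical, and Python's built-in `sqlite3` stands in for Postgres purely so the example runs anywhere. The constraint kinds shown (primary key, foreign key, check) exist in both engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when asked; Postgres does so by default.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE mapping_registry (
    version INTEGER PRIMARY KEY
);
CREATE TABLE asset_mapping (
    cps_id          TEXT PRIMARY KEY,
    raw_identifier  TEXT NOT NULL,
    mapping_version INTEGER NOT NULL REFERENCES mapping_registry(version),
    confidence      REAL NOT NULL CHECK (confidence BETWEEN 0 AND 1)
);
""")
conn.execute("INSERT INTO mapping_registry VALUES (1)")
conn.commit()


def try_insert(row: tuple) -> str:
    """Insert one mapping; report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO asset_mapping VALUES (?, ?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    try_insert(("CPS-0001", "1769-L36ERMS/B", 1, 0.97)),  # valid row
    try_insert(("CPS-0001", "1769-L36ERMS", 1, 0.90)),    # duplicate CPS-ID
    try_insert(("CPS-0002", "1769-L36ERMS", 99, 0.90)),   # unknown mapping version
    try_insert(("CPS-0003", "1769-L36ERMS", 1, 1.5)),     # confidence out of range
]
for result in results:
    print(result)
```

Only the first insert is accepted. The other three are rejected by the database itself rather than by application code, which is the property that makes a transactional store attractive for an Entity Resolution catalog.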
Key takeaways
- CPS identity crisis is an Entity Resolution problem at catalog scale. "At its core, this is an Entity Resolution (ER) challenge and the purpose of the system is to solve the identity crisis by matching and consolidating noisy real-world data into a single source of truth." Catalog size: 17 million+ assets. Quantified pain: 88% of assets don't transmit an exact product code, 76% transmit codes that differ from the vendor's official records. The named output is a CPS-ID ("the new industry standard for cyber-physical system identity by Claroty"), each ID "backed by rigorous data integrity and cross-silo intelligence."
- Hybrid ER architecture: classic methods + GenAI. "To achieve high-fidelity deterministic traceability, we moved beyond standard matching algorithms, engineering a hybrid architecture that combines battle-tested, classic ER methods with the cognitive power of Generative AI." Canonical instance of patterns/hybrid-classical-er-plus-genai: classical statistical inference + domain-guided logic triangulates exact identity from minimal data; GenAI agents handle the cognitive parsing of mixed-format vendor strings and unstructured documentation. The two halves are complementary, not alternatives.
- Medallion + Delta CDF + mapping registry as the data substrate. "The journey begins in the Bronze layer, where raw, heterogeneous JSON payloads are captured in append-only Delta tables. From there, a promotion pipeline — reading from Delta Change Data Feed (CDF) — dynamically applies a mapping registry to transform raw evidence into a governed, canonical schema. By utilizing Delta Lake's schema evolution and time travel, Claroty maintains an unbreakable chain of custody; every asset record is traceable back to its original raw artifact and the specific mapping version that classified it, ensuring full auditability in even the most sensitive industrial environments." This is the Medallion pattern with CDF as the layer-transition trigger, and the mapping registry is the versioned classification logic — both the data and its classifier are auditable through time.
- Orchestrated three-role agent system: NLP / Reasoning / HITL. "Rather than relying on a single monolithic model, Claroty engineered an Orchestrated Multi-Agent System, a synchronized network where specialized AI agents collaborate to interpret complex signals." The roles are explicit:
  - NLP Agents — "parse complex, mixed-format data — including protocol-derived naming strings and obscure software markers that standard models often miss."
  - Reasoning Agents — "apply confidence scoring and statistical tests to weigh evidence, discriminating high-fidelity signals from noise to ensure data integrity."
  - Human-in-the-loop (HITL) — "a critical feedback mechanism that flags low-confidence mappings for expert to review. The output from these sessions is fed back into the system, retraining the models for continuous accuracy gains."
  Canonicalised as patterns/orchestrated-multi-agent-entity-resolution — the role-decomposition is the recurring shape, not the specific agent count.
- Domain-specific embeddings via Model Serving custom endpoints; fine-tuning is the explicit roadmap. "To tackle the nuances of healthcare and OT, generic embeddings were insufficient for the level of precision we require. We identified that for the 'Universal Translator' to truly succeed, generic RAG architectures must evolve into domain-specific frameworks. We currently bridge this gap by deploying best-in-class medical embedding models as custom endpoints using Databricks Model Serving. However, as we look to the future, we see fine-tuning these models as the next logical step to ensure our agents understand the most obscure industrial dialects with deterministic accuracy." New Model Serving face on the wiki: not just real-time LLM inference at 200K QPS (the Superhuman face) but a substrate for hosting domain-specific embedding models as custom endpoints when generic embeddings underspecify the domain.
- Knowledge Assistant + Information Extraction agent for proprietary documents. "We harnessed the Knowledge Assistant to build robust RAG (Retrieval-Augmented Generation) systems capable of ingesting vast amounts of proprietary documentation. By utilizing an Information Extraction agent, we can structurally parse unstructured proprietary documents, turning raw text into actionable intelligence for the CPS Library." Named Databricks GenAI primitives layered on top of UC + Delta + Model Serving; structurally similar to MapAid's groundwater-archive extraction shape but with a different end-state (asset identity, not document search).
- MLflow LLM-as-a-Judge as the production-monitoring substrate against concept drift. "We implemented a comprehensive evaluation strategy using 'LLM as a Judge' alongside manual labeling sessions. MLflow capabilities allowed us to constantly evaluate model performance to prevent concept drift." And in the ETL pipeline: "To keep this pipeline reliable at scale, we use an LLM as a Judge approach to continuously score the quality of our own LLM outputs. Instead of relying only on fully labeled ground truth — which is often missing or ambiguous in real-world CPS data — we let a dedicated judge model review another model's response and decide whether it looks acceptable. The judge's job is simple and conservative: mark each result as pass, looks correct, fail, looks wrong, or unknown, not enough information." This canonicalises a third LLM-as-judge face: not eval-harness (Storex) and not ranking-data labelling (Dropbox Dash) but continuous-production-monitoring against concept drift when ground-truth is missing or ambiguous, with the pass/fail/unknown ternary explicitly conservative for regulated CPS data.
- Lakebase as the transactional ER layer. "For the 'Library' to work, the data must be consistent and highly available. Claroty integrates Lakebase, a fully managed transactional data layer on Databricks. Lakebase is built on Postgres and provides the low-latency performance required for real-time queries while maintaining a seamless link to the broader Lakehouse for analytical processing, allowing strict constraints to make sure our data keeps its high quality and ensuring that asset mappings remain accurate even as configurations drift." New Lakebase face: not just app-tier state-store (clinical-ops face) or branching/PITR (Backstage face) but the transactional asset-mapping store for an Entity Resolution catalog — Postgres constraints are explicitly load-bearing for ER data integrity (no duplicate CPS-IDs, FK to mapping-registry version, etc.).
- Databricks Apps + Lakebase as the HITL substrate. "With the Databricks App and Lakebase, we enable a transparent view and a seamless 'human-in-the-loop' feedback cycle. This intuitive interface allows domain experts to review classifications, correct and enrich entities, and feed high-fidelity, validated data back into our MLflow pipelines and R&D migration, ensuring the system grows smarter and more accurate over time." New Databricks Apps face: HITL UI for ER over Lakebase as the operational state. Composes with the single-platform application architecture thesis from the 2026-05-13 clinical-ops source — same shape, different workflow (SME-driven ER feedback vs site-feasibility recommendation).
- Vector-search endpoints currently lack scale-to-zero — the explicit production-cost observation. "One area of strategic focus is the cost-efficiency of our Vector Search indices. While the performance is world-class, the current lack of a 'scale-to-zero' model for vector endpoints — a nuance particularly relevant for the bursty, event-driven nature of industrial security data — requires us to design specific architectural patterns to maintain high ROI during idle periods." Canonicalised as concepts/vector-search-no-scale-to-zero — a real production pain point with a concrete economic implication for any bursty / event-driven workload running over hosted vector search. Structurally analogous to concepts/gpu-scale-to-zero-cold-start (the cold-start cost of getting a GPU back from idle) but at the vector index level.
- Lakeflow Jobs orchestrating CSAF→Delta ETL with AI Functions. "To handle the vast amount of information from various sources, Claroty uses Lakeflow Jobs to orchestrate the full process - from raw data to a well structured table. One of our pipelines orchestrates an ETL process that parses CSAF, a JSON formatted security advisory, into a tabular structure. In this process, each step reads and writes entries into a dedicated delta table. In this ETL, and in many more use cases, we use LLMs to enrich the data - from classification tasks and AI Functions like ai_query, using various Serving endpoints and MLflow to evaluate the answers we get from the LLM, using statistic metrics and LLM-as-a-judge, and monitor the cost." New Lakeflow Jobs face: orchestrating a security-advisory ETL where each step lands into a Delta table and ai_query calls fan out to Serving endpoints, with MLflow capturing eval metrics and cost telemetry per step. Composes with LLM-judge as inline pipeline stage.
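The CSAF-to-table step in the last takeaway can be illustrated with a small flattening sketch. It assumes the CSAF 2.0 JSON layout (`document.tracking.id`, nested `product_tree.branches`, and `vulnerabilities[].product_status`) and ignores other parts of the format such as `full_product_names`. The function names, the output columns, and the sample advisory are hypothetical; the post does not show Claroty's parser, and the LLM enrichment via `ai_query` is left out entirely.

```python
from typing import Any, Iterator


def iter_products(branches: list[dict[str, Any]]) -> Iterator[tuple[str, str]]:
    """Walk nested product_tree branches, yielding (product_id, name)."""
    for branch in branches:
        product = branch.get("product")
        if product:
            yield product["product_id"], product["name"]
        yield from iter_products(branch.get("branches", []))


def flatten_csaf(advisory: dict[str, Any]) -> list[dict[str, str]]:
    """Emit one row per (vulnerability, status, product) in the advisory."""
    tree = advisory.get("product_tree", {})
    names = dict(iter_products(tree.get("branches", [])))
    advisory_id = advisory["document"]["tracking"]["id"]
    rows = []
    for vuln in advisory.get("vulnerabilities", []):
        for status, product_ids in vuln.get("product_status", {}).items():
            for pid in product_ids:
                rows.append({
                    "advisory_id": advisory_id,
                    "cve": vuln.get("cve", ""),
                    "status": status,
                    "product_id": pid,
                    "product_name": names.get(pid, ""),
                })
    return rows


# Hand-written, heavily trimmed sample; not a real published advisory.
sample = {
    "document": {"tracking": {"id": "EXAMPLE-0001"}},
    "product_tree": {"branches": [{
        "category": "vendor", "name": "Rockwell Automation",
        "branches": [{
            "category": "product_name", "name": "Compact GuardLogix 5370",
            "product": {"product_id": "CSAFPID-0001",
                        "name": "Compact GuardLogix 5370"},
        }],
    }]},
    "vulnerabilities": [{
        "cve": "CVE-2020-6998",
        "product_status": {"known_affected": ["CSAFPID-0001"]},
    }],
}

rows = flatten_csaf(sample)
for row in rows:
    print(row)
```

Each output row is flat enough to append to a table, which matches the post's description of every step reading from and writing to a dedicated Delta table.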
Operational numbers cited
- 17 million+ assets in the global CPS catalog backed by the library.
- 88% of CPS assets do not transmit an exact product code.
- 76% of CPS assets transmit product codes that differ from the vendor's official records (per Claroty Team82 research).
- 25% improvement in vulnerability identification accuracy attributed to identifying specific sub-components and firmware trees.
- 56% of analyzed devices received new or updated security recommendations for outdated firmware "that were previously invisible to security teams."
- Worked example identifier: Rockwell Automation 1769-L36ERMS/B (CIP protocol model number) → Compact GuardLogix 5370 controller → CVE-2020-6998 (versions 33 and earlier).
- Industry recognition: Claroty named "a Leader in the 2025 Gartner® Magic Quadrant™ for CPS Protection Platforms, positioned highest for 'Ability to Execute'."
Caveats
- Tier-3 vendor co-marketing. Heavy framing as a Databricks GenAI MVP success story; numbers (17M+, 25%, 56%, 88%, 76%) are reported without methodology, baseline, or external validation. The "first-of-its-kind" / "Revolutionary" / "Universal Translator" / "Rosetta Stone" framing is marketing, not architecture. The architecture content (~30% of body) is what the wiki ingested; the marketing wrapper is not.
- No latency, throughput, or cost numbers disclosed. The vector-search-no-scale-to-zero observation is qualitative — the post says it requires "specific architectural patterns" to maintain ROI, but does not name those patterns. No QPS, p99, ingestion-rate, or training-budget numbers.
- No agent-count or model-version disclosures. The multi-agent system is described at the role-decomposition altitude (NLP / Reasoning / HITL); the actual count of agents, model families, embedding dimensionality, and per-role prompt structure are not disclosed.
- No comparison to non-hybrid baselines. The 25% accuracy improvement attribution is given without naming the baseline (classical ER alone? generic embeddings? prior manual workflow?). Similarly the 56% new-recommendations number does not compare against a non-hybrid baseline.
- HITL feedback loop closure is asserted, not measured. "Output from these sessions is fed back into the system, retraining the models for continuous accuracy gains" — no cadence, sample-volume, or accuracy-delta-per-cycle numbers.
- No deployment-region, multi-tenancy, or isolation-architecture disclosure. The catalog backs Claroty's xDome customer fleet but the multi-tenant / per-customer-data-residency posture is not described.
Source
- Original: https://www.databricks.com/blog/rosetta-stone-cps-clarotys-ai-powered-library
- Raw markdown: raw/databricks/2026-05-13-the-rosetta-stone-of-cps-clarotys-ai-powered-library-48c4fccf.md
- Referenced inside the post: Claroty Team82 research report on the CPS identity crisis; Databricks AI Functions documentation; MLflow GenAI monitoring native capabilities; Gartner® Magic Quadrant™ for CPS Protection Platforms (2025).
Related
- systems/claroty-cps-library — the system this source canonicalises.
- concepts/entity-resolution — the core architectural problem class the system solves.
- concepts/delta-change-data-feed — the Delta capability driving the layer-transition pipeline.
- concepts/medallion-architecture — the data-substrate pattern (Bronze raw → governed canonical schema).
- concepts/schema-evolution — the audit-chain enabler alongside time travel.
- concepts/llm-as-judge — the production-monitoring substrate for concept drift.
- concepts/vector-search-no-scale-to-zero — the production-cost observation.
- patterns/hybrid-classical-er-plus-genai — the central architectural shape.
- patterns/orchestrated-multi-agent-entity-resolution — the multi-agent decomposition pattern.
- systems/delta-lake · systems/unity-catalog · systems/mlflow · systems/lakebase · systems/databricks-apps · systems/databricks-model-serving · systems/lakeflow-jobs · systems/databricks-ai-functions — the Databricks primitives composed into the architecture.
- companies/databricks — source publisher.