SYSTEM Cited by 1 source
Claroty CPS Library¶
The Claroty AI-Powered CPS Library is the asset-identity layer that backs Claroty's xDome Cyber-Physical Systems protection platform. Its job is Entity Resolution at industrial-catalog scale: take noisy, heterogeneous device evidence collected from plant-floor networks (protocol-derived model strings, vendor codes, firmware markers, OEM PDFs) and consolidate it into a single canonical identifier — a CPS-ID — that links to the correct vulnerability records (CVEs / CISA advisories / NVD / CPE entries). Claroty positions the CPS-ID as "the new industry standard for cyber-physical system identity by Claroty."
Catalog scale at first wiki disclosure: 17 million+ assets (Source: sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library).
Why it exists — the CPS identity crisis¶
Per Claroty Team82 research cited in the source:
- 88% of CPS assets do not transmit an exact product code on the network.
- 76% of CPS assets transmit codes that differ from the vendor's official records.
Without a "digital birth certificate," vulnerability management becomes manual detective work. The source's worked example: a device reports the Rockwell Automation model string 1769-L36ERMS/B over the CIP protocol. To turn that into an actionable risk view, security staff would need to manually search Rockwell catalogs to identify it as a Compact GuardLogix 5370 controller, search CISA advisories for that name to find CVE-2020-6998 ("versions 33 and earlier"), and check NVD to see whether the specific CPE matches, only to find a general entry for "CompactLogix 5370 L3" that may or may not include the GuardLogix sub-type. Multiplied across the 17M-asset catalog, this is intractable. The CPS Library "automates this entire process" and turns "a confusing string of characters into a clear, secure setup in milliseconds."
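The manual lookup in the worked example can be sketched as a small resolver. This is an illustration only, not Claroty's implementation: the `CATALOG` and `ADVISORIES` tables, the `resolve` function, and the `CPS-EXAMPLE-0001` identifier format are hypothetical, and treating the text after "/" as a series suffix is an assumption. Only the model string, the product family, and the advisory's "versions 33 and earlier" range come from the source.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reference data; a real catalog would hold millions of entries.
CATALOG = {
    "1769-L36ERMS": {"cps_id": "CPS-EXAMPLE-0001",  # made-up identifier format
                     "family": "Compact GuardLogix 5370"},
}
# Advisory ranges keyed by family: (CVE, highest affected major firmware version).
ADVISORIES = {
    "Compact GuardLogix 5370": [("CVE-2020-6998", 33)],
}

@dataclass
class Resolution:
    cps_id: str
    family: str
    series: Optional[str]
    cves: list

def resolve(model_string: str, firmware_major: int) -> Optional[Resolution]:
    """Map a protocol-reported model string to a canonical record and its CVEs."""
    # Assumption: the part after "/" is a series suffix, not part of the catalog key.
    catalog_number, _, series = model_string.strip().upper().partition("/")
    entry = CATALOG.get(catalog_number)
    if entry is None:
        return None  # unknown string; other evidence would be needed
    cves = [cve for cve, max_major in ADVISORIES.get(entry["family"], [])
            if firmware_major <= max_major]
    return Resolution(entry["cps_id"], entry["family"], series or None, cves)
```

Calling `resolve("1769-L36ERMS/B", 32)` returns a record carrying the family name and the matching advisory, while a firmware major version above 33 yields an empty CVE list. An exact-match dictionary is the simplest possible stand-in; the source's point is that most real devices do not report a string this clean.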
Architecture (as disclosed)¶
The architecture is a hybrid pipeline that composes classical Entity Resolution with a multi-agent GenAI system, both running on Databricks primitives. See patterns/hybrid-classical-er-plus-genai for the named pattern.
Data substrate — Medallion + Delta CDF + mapping registry¶
- Bronze layer — "raw, heterogeneous JSON payloads are captured in append-only Delta tables." Sources include proprietary OT protocols, API calls, and unstructured vendor PDF manuals.
- Promotion pipeline — "reading from Delta Change Data Feed (CDF) — dynamically applies a mapping registry to transform raw evidence into a governed, canonical schema." See concepts/delta-change-data-feed.
- Audit chain — "By utilizing Delta Lake's schema evolution and time travel, Claroty maintains an unbreakable chain of custody; every asset record is traceable back to its original raw artifact and the specific mapping version that classified it, ensuring full auditability in even the most sensitive industrial environments."
- Governance — Unity Catalog provides "the governed data foundation needed to unify these diverse datasets," and Spark-powered pipelines normalise at scale.
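The promotion step and its audit chain can be illustrated in plain Python, without Delta Lake or Spark. Everything here is a hypothetical sketch: the `MAPPING_REGISTRY` layout, the `promote` function, and the field names are assumptions, and the in-memory lists merely stand in for Bronze and canonical tables. The point it shows is that each canonical record keeps the id of its raw artifact and the mapping version that produced it.

```python
from typing import Any

# Hypothetical registry: version -> source type -> {canonical field: raw key}.
MAPPING_REGISTRY = {
    "v1": {"cip": {"model": "product_name", "vendor": "vendor_name"}},
    "v2": {"cip": {"model": "product_name", "vendor": "vendor_name",
                   "firmware": "revision"}},
}

def promote(raw: dict[str, Any], mapping_version: str) -> dict[str, Any]:
    """Turn one raw record into a canonical record that carries its lineage."""
    rules = MAPPING_REGISTRY[mapping_version][raw["source_type"]]
    payload = raw["payload"]
    canonical = {field: payload.get(key) for field, key in rules.items()}
    # Lineage columns: trace back to the raw artifact and the mapping version.
    canonical["raw_id"] = raw["raw_id"]
    canonical["mapping_version"] = mapping_version
    return canonical

# Append-only stand-in for the Bronze layer; values are made up.
bronze = [
    {"raw_id": "r-001", "source_type": "cip",
     "payload": {"product_name": "1769-L36ERMS/B",
                 "vendor_name": "example-vendor", "revision": "32.011"}},
]
canonical_rows = [promote(rec, "v2") for rec in bronze]
```

Re-running `promote` with `"v1"` on the same raw record produces a row without the `firmware` field, which is the kind of difference the recorded mapping version makes explainable after the fact.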
Multi-agent intelligence — three roles¶
The source frames this as an "Orchestrated Multi-Agent System, a synchronized network where specialized AI agents collaborate to interpret complex signals."
- NLP Agents — "Parse complex, mixed-format data — including protocol-derived naming strings and obscure software markers that standard models often miss."
- Reasoning Agents — "Apply confidence scoring and statistical tests to weigh evidence, discriminating high-fidelity signals from noise to ensure data integrity."
- Human-in-the-loop (HITL) — "A critical feedback mechanism that flags low-confidence mappings for expert to review. The output from these sessions is fed back into the system, retraining the models for continuous accuracy gains."
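The hand-off between the Reasoning Agents and the HITL step can be sketched as a confidence gate. The source does not say which scoring method or statistical tests are used; the noisy-OR combination below is a swapped-in illustration, and `REVIEW_THRESHOLD`, `combined_confidence`, and `route` are hypothetical names and values.

```python
from math import prod

REVIEW_THRESHOLD = 0.90  # hypothetical cut-off, not a value from the source

def combined_confidence(signal_scores: list[float]) -> float:
    """Noisy-OR: probability that at least one independent signal is right."""
    return 1.0 - prod(1.0 - s for s in signal_scores)

def route(candidate: str, signal_scores: list[float]) -> dict:
    """Auto-accept confident mappings; queue the rest for expert review."""
    score = combined_confidence(signal_scores)
    status = "auto_accept" if score >= REVIEW_THRESHOLD else "needs_review"
    return {"candidate": candidate, "confidence": round(score, 4), "status": status}
```

Noisy-OR assumes the signals are independent, which protocol strings and firmware markers from the same device often are not, so a real scorer would need something more careful. The routing shape (a score, a threshold, a review queue) is the part that mirrors the description above.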
Domain-specific RAG — embeddings as custom Model Serving endpoints¶
"To tackle the nuances of healthcare and OT, generic embeddings were insufficient for the level of precision we require. We identified that for the 'Universal Translator' to truly succeed, generic RAG architectures must evolve into domain-specific frameworks. We currently bridge this gap by deploying best-in-class medical embedding models as custom endpoints using Databricks Model Serving. However, as we look to the future, we see fine-tuning these models as the next logical step to ensure our agents understand the most obscure industrial dialects with deterministic accuracy." See systems/databricks-model-serving.
The RAG layer is built on Databricks Knowledge Assistant: "By utilizing an Information Extraction agent, we can structurally parse unstructured proprietary documents, turning raw text into actionable intelligence for the CPS Library."
MLflow lifecycle — LLM-as-Judge against concept drift¶
"We implemented a comprehensive evaluation strategy using 'LLM as a Judge' alongside manual labeling sessions. MLflow capabilities allowed us to constantly evaluate model performance to prevent concept drift." In production: "track token usage and infrastructure costs, identifying latency bottlenecks, and detecting potential bugs before they impact users."
The judge model is explicitly conservative: "mark each result as pass, looks correct, fail, looks wrong, or unknown, not enough information." This reads as three verdicts, each with a gloss: pass (looks correct), fail (looks wrong), and unknown (not enough information). All judge outputs persist in Delta tables; custom MLflow GenAI judges run structured evaluations from the collected samples. See concepts/llm-as-judge.
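Reading the quoted instruction as a three-way verdict (pass, fail, unknown), the monitoring side can be sketched as a tally plus a simple drift check. This does not use MLflow and is not the source's evaluation code; `summarize`, `drifted`, and the 0.05 tolerance are hypothetical.

```python
from collections import Counter
from typing import Optional

VERDICTS = ("pass", "fail", "unknown")

def summarize(verdicts: list[str]) -> dict:
    """Tally judge verdicts; 'unknown' is counted but excluded from the pass rate."""
    counts = Counter(verdicts)
    unexpected = set(counts) - set(VERDICTS)
    if unexpected:
        raise ValueError(f"unexpected verdicts: {sorted(unexpected)}")
    decided = counts["pass"] + counts["fail"]
    pass_rate: Optional[float] = counts["pass"] / decided if decided else None
    return {"pass": counts["pass"], "fail": counts["fail"],
            "unknown": counts["unknown"], "pass_rate": pass_rate}

def drifted(baseline_rate: float, current_rate: float, tolerance: float = 0.05) -> bool:
    """Flag a drop in pass rate larger than the (hypothetical) tolerance."""
    return (baseline_rate - current_rate) > tolerance
```

Keeping "unknown" out of the pass rate is a design choice in this sketch: it stops an abstaining judge from looking like a failing model, while the separate count still shows how often the judge lacked information.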
Transactional layer — Lakebase¶
"For the 'Library' to work, the data must be consistent and highly available. Claroty integrates Lakebase, a fully managed transactional data layer on Databricks. Lakebase is built on Postgres and provides the low-latency performance required for real-time queries while maintaining a seamless link to the broader Lakehouse for analytical processing, allowing strict constraints to make sure our data keeps its high quality and ensuring that asset mappings remain accurate even as configurations drift." See systems/lakebase.
HITL UI — Databricks Apps over Lakebase¶
"With the Databricks App and Lakebase, we enable a transparent view and a seamless 'human-in-the-loop' feedback cycle. This intuitive interface allows domain experts to review classifications, correct and enrich entities, and feed high-fidelity, validated data back into our MLflow pipelines and R&D migration." The app uses a modern UI framework (React or Streamlit) for the frontend and Lakebase for transactional workloads, and it is hosted entirely inside the Databricks workspace under Unity Catalog identity. See systems/databricks-apps.
Pipeline orchestration — Lakeflow Jobs + AI Functions¶
"To handle the vast amount of information from various sources, Claroty uses Lakeflow Jobs to orchestrate the full process — from raw data to a well structured table. One of our pipelines orchestrates an ETL process that parses CSAF, a JSON formatted security advisory, into a tabular structure. In this process, each step reads and writes entries into a dedicated delta table. In this ETL, and in many more use cases, we use LLMs to enrich the data — from classification tasks and AI Functions like ai_query, using various Serving endpoints and MLflow to evaluate the answers we get from the LLM." See systems/lakeflow-jobs and systems/databricks-ai-functions.
Reported outcomes¶
- Vulnerability attribution accuracy +25% — "By identifying specific sub-components and firmware trees, the library has improved the accuracy of identifying vulnerabilities by 25%."
- Newly surfaced firmware risk on 56% of devices — "In early tests, 56% of analyzed devices received new or updated security recommendations for outdated firmware that were previously invisible to security teams."
- Deterministic traceability — "Even when a device reports minimal data, the library uses statistical inference and domain-guided logic to triangulate its exact identity."
(Source: sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library)
Known operational friction¶
The source surfaces one explicit production-cost concern:
"One area of strategic focus is the cost-efficiency of our Vector Search indices. While the performance is world-class, the current lack of a 'scale-to-zero' model for vector endpoints — a nuance particularly relevant for the bursty, event-driven nature of industrial security data — requires us to design specific architectural patterns to maintain high ROI during idle periods."
Canonicalised on the wiki as concepts/vector-search-no-scale-to-zero. The architectural patterns Claroty uses to mitigate this are not disclosed.
Open architectural questions for future ingests¶
The source describes the architecture only at the level of agent roles and platform components. The following details are reserved for future Claroty disclosures:
- Agent count and per-role model selection — how many NLP agents, which model family per role, prompt structure per role.
- Mapping-registry update cadence — how new mapping rules are authored, validated, and promoted; rollback story when a registry update misclassifies prior assets.
- HITL feedback loop closure — sample volume per cycle, cadence, and measured accuracy delta per retrain.
- Multi-tenancy / per-customer data residency — how the shared 17M-asset catalog composes with per-xDome-customer data isolation.
- Vector-search architectural patterns — what specific patterns Claroty uses to compensate for no scale-to-zero on vector endpoints.
Related¶
- concepts/entity-resolution — the problem class.
- concepts/medallion-architecture · concepts/delta-change-data-feed · concepts/schema-evolution — the data substrate.
- concepts/llm-as-judge — production monitoring substrate.
- concepts/vector-search-no-scale-to-zero — surfaced operational friction.
- patterns/hybrid-classical-er-plus-genai — the architectural pattern this system instantiates.
- patterns/orchestrated-multi-agent-entity-resolution — the multi-agent decomposition pattern.
- systems/databricks-model-serving · systems/mlflow · systems/lakebase · systems/databricks-apps · systems/lakeflow-jobs · systems/databricks-ai-functions — the Databricks primitives composed into the system.
- companies/databricks — co-marketed publication venue.