Skip to content

Databricks — How to Build Real-Time Fraud Detection using Spark Real-Time Mode and Lakebase

Databricks Engineering / Solution Accelerator launch post (2026-05-19) that ships an open-source reference implementation of an end-to-end real-time card-fraud-detection system built entirely on the Databricks Platform. The two load-bearing primitives are Real-Time Mode (RTM)Spark Structured Streaming's sub-second processing mode — and Lakebase as the online feature serving layer. Open-source reference implementation lives at databricks-industry-solutions/rtm-fraud-detection.

This post is a Tier-3 borderline-include: launch-post genre with a solution-accelerator framing (skip-signal: "introducing" / "open-source ready to deploy"), but the architecture density is comfortably above the AGENTS.md 20% floor — explicit p50/p99 latency numbers, a named stateful streaming operator (transformWithState) with TTL-bounded per-key state, an explicit Python-dictionary-vs-broadcast-join enrichment choice, the foreach-sink-to-Lakebase deployment shape for online features, an explicit weighted-multi-signal explainable scoring mechanism, and a head-to-head same-engine vs bolt-on Flink positioning anchored to a "up to 92% faster than Apache Flink" benchmark and a Coinbase production deployment running "over 250 ML features" at "sub-100ms P99 processing latencies."

Net new wiki contribution: 2 new systems + 4 new concepts + 4 new patterns + extensions on Apache Spark, Apache Flink, Lakebase, MLflow, Databricks Apps, Kafka, Databricks Asset Bundles, and the Databricks company page.

One-paragraph summary

Card fraud must be blocked in the sub-second window between authorization and settlement; meeting that latency historically forced teams to bolt a specialised streaming engine ( Apache Flink) onto an otherwise Spark-based data platform, doubling operational surface and splitting governance. Databricks' answer is Real-Time Mode for Spark Structured Streaming — sub-300ms processing inside the existing Spark engine, with no new on-call rotation, no logic drift between offline and online, and a one-line trigger flip to revert to slower, cheaper micro-batch when sub-second freshness is not worth the cost. The reference accelerator wires this into a four-stage pipeline — Kafka → RTM (parse + velocity tracking via transformWithState + enrichment + weighted-multi-signal scoring + routing) → Kafka / LakebaseStreamlit Databricks App — and demonstrates that everything runs on one platform: same Spark engine for batch ETL + ML training + sub-300ms streaming; same Unity Catalog for governance; same MLflow for model lineage in batch or real-time inference. End-to-end latency on the Kafka-in-Kafka-out pipeline is P50 < 40ms, P99 215–392ms across varying TPS levels.

Key takeaways

  1. Sub-second Spark Structured Streaming is now first-class. "RTM is an evolution of the Spark Structured Streaming engine that enables sub-second data processing for latency-sensitive operational applications such as feature engineering." Same engine semantics, different trigger; "switching back to a slower trigger is the same one-line code change in the other direction. No need to spend time manually tuning parallelism or orchestrating the shutdown and restart of computing resources." This is the forcing-function answer to speed-vs-simplicity in fraud detection — the prior industry default of bolting Apache Flink alongside Spark "doubles your operational complexity." (Source: sources/2026-05-19-databricks-how-to-build-real-time-fraud-detection-using-spark-rtm-and-lakebase)

  2. RTM is positioned as up-to-92%-faster than Apache Flink across stateless transforms, join-based enrichment, and aggregation workloads. "RTM processes events in milliseconds and is up to 92% faster than Apache Flink across stateless transformation, join-based enrichment, and aggregation workloads." This is a strong positioning claim against the de-facto streaming-engine default — pinned at the wiki tier as Databricks' own benchmark, not an independent third-party study; cross-link to systems/apache-flink which carries this counter-claim. (Source: sources/2026-05-19-databricks-how-to-build-real-time-fraud-detection-using-spark-rtm-and-lakebase)

  3. Coinbase is named as a production RTM customer running over 250 ML features at sub-100ms P99 processing latency. "Customers such as Coinbase are already using RTM to compute over 250 ML features, and have achieved sub-100ms P99 processing latencies." This is the canonical disclosure of an RTM production deployment at a named, regulated customer — pinned as a third-party validation point on the RTM page. (Source: sources/2026-05-19-databricks-how-to-build-real-time-fraud-detection-using-spark-rtm-and-lakebase)

  4. transformWithState is the load-bearing primitive for per-card stateful streaming. "Velocity tracking … using transformWithState (Spark's powerful operator for building arbitrary or custom stateful transformations), the pipeline maintains per-card state across the stream: how many transactions has this card made in the last 60 seconds? A card that suddenly fires five transactions in a minute is exhibiting classic card-testing behavior. The state auto-expires via TTL, so there's no unbounded memory growth and no manual cleanup." This is the wiki's first canonicalisation of transformWithState as a named Spark operator with TTL semantics for streaming velocity counters. (Source: sources/2026-05-19-databricks-how-to-build-real-time-fraud-detection-using-spark-rtm-and-lakebase)

  5. Python-dictionary lookups beat broadcast joins in latency-sensitive streaming enrichment. "These lookups use Python dictionaries rather than broadcast joins, avoiding the BroadcastExchange overhead that can add latency in streaming pipelines." This is a small but specific micro-pattern: when the lookup table is small and stable enough to ship to executors as a Python dict, skip the distributed BroadcastExchange. Pinned as patterns/python-dict-lookup-over-broadcast-join. (Source: sources/2026-05-19-databricks-how-to-build-real-time-fraud-detection-using-spark-rtm-and-lakebase)

  6. Explainability via weighted-multi-signal scoring as a regulatory substrate. "Scoring combines five weighted fraud signals: velocity, geographic anomaly, amount deviation, merchant category risk, and country risk, into a single 0-100 score. Each signal is computed by a dedicated UDF, and the weights are configurable. The result is an explainable score: you can see exactly which signals contributed and by how much." The architectural shape of one UDF per signal + configurable weights + per-signal contribution surfaced is the audit-friendly substrate for regulated fraud-ops workflows; pinned as concepts/explainable-multi-signal-fraud-score. (Source: sources/2026-05-19-databricks-how-to-build-real-time-fraud-detection-using-spark-rtm-and-lakebase)

  7. Lakebase is the online feature-serving layer; foreach-sink is the deployment shape. "Using Spark Structured Streaming's foreach sink with a custom LakebaseFeatureWriter, the pipeline continuously streams per-card features, velocity patterns, average transaction amounts, geographic spread, all directly into Lakebase tables with upsert semantics. Lakebase provides sub-millisecond reads, making it ideal for real-time feature serving without managing external infrastructure." New canonical Lakebase face: online feature store with sub-millisecond read SLO. Pinned as patterns/streaming-features-via-foreach-to-operational-db and concepts/online-feature-serving. (Source: sources/2026-05-19-databricks-how-to-build-real-time-fraud-detection-using-spark-rtm-and-lakebase)

  8. MLflow loads RandomForest as a Spark UDF for in-pipeline inference; same model lineage as offline training. "A RandomForest classifier is trained on historical labeled data using MLflow for experiment tracking and model versioning. The trained model is loaded as a Spark UDF and applied to every transaction in the streaming pipeline." New MLflow face on the wiki: load-as-Spark-UDF for real-time-streaming inference (alongside the previously-pinned batch-inference + LLM-judge + drift-monitoring faces). (Source: sources/2026-05-19-databricks-how-to-build-real-time-fraud-detection-using-spark-rtm-and-lakebase)

  9. Single-platform thesis is the explicit closing claim. "The key insight is that everything runs on one platform. The same Spark engine that powers your batch ETL and ML training now handles sub-300ms streaming. Unity Catalog now governs both your streaming tables and your training data. MLflow now tracks your fraud models, whether they're used in batch inference or real-time scoring. There's no integration gap, no governance split, and no second stack to maintain because everything is on the same platform." Pinned as patterns/single-platform-real-time-fraud-detection — the end-to-end manifest of one engine + one catalog + one model tracker + one app substrate. (Source: sources/2026-05-19-databricks-how-to-build-real-time-fraud-detection-using-spark-rtm-and-lakebase)

Architectural diagram (textual)

The accelerator's high-level data flow as described in the post:

                     ┌─────────────────────┐
                     │  Kafka (Source)     │  ← raw transaction events
                     │  card_id, amount,   │    (Kafka or Kinesis)
                     │  merchant, geo, ch  │
                     └──────────┬──────────┘
        ┌──────────────────────────────────────────────────┐
        │  Spark Structured Streaming — Real-Time Mode     │
        │  trigger=AvailableNow → Real-Time Mode trigger   │
        │  ┌─────────────────────────────────────────┐    │
        │  │  Stage 1: Parse                         │    │
        │  │  → JSON → typed columns                 │    │
        │  ├─────────────────────────────────────────┤    │
        │  │  Stage 2: Velocity tracking             │    │
        │  │  → transformWithState                   │    │
        │  │  → per-card state, TTL-expiring         │    │
        │  ├─────────────────────────────────────────┤    │
        │  │  Stage 3: Enrichment                    │    │
        │  │  → Python-dict lookups                  │    │
        │  │     (NOT broadcast joins)               │    │
        │  │  → merchant risk + cardholder profile   │    │
        │  ├─────────────────────────────────────────┤    │
        │  │  Stage 4: Scoring                       │    │
        │  │  → 5 weighted signal UDFs               │    │
        │  │  → 0–100 explainable score              │    │
        │  │  (advanced: MLflow RandomForest UDF)    │    │
        │  ├─────────────────────────────────────────┤    │
        │  │  Stage 5: Routing                       │    │
        │  │  → approved / flagged / blocked         │    │
        │  └─────────────────────────────────────────┘    │
        │                                                  │
        │  + foreach-sink → LakebaseFeatureWriter          │
        │    streams per-card features into Lakebase       │
        │    (upsert; sub-ms reads)                        │
        └────────┬────────────────────────────┬───────────┘
                 │                            │
                 ▼                            ▼
       ┌─────────────────────┐     ┌──────────────────────┐
       │  Kafka output       │     │  Lakebase            │
       │  approved /         │     │  per-card features   │
       │  flagged / blocked  │     │  (online store)      │
       │  topics             │     │                      │
       └─────────────────────┘     └──────────┬───────────┘
                                  ┌──────────────────────┐
                                  │  Databricks Apps     │
                                  │  (Streamlit)         │
                                  │  live fraud monitor  │
                                  │  → fraud analyst UI  │
                                  │  10s auto-refresh    │
                                  └──────────────────────┘

Operational numbers

  • End-to-end latency (Kafka-in-Kafka-out, RTM rules-based pipeline): "P50 latency under 40 ms and P99 latency ranging between 215-392 ms" across varying TPS levels.
  • RTM vs Flink benchmark (Databricks's own): "up to 92% faster than Apache Flink across stateless transformation, join-based enrichment, and aggregation workloads."
  • Coinbase production deployment: "over 250 ML features" at "sub-100ms P99 processing latencies."
  • Lakebase read latency for feature serving: "sub-millisecond reads."
  • Five weighted fraud signals scored: velocity, geographic anomaly, amount deviation, merchant category risk, country risk. Each computed by a dedicated UDF; weights configurable.
  • Industry context (Nilson Report 2024): "financial institutions lose an estimated $33 billion annually to fraudulent card transactions."
  • Dashboard refresh cadence: "auto-refreshing every 10 seconds."
  • Solution Accelerator deployment: Databricks Asset Bundles"just clone, deploy, and run; the bundle automatically provisions a correctly configured cluster and runs all notebooks in sequence."
  • RTM availability: "Generally Available on Databricks across AWS, Azure, and GCP."

Systems extracted

  • systems/spark-real-time-modeNEW. Spark Structured Streaming Real-Time Mode; sub-300ms processing inside the existing Spark engine; same Structured Streaming API surface; one-line trigger flip to/from micro-batch; positioned as Apache Flink alternative on the Databricks Platform; GA on AWS / Azure / GCP.
  • systems/databricks-rtm-fraud-acceleratorNEW. Open-source reference implementation of a real-time card-fraud-detection system on Databricks; four progressive notebooks (Quick Start → Rules Pipeline → ML-Enhanced → Live Monitoring App); deployed as a Databricks Asset Bundle.
  • systems/apache-sparkEXTENDED. New face: Real-Time Mode evolution of Structured Streaming; sub-second processing inside the same Spark engine.
  • systems/apache-flinkEXTENDED. New positioning entry: Databricks RTM positions itself as "up to 92% faster than Apache Flink" and reduces the operational case for bolting Flink alongside Spark on a Databricks-centric stack.
  • systems/kafkaEXTENDED. New face: Kafka-in-Kafka-out topology with Spark RTM as the processor; routing decisions (approved / flagged / blocked) write to separate output topics.
  • systems/lakebaseEXTENDED. New face: online feature serving with sub-millisecond reads; foreach-sink-to-Lakebase deployment shape canonicalised; complements the previously-pinned decision-support-app-state-store + DBA-automation + in-workspace-app-state-store faces.
  • systems/mlflowEXTENDED. New face: load model as Spark UDF for real-time streaming inference; same MLflow lineage as offline batch training.
  • systems/databricks-appsEXTENDED. New face: live monitoring dashboard for streaming fraud-ops; Streamlit frontend with Lakebase resource binding; 10-second auto-refresh.
  • systems/databricks-asset-bundlesEXTENDED. New face: bundle-as-solution-accelerator-deployment-vehicle; one-command clone + deploy + run for a four-notebook end-to-end pipeline.
  • systems/unity-catalog — referenced once as the unified governance plane across streaming + training data; not extended in this ingest.

Concepts extracted

  • concepts/streaming-feature-engineering-as-stateful-operatorNEW. The architectural shape of using a stateful streaming operator (here: transformWithState) to compute per-key, TTL-bounded features inside the streaming pipeline rather than fetching them from an external store. The state auto-expires; no unbounded memory growth; no manual cleanup. This is the canonical streaming-feature-engineering-in-the-engine primitive.
  • concepts/online-feature-servingNEW. The concept of a low-latency operational store optimised for sub-millisecond reads of pre-computed features at inference time, sitting between a streaming feature-engineering pipeline (writer) and a real-time scoring pipeline (reader). Pinned with Lakebase as the canonical managed-Postgres instance, sub-millisecond read SLO.
  • concepts/single-engine-streaming-consolidationNEW. The design choice of consolidating real-time stream processing into the same engine that runs batch ETL and ML training, rather than bolting a specialised streaming engine alongside. Trade-off: operational simplicity + no logic drift vs raw streaming-specific optimisations of a specialist engine.
  • concepts/explainable-multi-signal-fraud-scoreNEW. The shape of computing N independent fraud signals via N dedicated UDFs, combining via configurable weights into a single 0-100 score, and surfacing per-signal contributions for audit. The architectural answer to "why was this transaction blocked?" in a regulated fraud-ops workflow.
  • concepts/sub-second-stream-processingNEW. The latency regime where stream processing must complete in hundreds of milliseconds rather than seconds; the regime that historically forced specialist engines (Flink) and now is reachable inside Spark via RTM. Pinned with the post's P50 < 40ms / P99 215-392ms numbers as canonical end-to-end latency disclosures.

Patterns extracted

  • patterns/single-platform-real-time-fraud-detectionNEW. End-to-end on one platform: same Spark engine for batch ETL + ML training + sub-second streaming; same Unity Catalog for governance; same MLflow for model lineage in batch and real-time; Lakebase as the online feature store; Databricks Apps as the monitoring UI. The full architectural manifest of "no integration gap, no governance split, no second stack to maintain."
  • patterns/transformwithstate-per-key-velocity-trackingNEW. Using Spark Structured Streaming's transformWithState operator to maintain TTL-bounded per-key counters across a streaming pipeline; canonical instance is per-card transaction velocity for fraud detection (60-second window); generalises to any per-entity rolling-window stateful aggregation that needs bounded memory and zero manual cleanup.
  • patterns/streaming-features-via-foreach-to-operational-dbNEW. Using Spark Structured Streaming's foreach sink with a custom writer to continuously upsert per-key features into a low-latency operational database; canonical instance is a custom LakebaseFeatureWriter writing per-card features (velocity patterns, average transaction amounts, geographic spread) directly into Lakebase tables with upsert semantics for sub-millisecond read serving.
  • patterns/python-dict-lookup-over-broadcast-joinNEW. When a lookup table is small and stable enough, embed it as a Python dictionary on each executor and lookup in-process during streaming enrichment, rather than using a Spark broadcast join. Avoids the BroadcastExchange overhead that materially impacts latency in sub-second streaming pipelines.

The four progressive accelerator notebooks

The post structures the accelerator as a four-step ramp; each step adds platform capability:

  1. RTM_00_Quick_Start.py — synthetic transactions via Spark's built-in rate source; "hello world" for RTM; no Kafka, no external setup; "under five minutes".
  2. RTM_01_Introduction_fraud_detection.py — production-grade five-stage pipeline (parse → velocity tracking via transformWithState → enrichment → weighted-signal scoring → routing); Kafka in / Kafka out; rules-based explainable score; P50 < 40 ms, P99 215–392 ms.
  3. RTM_02_Advanced_fraud_detection_ml.py — adds Lakebase as the online feature store via foreach sink + custom LakebaseFeatureWriter; trains a RandomForest in MLflow; loads the model as a Spark UDF and applies to every transaction; ML-derived scores replace the rule-based weights in the routing stage.
  4. apps/ — Streamlit Databricks App with a Lakebase resource binding; live fraud monitoring dashboard auto-refreshing every 10 seconds; total transactions scored, decision breakdowns, recent fraud scores with card-level detail, fraud probability distributions.

The full bundle deploys as a Databricks Asset Bundle — "just clone, deploy, and run; the bundle automatically provisions a correctly configured cluster and runs all notebooks in sequence."

Caveats

  • Vendor benchmark, not independent. The "up to 92% faster than Apache Flink" claim is from a separate Databricks blog post (linked in this post) that the wiki has not ingested directly. Pinned on systems/spark-real-time-mode and systems/apache-flink with source attribution; treat as Databricks's positioning, not a third-party validation.
  • Benchmark scope. The 92% figure is "across stateless transformation, join-based enrichment, and aggregation workloads" — i.e. the workload mix this fraud-detection accelerator runs. Not generalisable to all stream-processing workloads (event-time windows, complex event-time-watermark semantics, exactly-once-Kafka-source-Kafka-sink with two-phase-commit, large RocksDB-backed keyed state in TB range — Flink's traditional strongest territory — are not in the benchmark scope).
  • transformWithState is Spark 4.x; not Spark 3.x. This is a recent Spark Structured Streaming primitive; teams on older Spark versions will not have it. The wiki notes this on the pattern page.
  • Solution Accelerator framing. The post is a launch announcement for an open-source reference implementation; "open source ready to deploy" is a marketing-adjacent framing. The wiki ingests it for the architecture content (RTM + transformWithState + Lakebase online-feature face + foreach-to-DB pattern + explainable-multi- signal scoring + single-platform thesis) — not for the accelerator-as-product framing.
  • Latency numbers are accelerator-internal, not Coinbase-internal. The P50 < 40ms / P99 215-392ms numbers are from Databricks' testing of the accelerator notebooks. The Coinbase numbers (250+ features, sub-100ms P99) come from the same blog and are not corroborated by a Coinbase engineering post on the wiki.
  • Tier-3 ingest. Per AGENTS.md Tier-3 guidance, this post passes the borderline-include test on architecture density (~30%+); named primitives (transformWithState, foreach-sink-to-Lakebase, Python-dict-vs-broadcast-join, weighted-multi-signal score, single- platform thesis) all carry their own canonical wiki home.

Source

Last updated · 542 distilled / 1,571 read