
Databricks — Approximate Answers, Exact Decisions: New Sketch Functions for Analytics

Databricks product-engineering post (2026-04-29) announcing four new sketch function families in Databricks SQL / DataFrame / Structured Streaming — built on Apache DataSketches — that replace exact percentile, distinct-count, top-K, and distinct-count-plus-aggregate computations with bounded-memory, mergeable approximations at a configurable 1–2% relative error. The post's architectural argument is that many decision-support analytics queries do not require exact answers — "if knowing ~4.7M unique users ±1% leads to the same decision as 4,712,389 unique users, the approximate answer at a fraction of the cost is strictly better" — and that treating sketches as first-class, storable, mergeable columns in Delta Lake converts repeated-scan batch queries into millisecond-merge dashboard queries. Contributor mention: Christopher Boumalhab (cboumalh on GitHub) implemented the Theta and Tuple sketch function families in Apache Spark.

Key takeaways

  1. Approximate answers are strictly better when the decision is the same. Databricks frames sketches through the lens of decision support vs. audit:

    "Many analytical questions are decision-support, not audit. If knowing '~4.7M unique users ±1%' leads to the same decision as '4,712,389 unique users,' the approximate answer at a fraction of the cost is strictly better." The converse — where exactness is required — is called out as "Financial auditing, compliance reporting, or any use case where regulatory or business requirements demand precise values." (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

  2. Four sketch families, four query classes, each replacing a full shuffle. The post maps each family to the exact-query failure mode it replaces:

  • concepts/kll-quantile-sketch — replaces PERCENTILE(col, 0.99); exact cost: global sort over N rows; sketch cost: bounded-memory summary
  • concepts/theta-sketch — replaces COUNT(DISTINCT user_id) + set ops; exact cost: collect all IDs into memory, union/join; sketch cost: kilobyte binary blob + set algebra in microseconds
  • concepts/approximate-top-k-sketch — replaces GROUP BY x ORDER BY COUNT(*) DESC LIMIT K; exact cost: cluster-wide sort; sketch cost: mergeable bounded-memory counter
  • concepts/tuple-sketch — replaces COUNT(DISTINCT) + SUM composed; exact cost: full GROUP BY with dedup; sketch cost: one mergeable structure per period

(Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

  3. The real win is the workflow: build once at ETL, merge on read. The post repeatedly frames sketches as a storage primitive, not a query optimisation:

    "Build them once during your daily ETL. Store them as columns in Delta tables. When a dashboard needs P50/P90/P99 for any time range, merge the precomputed sketches in milliseconds instead of rescanning raw data." This is the patterns/precomputed-sketch-column-in-delta-table pattern: each daily (or hourly) sketch column is small, mergeable, and requeryable. A "trending this week" dashboard becomes a merge of 168 precomputed hourly sketches rather than a scan of billions of raw events. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

  4. KLL: quantile summaries with configurable error; extract many quantiles per pass. KLL sketches replace the global sort that PERCENTILE(response_time_ms, 0.99) forces on a billion-row table. Typical relative error is 1–2%, configurable. From a single sketch you can extract multiple quantiles in one pass:

    "Extract multiple quantiles from a single sketch in one pass with kll_get_quantile_bigint(sketch, ARRAY(0.5, 0.9, 0.99))." Use cases called out: latency monitoring, capacity planning, anomaly detection. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

  5. Theta sketches implement full set algebra in microseconds. Audience-overlap analysis (Super Bowl ad vs. Instagram campaign: total reach, overlap, exclusive reach) is the canonical use case. Theta sketches summarise a set of distinct values in bounded memory and support union, intersection, and difference:

    "You generate compact binary objects measured in kilobytes, and the set operations happen locally in microseconds." Replaces the exact computation — "a UNION to deduplicate, then a JOIN to find overlap, possibly shuffling raw user IDs twice across your cluster" — with a mergeable in-memory operation. See patterns/set-algebra-on-theta-sketches. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

  6. Approximate top-K sketches enable live leaderboards. For high-cardinality event streams (search logs, clickstreams), exact top-K is "a batch job, not a live query". Approximate top-K sketches track most-frequent items in bounded memory with an accepted tradeoff:

    "Rare items might be dropped, which is fine, because that's not what you're looking for." The approx_top_k_combine function enables streaming merges: a "trending this week" dashboard becomes a merge of 168 pre- computed sketches; for Structured Streaming, merge each micro- batch's sketch into a running total for a live leaderboard. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

  7. Tuple sketches fuse distinct-count and metric aggregation into one mergeable structure. The textbook failure mode: "customers appearing in both periods get double-counted and their revenue overstated." Tuple sketches map each distinct customer to an aggregated spend; merging across days deduplicates customer counts and sums revenue correctly — no reprocessing from raw data when the date range changes. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
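
    A sketch of the fused computation; every function name here (tuple_sketch_agg, tuple_union_agg, tuple_sketch_estimate, tuple_sketch_sum) is an assumption, since the post describes the behaviour rather than the API:

      -- Daily ETL: map each distinct customer to their aggregated spend.
      CREATE OR REPLACE TABLE daily_customer_spend AS
      SELECT order_date,
             tuple_sketch_agg(customer_id, order_total) AS spend_sketch  -- assumed name
      FROM orders
      GROUP BY order_date;

      -- Any date range, no reprocessing: customers active in both periods are
      -- counted once and their revenue summed rather than overstated.
      SELECT tuple_sketch_estimate(tuple_union_agg(spend_sketch)) AS distinct_customers,
             tuple_sketch_sum(tuple_union_agg(spend_sketch))      AS total_revenue
      FROM daily_customer_spend
      WHERE order_date BETWEEN DATE'2026-04-01' AND DATE'2026-04-14';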

  8. Sketches work uniformly across SQL, DataFrame, and Structured Streaming; interoperable with the broader Apache DataSketches ecosystem. The post explicitly commits to ecosystem interoperability:

    "All functions work in SQL, DataFrame, and Structured Streaming pipelines. Sketches created in Spark are interoperable with other systems in the Apache DataSketches ecosystem." Operationally this means a sketch built in a Spark ETL can be read by a C++ / Java / Druid consumer using the same library. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

  9. Agent-discoverable via Genie Code. The post closes with a callout that Genie Code can recommend the right sketch family for a user's workload — a small but notable artifact of Databricks' pattern of threading its agentic product surface through every new capability. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

Operational numbers from the post

  • Error: typical 1–2% relative error, configurable per sketch.
  • Compute reduction: "orders-of-magnitude less compute" vs. exact computation (vendor framing, no benchmark table).
  • Speedup: "a 1% error margin and a 1000x speedup is a welcome trade-off" (vendor framing).
  • Dashboard freshness: 168 hourly sketch merges for a weekly trending view, instead of a scan of billions of raw events.

Systems surfaced

  • systems/apache-datasketches — the underlying library (datasketches.apache.org) implementing KLL, Theta, Top-K, and Tuple sketch families. This post wires Databricks SQL / Spark into the ecosystem.
  • systems/databricks — host runtime (SQL + DataFrame + Structured Streaming).
  • systems/delta-lake — storage substrate where precomputed sketches live as columns.
  • systems/apache-spark — execution engine; this post mentions community contribution of Theta + Tuple sketch function families to upstream Spark.
  • systems/databricks-genie-code — agentic product surface that recommends the right sketch family.

Concepts extracted

  • concepts/kll-quantile-sketch
  • concepts/theta-sketch
  • concepts/approximate-top-k-sketch
  • concepts/tuple-sketch

Patterns extracted

  • patterns/precomputed-sketch-column-in-delta-table
  • patterns/set-algebra-on-theta-sketches

Caveats

  • Vendor blog. Performance claims (1000× speedup, "orders of magnitude less compute") are self-reported and unbenchmarked in the post.
  • Exact still wins for audit. Post is explicit about the boundary: financial auditing, compliance reporting, regulatory workloads should stay exact.
  • Rare-item loss. Approximate top-K drops rare items by design; not a substitute for exact frequency counts on long-tail analytics.
  • Error budget is not free. Even the "1-2% relative error" envelope has workloads where it breaks consumer expectations (billing, reporting that external parties see).
  • Sketch size is still a budget. "Kilobyte binary objects" scales to billions of distinct users but the per-row column overhead in a Delta table is non-trivial; the post doesn't disclose numbers.

Source

  • sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics
