Databricks — Approximate Answers, Exact Decisions: New Sketch Functions for Analytics¶
Databricks product-engineering post (2026-04-29) announcing four new sketch function families in Databricks SQL / DataFrame / Structured Streaming — built on Apache DataSketches — that replace exact percentile, distinct-count, top-K, and distinct-count-plus-aggregate computations with bounded-memory, mergeable approximations at a configurable 1–2% relative error. The post's architectural argument is that many decision-support analytics queries do not require exact answers — "if knowing ~4.7M unique users ±1% leads to the same decision as 4,712,389 unique users, the approximate answer at a fraction of the cost is strictly better" — and that treating sketches as first-class, storable, mergeable columns in Delta Lake converts repeated-scan batch queries into millisecond-merge dashboard queries. Contributor mention: Christopher Boumalhab (cboumalh on GitHub) implemented the Theta and Tuple sketch function families in Apache Spark.
Key takeaways¶
-
Approximate answers are strictly better when the decision is the same. Databricks frames sketches through the lens of decision support vs. audit:
"Many analytical questions are decision-support, not audit. If knowing '~4.7M unique users ±1%' leads to the same decision as '4,712,389 unique users,' the approximate answer at a fraction of the cost is strictly better." The converse — where exactness is required — is called out as "Financial auditing, compliance reporting, or any use case where regulatory or business requirements demand precise values." (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
-
Four sketch families, four query classes, each replacing a full shuffle. The post maps each family to the exact-query failure mode it replaces:
| Family | Replaces | Exact cost | Sketch cost |
|---|---|---|---|
| concepts/kll-quantile-sketch | PERCENTILE(col, 0.99) | global sort over N rows | bounded-memory summary |
| concepts/theta-sketch | COUNT(DISTINCT user_id) + set ops | collect all IDs into memory, union/join | kilobyte binary blob + set algebra in microseconds |
| concepts/approximate-top-k-sketch | GROUP BY x ORDER BY COUNT(*) DESC LIMIT K | cluster-wide sort | mergeable bounded-memory counter |
| concepts/tuple-sketch | COUNT(DISTINCT) + SUM composed | full GROUP BY with dedup | one mergeable structure per period |
(Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
-
The real win is the workflow: build once at ETL, merge on read. The post repeatedly frames sketches as a storage primitive, not a query optimisation:
"Build them once during your daily ETL. Store them as columns in Delta tables. When a dashboard needs P50/P90/P99 for any time range, merge the precomputed sketches in milliseconds instead of rescanning raw data." This is the patterns/precomputed-sketch-column-in-delta-table pattern: each daily (or hourly) sketch column is small, mergeable, and requeryable. A "trending this week" dashboard becomes a merge of 168 precomputed hourly sketches rather than a scan of billions of raw events. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
-
KLL: quantile summaries with configurable error, extract many quantiles per pass. KLL sketches replace the global sort that PERCENTILE(response_time_ms, 0.99) forces on a billion-row table. Typical relative error is 1–2%, configurable. A single sketch yields multiple quantiles in one pass: "Extract multiple quantiles from a single sketch in one pass with kll_get_quantile_bigint(sketch, ARRAY(0.5, 0.9, 0.99))." Use cases called out: latency monitoring, capacity planning, anomaly detection. A quantile-extraction code sketch follows after this list. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
-
Theta sketches implement full set algebra in microseconds. Audience-overlap analysis (Super Bowl ad vs. Instagram campaign: total reach, overlap, exclusive reach) is the canonical use case. Theta sketches summarise a set of distinct values in bounded memory and support union, intersection, and difference:
"You generate compact binary objects measured in kilobytes, and the set operations happen locally in microseconds." Replaces the exact computation — "a UNION to deduplicate, then a JOIN to find overlap, possibly shuffling raw user IDs twice across your cluster" — with a mergeable in-memory operation. See patterns/set-algebra-on-theta-sketches. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
-
Approximate top-K sketches enable live leaderboards. For high-cardinality event streams (search logs, clickstreams), exact top-K is "a batch job, not a live query". Approximate top-K sketches track the most frequent items in bounded memory with an accepted tradeoff: "Rare items might be dropped, which is fine, because that's not what you're looking for." The approx_top_k_combine function enables streaming merges: a "trending this week" dashboard becomes a merge of 168 precomputed hourly sketches; in Structured Streaming, each micro-batch's sketch is merged into a running total for a live leaderboard. A merge example follows after this list. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
-
Tuple sketches fuse distinct-count and metric aggregation into one mergeable structure. The textbook failure mode: "customers appearing in both periods get double-counted and their revenue overstated." Tuple sketches map each distinct customer to an aggregated spend; merging across days deduplicates customer counts and sums revenue correctly, with no reprocessing from raw data when the date range changes. A conceptual illustration of the merge semantics follows after this list. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
-
Sketches work uniformly across SQL, DataFrame, and Structured Streaming; interoperable with the broader Apache DataSketches ecosystem. The post explicitly commits to ecosystem interoperability:
"All functions work in SQL, DataFrame, and Structured Streaming pipelines. Sketches created in Spark are interoperable with other systems in the Apache DataSketches ecosystem." Operationally this means a sketch built in a Spark ETL can be read by a C++ / Java / Druid consumer using the same library. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
-
Agent-discoverable via Genie Code. The post closes with a callout that Genie Code can recommend the right sketch family for a user's workload — a small but notable artifact of Databricks' pattern of threading its agentic product surface through every new capability. (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
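Illustrative code sketches¶
The post itself contains no code. The minimal sketches below use the open-source Apache DataSketches Python bindings (the datasketches package on PyPI) to illustrate the underlying concepts; they are not the Databricks SQL / DataFrame functions being announced, and all dataset, helper, and variable names are hypothetical. First, the build-once-at-ETL, merge-on-read workflow behind patterns/precomputed-sketch-column-in-delta-table, with serialized bytes standing in for a sketch column in a Delta table:

```python
# Build once at ETL: one KLL sketch per day, serialized as a binary blob
# (the stand-in here for a precomputed sketch column in a Delta table).
from datasketches import kll_floats_sketch

def build_daily_sketch(latencies_ms, k=200):
    """Summarize one day of raw latencies into a bounded-memory KLL sketch."""
    sk = kll_floats_sketch(k)   # k trades accuracy for sketch size
    for v in latencies_ms:
        sk.update(v)
    return sk.serialize()       # kilobyte-scale bytes, storable as a BLOB column

# Merge on read: answer P50/P90/P99 for any date range without rescanning raw data.
def quantiles_for_range(daily_blobs, ranks=(0.5, 0.9, 0.99)):
    merged = kll_floats_sketch(200)
    for blob in daily_blobs:
        merged.merge(kll_floats_sketch.deserialize(blob))
    return {r: merged.get_quantile(r) for r in ranks}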
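The multi-quantile extraction the post quotes as kll_get_quantile_bigint(sketch, ARRAY(0.5, 0.9, 0.99)) looks roughly like this in the Python bindings, shown against synthetic latencies with an exact comparison; note KLL guarantees bounded rank error rather than value error:

```python
import random
from datasketches import kll_floats_sketch

random.seed(7)
data = [random.lognormvariate(3.0, 1.0) for _ in range(200_000)]  # synthetic latencies

sk = kll_floats_sketch(200)   # larger k => tighter rank error, bigger sketch
for v in data:
    sk.update(v)

exact = sorted(data)
for rank in (0.5, 0.9, 0.99):
    approx = sk.get_quantile(rank)                 # one sketch, many quantiles
    true = exact[int(rank * (len(exact) - 1))]     # exact quantile for comparison
    print(f"p{int(rank * 100):>2}: approx={approx:10.2f}  exact={true:10.2f}")
```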
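Audience-overlap set algebra with Theta sketches (union, intersection, difference), under the assumption that each campaign's sketch was built independently; the user IDs are placeholders:

```python
from datasketches import (update_theta_sketch, theta_union,
                          theta_intersection, theta_a_not_b)

# One sketch per campaign, built independently (e.g., in separate ETL jobs).
tv = update_theta_sketch()
for uid in ("u1", "u2", "u3", "u4"):
    tv.update(uid)

ig = update_theta_sketch()
for uid in ("u3", "u4", "u5"):
    ig.update(uid)

# Set algebra happens locally on the compact sketches, not on raw IDs.
union = theta_union()
union.update(tv)
union.update(ig)

inter = theta_intersection()
inter.update(tv)
inter.update(ig)

tv_only = theta_a_not_b().compute(tv, ig)

print("total reach    ~", union.get_result().get_estimate())
print("overlap        ~", inter.get_result().get_estimate())
print("TV-only reach  ~", tv_only.get_estimate())
```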
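A "trending this week" merge over hourly frequent-items sketches, approximating the approx_top_k_combine workflow the post describes; the hourly input terms are synthetic:

```python
from datasketches import frequent_strings_sketch, frequent_items_error_type

# One frequency sketch per hour (e.g., one per micro-batch or ETL window).
def hourly_sketch(search_terms, lg_max_map_size=10):
    sk = frequent_strings_sketch(lg_max_map_size)
    for term in search_terms:
        sk.update(term)
    return sk

# "Trending this week" = merge 168 hourly sketches, then read the heavy hitters.
weekly = frequent_strings_sketch(10)
for hour in range(168):
    weekly.merge(hourly_sketch(["databricks", "sketches", f"rare-{hour}"]))

# Rare items (the per-hour singletons above) may be dropped; frequent ones survive.
for row in weekly.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES):
    print(row)  # (item, estimate, lower bound, upper bound)
```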
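Tuple-sketch merge semantics (deduplicate distinct keys, aggregate the attached metric), shown here with an exact dictionary stand-in rather than the sketch itself; a real tuple sketch keeps only a bounded sample of hashed keys, so both the distinct count and the summed metric become estimates:

```python
# Conceptual stand-in for a tuple sketch: map each distinct customer to an
# aggregated spend. Merging two periods deduplicates customers and sums revenue,
# avoiding the double counting the post warns about when COUNT(DISTINCT) and SUM
# are composed naively across overlapping periods.
def build(period_rows):
    summary = {}
    for customer_id, spend in period_rows:
        summary[customer_id] = summary.get(customer_id, 0.0) + spend
    return summary

def merge(a, b):
    out = dict(a)
    for customer_id, spend in b.items():
        out[customer_id] = out.get(customer_id, 0.0) + spend
    return out

q1 = build([("c1", 100.0), ("c2", 50.0)])
q2 = build([("c2", 70.0), ("c3", 30.0)])
both = merge(q1, q2)
print(len(both), sum(both.values()))  # 3 distinct customers, 250.0 revenue, no double count
```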
Operational numbers from the post¶
- Error: typical 1–2% relative error, configurable per sketch.
- Compute reduction: "orders-of-magnitude less compute" vs. exact computation (vendor framing, no benchmark table).
- Speedup: "a 1% error margin and a 1000x speedup is a welcome trade-off" (vendor framing).
- Dashboard freshness: 168 hourly sketch merges for a weekly trending view, instead of a scan of billions of raw events.
Systems surfaced¶
- systems/apache-datasketches — the underlying library (datasketches.apache.org) implementing KLL, Theta, Top-K, and Tuple sketch families. This post wires Databricks SQL / Spark into the ecosystem.
- systems/databricks — host runtime (SQL + DataFrame + Structured Streaming).
- systems/delta-lake — storage substrate where precomputed sketches live as columns.
- systems/apache-spark — execution engine; this post mentions community contribution of Theta + Tuple sketch function families to upstream Spark.
- systems/databricks-genie-code — agentic product surface that recommends the right sketch family.
Concepts extracted¶
- concepts/kll-quantile-sketch — configurable-error streaming quantile sketch.
- concepts/theta-sketch — distinct-value set-algebra sketch.
- concepts/approximate-top-k-sketch — bounded-memory heavy-hitter tracker with merge support.
- concepts/tuple-sketch — distinct + metric aggregation in one mergeable structure.
- concepts/mergeable-sketch — the underlying semigroup / associativity property that makes these sketches compose across shards and time windows.
- concepts/decision-support-vs-audit-query — the framing for when to accept an approximate answer.
Patterns extracted¶
- patterns/precomputed-sketch-column-in-delta-table — build sketches once during ETL, store as BLOB columns in Delta, merge on read.
- patterns/set-algebra-on-theta-sketches — audience overlap / incrementality / exclusive reach via union, intersection, and difference on compact binary objects.
- patterns/local-global-aggregation-split — pre-existing wiki pattern that this post maps directly onto (local = per-partition sketch, global = merge).
Caveats¶
- Vendor blog. Performance claims (1000× speedup, "orders of magnitude less compute") are self-reported and unbenchmarked in the post.
- Exact still wins for audit. Post is explicit about the boundary: financial auditing, compliance reporting, regulatory workloads should stay exact.
- Rare-item loss. Approximate top-K drops rare items by design; not a substitute for exact frequency counts on long-tail analytics.
- Error budget is not free. Even within the "1–2% relative error" envelope, some workloads break consumer expectations (billing, reporting that external parties see).
- Sketch size is still a budget. "Kilobyte binary objects" scales to billions of distinct users but the per-row column overhead in a Delta table is non-trivial; the post doesn't disclose numbers.
Source¶
- Original: https://www.databricks.com/blog/approximate-answers-exact-decisions-new-sketch-functions-analytics
- Raw markdown:
raw/databricks/2026-04-29-approximate-answers-exact-decisions-new-sketch-functions-for-a608f6a5.md
Related¶
- concepts/ddsketch-error-bounded-percentile — sibling relative-error percentile sketch family (Datadog lineage); similar guarantees, different algorithmic family.
- concepts/sketching-feature-store — Zalando's use of bloom-filter sketches as an ML serving store; a different sketch-as-storage application.
- concepts/sketch-as-mysql-binary-column — PlanetScale Insights pattern of storing DDSketch as a BLOB in MySQL; same sketch-as-column shape, different engine.
- concepts/local-global-aggregation-decomposition — the Flink-side articulation of build locally, merge globally that this post applies to batch + streaming Delta workloads.
- patterns/dual-granularity-rollup-tables — pre-existing rollup-table pattern; sketches extend it to percentile / distinct / top-K / distinct-plus-sum query classes.
- companies/databricks