
Apache DataSketches

Apache DataSketches (datasketches.apache.org) is a top-level Apache Software Foundation project providing a production-grade library of probabilistic data structures, called sketches, for approximate analytics at scale. It implements algorithms for quantiles, distinct-count with set algebra, heavy hitters, and distinct-count-plus-aggregate, all with bounded memory, configurable error bounds, and the mergeable-sketch property that lets sketches compose across shards, partitions, and time windows.

DataSketches predates its Databricks / Spark integration by years — it originated at Yahoo and has been adopted across Druid, Presto / Trino, Hive, Pinot, BigQuery, and streaming systems — but the [[sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics|2026-04-29 Databricks post]] is the first wiki ingest naming it as the underlying library.

Sketch families exposed in Databricks

| Family | Answers | Error shape | Databricks SQL example |
| --- | --- | --- | --- |
| concepts/kll-quantile-sketch | quantiles / percentiles | bounded relative-rank error | `kll_sketch_agg_double` |
| concepts/theta-sketch | set cardinality + set algebra | one-sided, relative | `theta_sketch_agg` |
| concepts/approximate-top-k-sketch | heavy hitters | bounded-memory counter | `approx_top_k_accumulate` |
| concepts/tuple-sketch | distinct-count + metric aggregation | Theta + per-key value | `tuple_sketch_agg_integer` |

The Databricks post notes that each family has corresponding `_combine` and `_get_*` companion functions for merging sketches and extracting results.
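
To make the "bounded-memory counter" idea behind heavy-hitter sketches concrete, here is a minimal Python sketch of the classic Misra-Gries summary. It is an illustrative stand-in under stated assumptions, not the DataSketches or Databricks implementation: the class name `MisraGries`, its methods, and the sample stream are all hypothetical.

```python
class MisraGries:
    """Toy Misra-Gries heavy-hitter summary (illustrative only).

    Holds at most k - 1 counters regardless of stream length. Each stored
    count underestimates the true frequency by at most n / k, where n is
    the number of items seen so far.
    """

    def __init__(self, k):
        self.k = k
        self.counters = {}

    def add(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k - 1:
            self.counters[item] = 1
        else:
            # No free slot: decrement every counter and drop those at zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def top(self, n):
        # Return the n largest (item, approximate count) pairs.
        return sorted(self.counters.items(), key=lambda kv: -kv[1])[:n]


# Hypothetical stream: two frequent items mixed with many one-off items.
summary = MisraGries(k=10)
for i in range(100):
    for item in ["a"] * 5 + ["b"] * 3 + [f"rare-{2 * i}", f"rare-{2 * i + 1}"]:
        summary.add(item)

print(summary.top(2))
```

The summary never holds more than nine counters here, yet the two frequent items survive because each occurs more often than n / k times.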

Design properties

- Bounded memory. Every sketch has a fixed-size summary structure, parameterised by an accuracy knob. Unbounded input does not produce unbounded state.
- Streaming-friendly. Updates are one-pass `add(value)` calls; no history of the raw input is retained.
- Mergeable. `merge(a, b)` is associative and commutative, so sketches can be combined across partitions, time periods, clusters, and even systems.
- Serialisable to compact binary. Sketches are designed to be stored as BLOB columns, passed over the wire, and reconstructed later. This is what makes patterns/precomputed-sketch-column-in-delta-table viable.
- Cross-language interoperability. Implementations in Java, C++, and Python use the same wire format, so a sketch written by a Spark ETL is readable by a native Druid or Pinot ingestor.
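
The properties above can be illustrated with a toy k-minimum-values (KMV) distinct-count sketch in plain Python. This is a simplified sketch under stated assumptions, not the DataSketches API or its wire format: the class `KMVSketch`, its method names, and the byte layout are hypothetical and chosen only to show bounded state, one-pass updates, merging, and binary round-tripping.

```python
import hashlib
import struct


class KMVSketch:
    """Toy k-minimum-values distinct-count sketch (illustrative only)."""

    def __init__(self, k=256):
        self.k = k          # accuracy knob: larger k means lower error, more memory
        self.mins = set()   # at most k smallest hash values seen so far

    @staticmethod
    def _hash(value):
        # Map a value to a roughly uniform float in [0, 1] via SHA-256.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        return struct.unpack(">Q", digest[:8])[0] / 2.0**64

    def add(self, value):
        # One-pass update: retain the hash only if it is among the k smallest.
        self.mins.add(self._hash(value))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        # Union of retained hashes trimmed back to k; order of merging is irrelevant.
        merged = KMVSketch(min(self.k, other.k))
        merged.mins = set(sorted(self.mins | other.mins)[: merged.k])
        return merged

    def estimate(self):
        # Fewer than k retained hashes: the count is exact.
        if len(self.mins) < self.k:
            return float(len(self.mins))
        # Otherwise estimate from the k-th smallest hash value.
        return (self.k - 1) / max(self.mins)

    def to_bytes(self):
        # Hypothetical layout: k, count, then the retained hashes as doubles.
        vals = sorted(self.mins)
        return struct.pack(f">II{len(vals)}d", self.k, len(vals), *vals)

    @classmethod
    def from_bytes(cls, blob):
        k, n = struct.unpack_from(">II", blob)
        sketch = cls(k)
        sketch.mins = set(struct.unpack_from(f">{n}d", blob, 8))
        return sketch


# Two shards sketch disjoint halves of a stream; one sketch sees all of it.
shard_a, shard_b, whole = KMVSketch(), KMVSketch(), KMVSketch()
for i in range(5000):
    (shard_a if i % 2 == 0 else shard_b).add(f"user-{i}")
    whole.add(f"user-{i}")

# Ship shard_a as bytes, rebuild it, and merge with shard_b.
merged = KMVSketch.from_bytes(shard_a.to_bytes()).merge(shard_b)
print(merged.mins == whole.mins, round(merged.estimate()))
```

With equal `k`, merging the per-shard sketches yields the same retained state as sketching the whole stream directly, which is the property that lets a stored sketch column be rolled up later without rescanning raw data.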

Seen in

sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Databricks SQL / DataFrame / Structured Streaming add four new sketch function families backed by DataSketches. Community contributions: Christopher Boumalhab (cboumalh on GitHub) implemented the Theta and Tuple sketch function families in upstream Apache Spark, which Databricks' post explicitly calls out. First wiki ingest naming Apache DataSketches.

Related

concepts/probabilistic-data-structure — the broader family of structures DataSketches implements.
concepts/ddsketch-error-bounded-percentile — a sibling quantile-sketch library (Datadog lineage); similar guarantees, different implementation.
systems/databricks — first wiki consumer of DataSketches sketch functions.
systems/apache-spark — upstream OSS engine where the new Theta / Tuple sketch function families landed.
systems/delta-lake — storage substrate that hosts DataSketches BLOB columns.

Source

Project site: https://datasketches.apache.org/
