Apache DataSketches
Apache DataSketches (datasketches.apache.org) is a top-level Apache Software Foundation project providing a production-grade library of probabilistic data structures — sketches — for approximate analytics at scale. It implements algorithms for quantiles, distinct-count with set algebra, heavy hitters, and distinct-count-plus-aggregate, all with bounded memory, configurable relative error, and the mergeable-sketch property that lets sketches compose across shards, partitions, and time windows.
DataSketches predates its Databricks / Spark integration by years — it originated at Yahoo and has been adopted across Druid, Presto / Trino, Hive, Pinot, BigQuery, and streaming systems — but the [[sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics|2026-04-29 Databricks post]] is the first wiki ingest naming it as the underlying library.
Sketch families exposed in Databricks
| Family | Answers | Error shape | Databricks SQL example |
|---|---|---|---|
| concepts/kll-quantile-sketch | quantiles / percentiles | bounded relative-rank error | kll_sketch_agg_double |
| concepts/theta-sketch | set cardinality + set algebra | one-sided, relative | theta_sketch_agg |
| concepts/approximate-top-k-sketch | heavy hitters | bounded-memory counter | approx_top_k_accumulate |
| concepts/tuple-sketch | distinct-count + metric aggregation | Theta + per-key value | tuple_sketch_agg_integer |
The Databricks post notes each has corresponding _combine / _get_* companions for merging and extraction.
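The aggregate, combine, extract workflow behind these function families can be illustrated in plain Python. The following is a minimal sketch using a Misra-Gries style heavy-hitter summary chosen for illustration; it is not the DataSketches implementation, and the names `accumulate`, `combine`, `estimate`, and `CAPACITY` are hypothetical stand-ins rather than Databricks or DataSketches APIs.

```python
from collections import Counter

CAPACITY = 10  # hypothetical accuracy knob: maximum counters kept per summary


def accumulate(items, capacity=CAPACITY):
    """Aggregate step: build a bounded summary from one partition of raw values."""
    counters = Counter()
    for item in items:
        counters[item] += 1
        if len(counters) > capacity:
            # Over capacity: decrement every counter and drop those that hit zero.
            counters = Counter({key: n - 1 for key, n in counters.items() if n > 1})
    return counters


def combine(left, right, capacity=CAPACITY):
    """Combine step: merge two summaries without touching the raw data again."""
    merged = left + right
    if len(merged) > capacity:
        # Subtract the (capacity + 1)-th largest count so at most `capacity` remain.
        cutoff = sorted(merged.values(), reverse=True)[capacity]
        merged = Counter({key: n - cutoff for key, n in merged.items() if n > cutoff})
    return merged


def estimate(summary, k):
    """Extract step: read the k largest (item, approximate count) pairs."""
    return summary.most_common(k)


# Two partitions summarised independently, then merged and queried.
partition_1 = ["a"] * 500 + ["b"] * 300 + [f"x{i}" for i in range(200)]
partition_2 = ["a"] * 400 + ["c"] * 350 + [f"y{i}" for i in range(250)]

merged = combine(accumulate(partition_1), accumulate(partition_2))
top = estimate(merged, 3)
print(top)
```

In this toy version the reported counts can only underestimate the true frequencies, and the summary never holds more than `CAPACITY` counters regardless of how many rows each partition contains.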
Design properties
- Bounded memory. Every sketch has a fixed-size summary structure, parameterised by an accuracy knob. Unbounded input does not produce unbounded state.
- Streaming-friendly. One-pass, add(value) updates with no unbounded history.
- Mergeable. merge(a, b) is associative and commutative: sketches can be combined across partitions, time periods, clusters, and even systems.
- Serialisable to compact binary. Sketches are designed to be stored as BLOB columns, passed over the wire, and reconstructed later. This is what makes patterns/precomputed-sketch-column-in-delta-table viable.
- Cross-language interoperability. Implementations in Java, C++, and Python use the same wire format — a sketch written by a Spark ETL is readable by a native Druid or Pinot ingestor.
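The properties above can be illustrated with a toy k-minimum-values (KMV) distinct-count sketch in plain Python. This is a minimal sketch for illustration only, not the DataSketches implementation or wire format; the class name `KMVSketch`, the choice of BLAKE2b as the hash, and the byte layout are all assumptions made for this example.

```python
import hashlib
import struct


class KMVSketch:
    """Toy k-minimum-values distinct-count sketch (illustrative, hypothetical API)."""

    def __init__(self, k=256):
        self.k = k           # accuracy knob: larger k means lower error, more memory
        self.hashes = set()  # bounded state: at most k smallest 64-bit hashes

    @staticmethod
    def _hash(value):
        digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value):
        """One-pass update; keeps only the k smallest hashes seen so far."""
        h = self._hash(value)
        if h in self.hashes:
            return
        if len(self.hashes) < self.k:
            self.hashes.add(h)
            return
        largest = max(self.hashes)
        if h < largest:
            self.hashes.remove(largest)
            self.hashes.add(h)

    def merge(self, other):
        """Return a new sketch holding the k smallest hashes of both inputs."""
        merged = KMVSketch(min(self.k, other.k))
        merged.hashes = set(sorted(self.hashes | other.hashes)[: merged.k])
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # below capacity the count is exact
        kth = max(self.hashes) / 2**64      # k-th smallest hash scaled into (0, 1)
        return (self.k - 1) / kth

    def to_bytes(self):
        """Serialise as big-endian k, count, then the sorted hashes."""
        ordered = sorted(self.hashes)
        return struct.pack(f">II{len(ordered)}Q", self.k, len(ordered), *ordered)

    @classmethod
    def from_bytes(cls, blob):
        k, n = struct.unpack_from(">II", blob, 0)
        sketch = cls(k)
        sketch.hashes = set(struct.unpack_from(f">{n}Q", blob, 8))
        return sketch


# Three overlapping shards covering 40,000 distinct keys in total.
a, b, c = KMVSketch(), KMVSketch(), KMVSketch()
for i in range(0, 20000):
    a.add(f"user-{i}")
for i in range(10000, 30000):
    b.add(f"user-{i}")
for i in range(25000, 40000):
    c.add(f"user-{i}")

left = a.merge(b).merge(c)    # (a + b) + c
right = a.merge(b.merge(c))   # a + (b + c)
blob = left.to_bytes()        # fixed-size payload suitable for a BLOB column
restored = KMVSketch.from_bytes(blob)
print(len(blob), round(restored.estimate()))
```

Because the merge keeps the k smallest hashes of the union, the grouping and order of merges do not change the result, and the serialised payload stays the same size no matter how many values were added.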
Seen in
- sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Databricks SQL / DataFrame / Structured Streaming add four new sketch function families backed by DataSketches. Community contributions: Christopher Boumalhab (cboumalh on GitHub) implemented the Theta and Tuple sketch function families in upstream Apache Spark, which Databricks' post explicitly calls out. First wiki ingest naming Apache DataSketches.
Related
- concepts/probabilistic-data-structure — the broader family of structures DataSketches implements.
- concepts/ddsketch-error-bounded-percentile — a sibling quantile-sketch library (Datadog lineage); similar guarantees, different implementation.
- systems/databricks — first wiki consumer of DataSketches sketch functions.
- systems/apache-spark — upstream OSS engine where the new Theta / Tuple sketch function families landed.
- systems/delta-lake — storage substrate that hosts DataSketches BLOB columns.
Source
- Project site: https://datasketches.apache.org/