SYSTEM Cited by 1 source

Apache DataSketches

Apache DataSketches (datasketches.apache.org) is a top-level Apache Software Foundation project providing a production-grade library of probabilistic data structures, called sketches, for approximate analytics at scale. It implements algorithms for quantiles, distinct counting with set algebra, heavy hitters, and distinct counting combined with per-key aggregation. All of them offer bounded memory, configurable error bounds, and the mergeable-sketch property, which lets sketches compose across shards, partitions, and time windows.

DataSketches predates its Databricks / Spark integration by years — it originated at Yahoo and has been adopted across Druid, Presto / Trino, Hive, Pinot, BigQuery, and streaming systems — but the [[sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics|2026-04-29 Databricks post]] is the first wiki ingest naming it as the underlying library.

Sketch families exposed in Databricks

| Family | Answers | Error shape | Databricks SQL example |
| --- | --- | --- | --- |
| concepts/kll-quantile-sketch | quantiles / percentiles | bounded relative-rank error | kll_sketch_agg_double |
| concepts/theta-sketch | set cardinality + set algebra | one-sided, relative | theta_sketch_agg |
| concepts/approximate-top-k-sketch | heavy hitters | bounded-memory counter | approx_top_k_accumulate |
| concepts/tuple-sketch | distinct-count + metric aggregation | Theta + per-key value | tuple_sketch_agg_integer |

The Databricks post notes that each of these functions has corresponding _combine and _get_* companions for merging sketches and extracting results from them.
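The build, combine, and extract lifecycle described above can be illustrated in plain Python. This is a minimal sketch of a heavy-hitters summary using the classic Misra-Gries algorithm, which is swapped in here for illustration and is not claimed to be the algorithm DataSketches or Databricks uses. The helper names accumulate, combine, and get_top are hypothetical and only mirror the shape of the SQL function families; they are not real Databricks or DataSketches calls.

```python
from collections import Counter


def accumulate(values, capacity=8):
    """Build a toy Misra-Gries summary with at most `capacity` counters."""
    counters = Counter()
    for value in values:
        if value in counters or len(counters) < capacity:
            counters[value] += 1
        else:
            # No free slot: decrement every counter, drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def combine(left, right, capacity=8):
    """Merge two summaries and shrink the result back to `capacity` counters."""
    merged = left + right
    if len(merged) > capacity:
        # Subtract the (capacity + 1)-th largest count from every counter.
        cutoff = sorted(merged.values(), reverse=True)[capacity]
        merged = Counter({k: c - cutoff for k, c in merged.items() if c > cutoff})
    return merged


def get_top(summary, n):
    """Extract the n items with the largest estimated counts."""
    return summary.most_common(n)


# Two partitions summarised independently, then combined (hypothetical data).
part_a = ["checkout"] * 50 + ["search"] * 30 + [f"rare-{i}" for i in range(20)]
part_b = ["checkout"] * 40 + ["home"] * 35 + [f"odd-{i}" for i in range(25)]
summary = combine(accumulate(part_a), accumulate(part_b))
print(get_top(summary, 3))
```

In this toy version the reported counts never exceed the true counts, and each undercounts by at most n / (capacity + 1) for n total input items, so frequent items survive both the per-partition pass and the merge.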

Design properties

  • Bounded memory. Every sketch has a fixed-size summary structure, parameterised by an accuracy knob. Unbounded input does not produce unbounded state.
  • Streaming-friendly. Updates are one-pass add(value) calls that retain no unbounded history.
  • Mergeable. merge(a, b) is associative and commutative — sketches can be combined across partitions, time periods, clusters, and even systems.
  • Serialisable to compact binary. Sketches are designed to be stored as BLOB columns, passed over the wire, and reconstructed later. This is what makes patterns/precomputed-sketch-column-in-delta-table viable.
  • Cross-language interoperability. Implementations in Java, C++, and Python use the same wire format — a sketch written by a Spark ETL is readable by a native Druid or Pinot ingestor.
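The properties above can be made concrete with a toy distinct-count sketch. The following is a minimal illustration using the k-minimum-values (KMV) technique, written from scratch; the class name KmvSketch, its byte layout, and the parameter k are assumptions made for this example and do not reflect the DataSketches API or its wire format.

```python
import hashlib
import struct


class KmvSketch:
    """Toy k-minimum-values distinct-count sketch (illustrative only)."""

    def __init__(self, k=256):
        self.k = k            # accuracy knob: larger k means lower error, more memory
        self.hashes = set()   # never holds more than k 64-bit hash values

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        """One-pass update: keep the hash only if it is among the k smallest."""
        h = self._hash(value)
        if h in self.hashes:
            return
        if len(self.hashes) < self.k:
            self.hashes.add(h)
            return
        largest = max(self.hashes)
        if h < largest:
            self.hashes.remove(largest)
            self.hashes.add(h)

    def merge(self, other):
        """Return a new sketch summarising the union of both inputs."""
        merged = KmvSketch(min(self.k, other.k))
        merged.hashes = set(sorted(self.hashes | other.hashes)[: merged.k])
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # fewer than k distinct hashes: exact
        kth_smallest = max(self.hashes)
        return (self.k - 1) / (kth_smallest / 2**64)

    def to_bytes(self):
        """Compact binary form: k, entry count, then the sorted hashes."""
        ordered = sorted(self.hashes)
        return struct.pack(f">II{len(ordered)}Q", self.k, len(ordered), *ordered)

    @classmethod
    def from_bytes(cls, blob):
        k, count = struct.unpack_from(">II", blob, 0)
        sketch = cls(k)
        sketch.hashes = set(struct.unpack_from(f">{count}Q", blob, 8))
        return sketch


# Two overlapping shards with 20,000 distinct users in total (hypothetical data).
a, b = KmvSketch(), KmvSketch()
for i in range(0, 12_000):
    a.add(f"user-{i}")
for i in range(8_000, 20_000):
    b.add(f"user-{i}")

union = a.merge(b)
restored = KmvSketch.from_bytes(union.to_bytes())
print(round(restored.estimate()), len(union.to_bytes()))
```

The state stays at k hashes however many values are added, the merge result does not depend on argument order, and the serialised form is a fixed-size blob that could sit in a binary column and be merged again later.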
