CONCEPT Cited by 1 source

Theta sketch

Definition

A Theta sketch is a streaming probabilistic data structure for distinct-value cardinality estimation that additionally supports full set algebra — union, intersection, and difference — in microseconds over compact binary summaries. It is part of the Apache DataSketches family.

Given a stream of values (e.g. user IDs):

  • Exact COUNT(DISTINCT user_id) + set ops require collecting every ID into memory and performing set operations, potentially shuffling billions of identifiers across a cluster.
  • A Theta sketch summarises the set of distinct values in bounded memory (kilobytes). Set operations — theta_union, theta_intersection, theta_difference — operate locally on the sketch bytes in microseconds.

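Internally, the sketch is a bounded random sample of the input hash space: it retains only hashes that fall below a retention threshold θ (the "theta" that gives the sketch its name). Because θ acts as a known sampling fraction, the distinct count is estimated as the number of retained hashes divided by θ, and set algebra reduces to ordinary set operations on the retained samples.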
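The retained-sample idea can be illustrated with a minimal Python sketch. This is not the Apache DataSketches implementation: the names hash64, build_sketch, and estimate are hypothetical, BLAKE2b stands in for whatever hash the library uses, and the sketch is built in one batch rather than updated incrementally. It keeps the k smallest distinct hashes, sets θ to the next-smallest hash, and estimates cardinality as the retained count divided by θ.

```python
import hashlib
import heapq

HASH_SPACE = 2**64  # hashes are integers in [0, 2**64); theta == HASH_SPACE means "1.0"


def hash64(value):
    """Deterministic 64-bit hash of a value's string form (illustrative choice)."""
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(values, k=1024):
    """Return (theta, retained): at most k smallest distinct hashes and their cutoff."""
    smallest = heapq.nsmallest(k + 1, {hash64(v) for v in values})
    if len(smallest) <= k:
        # At most k distinct values were seen, so nothing was discarded: exact mode.
        return HASH_SPACE, frozenset(smallest)
    # theta is the (k+1)-th smallest hash; the k hashes strictly below it are retained.
    return smallest[k], frozenset(smallest[:k])


def estimate(sketch):
    """Retained count divided by the sampling fraction theta / HASH_SPACE."""
    theta, retained = sketch
    return len(retained) * HASH_SPACE / theta


small = build_sketch(f"user-{i}" for i in range(500))
large = build_sketch(f"user-{i}" for i in range(100_000))
print("small (exact mode):", estimate(small))
print("large (estimated): ", round(estimate(large)))
```

With k = 1024 the relative standard error of such an estimator is roughly 1/sqrt(k), about 3%, while the sketch itself never holds more than k hashes regardless of input size.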

Why Theta over HyperLogLog

Both Theta and HyperLogLog estimate distinct cardinality, but:

  • HyperLogLog is a register-based sketch with excellent cardinality accuracy and cheap merge (union). It does not natively support intersection or difference — HLL intersection via inclusion-exclusion is notoriously unreliable for small overlaps.
  • Theta sketches are built around a retained random sample. They naturally support union, intersection, and difference with well-understood error behaviour.
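The small-overlap problem with inclusion-exclusion is easy to see with arithmetic. The figures below are illustrative assumptions, not measurements: two audiences of 10 million each, a true overlap of 50,000, and a 1% relative error on every cardinality estimate. The function name is hypothetical.

```python
def inclusion_exclusion_worst_case(size_a, size_b, size_union, rel_err):
    """Worst-case absolute error of |A| + |B| - |A u B| when each term is off by rel_err."""
    return rel_err * (size_a + size_b + size_union)


size_a = size_b = 10_000_000
true_overlap = 50_000
size_union = size_a + size_b - true_overlap  # 19,950,000

worst = inclusion_exclusion_worst_case(size_a, size_b, size_union, rel_err=0.01)
print(f"true overlap:     {true_overlap:,}")
print(f"worst-case error: {worst:,.0f}")
print(f"error / overlap:  {worst / true_overlap:.1f}x")
```

Under these assumptions the error bound is roughly eight times the quantity being estimated, because the error scales with the sizes of the sets and their union rather than with the overlap itself.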

Databricks' 2026-04-29 post pitches Theta specifically for audience-overlap analysis where intersection and difference are essential.

Canonical use cases

  • Audience / campaign overlap analysis. "How many users saw your Super Bowl ad but not your Instagram campaign?" Build a Theta sketch per campaign, then compute reach = union, overlap = intersection, exclusive = difference.
  • Incrementality measurement. In an A/B test, build one sketch for users exposed to treatment and one for control; the Theta difference (treatment minus control) estimates the audience reached only by treatment.
  • Cross-channel deduplication. Total unique users across email + push + SMS = union of per-channel Theta sketches.
  • Daily reach curves. Precompute a per-day Theta; union over the date window for rolling reach.
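The campaign-overlap case can be sketched in Python on a toy model of a Theta sketch. Everything here is an assumption for illustration: the helper names are hypothetical, the user IDs are synthetic, and the set operations are a simplified version of the retained-sample logic (take the smaller θ, then apply the plain set operation to hashes below it), not the Databricks or DataSketches code.

```python
import hashlib
import heapq

HASH_SPACE = 2**64


def hash64(value):
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(values, k=4096):
    smallest = heapq.nsmallest(k + 1, {hash64(v) for v in values})
    if len(smallest) <= k:
        return HASH_SPACE, frozenset(smallest)
    return smallest[k], frozenset(smallest[:k])


def estimate(sketch):
    theta, retained = sketch
    return len(retained) * HASH_SPACE / theta


def _combine(a, b, set_op):
    # Below the smaller theta, both sketches hold a complete sample of their inputs,
    # so an ordinary set operation on the retained hashes is valid there.
    theta = min(a[0], b[0])
    return theta, frozenset(h for h in set_op(a[1], b[1]) if h < theta)


def union(a, b):
    return _combine(a, b, frozenset.union)


def intersection(a, b):
    return _combine(a, b, frozenset.intersection)


def difference(a, b):
    return _combine(a, b, frozenset.difference)  # in a but not in b


# Synthetic audiences: 60,000 users each, 20,000 of them shared.
super_bowl = build_sketch(f"user-{i}" for i in range(0, 60_000))
instagram = build_sketch(f"user-{i}" for i in range(40_000, 100_000))

reach = estimate(union(super_bowl, instagram))             # true value: 100,000
overlap = estimate(intersection(super_bowl, instagram))    # true value: 20,000
exclusive = estimate(difference(super_bowl, instagram))    # true value: 40,000
print(round(reach), round(overlap), round(exclusive))
```

The printed figures are estimates and will differ from the true values by a few percent; the intersection is the noisiest of the three because fewer retained hashes fall inside it.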

Workflow: sketch per group, merge at query time

The pattern is:

  1. During daily ETL: theta_sketch_agg(user_id) GROUP BY campaign_id, day. Store the result as a BLOB column in a Delta Lake table, keyed by campaign_id, day.
  2. At query time: read the relevant rows and apply set-algebra functions (theta_union, theta_intersection, theta_difference) on the sketches locally.

Databricks' framing (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics):

"The set operations happen locally in microseconds."

This converts what would be a cluster-wide join + shuffle into an in-process arithmetic operation on kilobytes.
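The two-step workflow can be mimicked end to end in plain Python. This is a sketch under stated assumptions: a dict stands in for the Delta Lake table, run_etl stands in for the theta_sketch_agg ... GROUP BY step, the byte layout in to_blob is invented for this example, and all helper names are hypothetical.

```python
import hashlib
import heapq
from collections import defaultdict

HASH_SPACE = 2**64


def hash64(value):
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(values, k=1024):
    smallest = heapq.nsmallest(k + 1, {hash64(v) for v in values})
    if len(smallest) <= k:
        return HASH_SPACE, frozenset(smallest)
    return smallest[k], frozenset(smallest[:k])


def union(a, b):
    theta = min(a[0], b[0])
    return theta, frozenset(h for h in a[1] | b[1] if h < theta)


def estimate(sketch):
    theta, retained = sketch
    return len(retained) * HASH_SPACE / theta


def to_blob(sketch):
    """Made-up layout: 9 bytes of theta (it can equal 2**64), then 8 bytes per hash."""
    theta, retained = sketch
    return theta.to_bytes(9, "big") + b"".join(h.to_bytes(8, "big") for h in sorted(retained))


def from_blob(blob):
    theta = int.from_bytes(blob[:9], "big")
    hashes = (int.from_bytes(blob[i:i + 8], "big") for i in range(9, len(blob), 8))
    return theta, frozenset(hashes)


def run_etl(events, k=1024):
    """Step 1: one sketch per (campaign_id, day), stored as bytes."""
    groups = defaultdict(set)
    for campaign_id, day, user_id in events:
        groups[(campaign_id, day)].add(user_id)
    return {key: to_blob(build_sketch(users, k)) for key, users in groups.items()}


def rolling_reach(table, campaign_id, days):
    """Step 2: read the relevant rows and union them locally."""
    merged = (HASH_SPACE, frozenset())  # empty sketch, the identity for union
    for day in days:
        merged = union(merged, from_blob(table[(campaign_id, day)]))
    return estimate(merged)


# Synthetic events: 50,000 users per day, with windows that overlap across days.
events = (("A", d, f"user-{u}") for d in (1, 2, 3)
          for u in range(d * 20_000, d * 20_000 + 50_000))
table = run_etl(events)
reach = rolling_reach(table, "A", days=(1, 2, 3))  # true value: 90,000
print("bytes per stored sketch:", max(len(b) for b in table.values()))
print("estimated 3-day reach:  ", round(reach))
```

Each stored value stays at about 8 KB here no matter how many users a day contains, which is what lets the query-time step run in-process instead of shuffling raw IDs.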

The API shape in Databricks

  • theta_sketch_agg(col) — build a Theta sketch over col.
  • theta_union(sketch_a, sketch_b) — union.
  • theta_intersection(sketch_a, sketch_b) — intersection.
  • theta_difference(sketch_a, sketch_b) — difference.
  • theta_sketch_estimate(sketch) — return distinct-count estimate.

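Union and intersection are associative and commutative, so sketches can be combined across partitions, days, or clusters in any grouping or order. Difference is the exception: theta_difference(sketch_a, sketch_b) is order-sensitive (A minus B) and cannot be regrouped freely.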
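Regrouping for union and intersection can be checked directly on a toy model. The helpers below are hypothetical and simplified (the combined result is not trimmed back to k entries, which keeps the merge exactly order-independent in this model); they are not the Databricks functions.

```python
import hashlib
import heapq
from functools import reduce

HASH_SPACE = 2**64


def hash64(value):
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(values, k=512):
    smallest = heapq.nsmallest(k + 1, {hash64(v) for v in values})
    if len(smallest) <= k:
        return HASH_SPACE, frozenset(smallest)
    return smallest[k], frozenset(smallest[:k])


def union(a, b):
    theta = min(a[0], b[0])
    return theta, frozenset(h for h in a[1] | b[1] if h < theta)


def intersection(a, b):
    theta = min(a[0], b[0])
    return theta, frozenset(h for h in a[1] & b[1] if h < theta)


# Three overlapping partitions of synthetic user IDs.
parts = [
    build_sketch(range(0, 60_000)),
    build_sketch(range(20_000, 80_000)),
    build_sketch(range(40_000, 100_000)),
]
a, b, c = parts

# Any grouping or ordering yields the same (theta, retained) pair.
same_union_grouping = union(union(a, b), c) == union(a, union(b, c))
same_union_order = reduce(union, parts) == reduce(union, reversed(parts))
same_intersection_grouping = (
    intersection(intersection(a, b), c) == intersection(a, intersection(b, c))
)
print(same_union_grouping, same_union_order, same_intersection_grouping)
```

In this model the result depends only on the smallest θ among the inputs and on which hashes fall below it, so the order in which partial results arrive does not matter.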

Seen in
