Skip to content

PATTERN Cited by 1 source

Set algebra on Theta sketches

Context

Marketing measurement, audience analytics, and incrementality testing routinely ask questions shaped like:

  • "How many users saw your Super Bowl ad but not your Instagram campaign?" — exclusive reach.
  • "How many users saw both campaigns?" — overlap.
  • "What is the total reach across all campaigns?" — union.
  • "How many users saw channel A this week that didn't see it last week?" — difference over time.

These are fundamentally set-algebra questions over large distinct-user sets. Exact computation requires collecting every user ID into memory and performing UNION + JOIN, potentially shuffling billions of identifiers across the cluster. At scale, this is "impractical or impossible" (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics).

HyperLogLog sketches handle union well but intersection via inclusion-exclusion is unreliable for small overlaps — and difference is even worse.

Pattern

Build a Theta sketch per logical set (campaign, channel, segment, day), store the sketches, and evaluate set-algebra queries — union, intersection, difference — locally on the compact binary sketches in microseconds.

Shape

-- Build: one Theta sketch per (campaign, day)
CREATE TABLE campaign_audience (
  campaign_id   STRING,
  day           DATE,
  theta_sketch  BINARY
) USING DELTA;

INSERT INTO campaign_audience
SELECT campaign_id, day, theta_sketch_agg(user_id)
FROM exposures
GROUP BY campaign_id, day;

Queries

-- Total reach across campaigns A, B, C
SELECT theta_sketch_estimate(
  theta_union(theta_sketch)
) AS total_reach
FROM campaign_audience
WHERE campaign_id IN ('A', 'B', 'C');

-- Overlap: users who saw both A and B
WITH a AS (
  SELECT theta_union(theta_sketch) AS sketch
  FROM campaign_audience WHERE campaign_id = 'A'
),
b AS (
  SELECT theta_union(theta_sketch) AS sketch
  FROM campaign_audience WHERE campaign_id = 'B'
)
SELECT theta_sketch_estimate(
  theta_intersection(a.sketch, b.sketch)
) FROM a, b;

-- Exclusive reach: saw A but not B
SELECT theta_sketch_estimate(
  theta_difference(a.sketch, b.sketch)
) FROM a, b;

The critical property (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics):

"The set operations happen locally in microseconds."

Each sketch is kilobytes, each operation is an in-process arithmetic over the retained sample — not a cluster-wide shuffle of raw user IDs.

Consequences

Good:

  • Audience overlap queries go from batch to live. What used to be "we'll have the overlap report tomorrow" becomes a live dashboard.
  • Daily reach curves become practical. Rolling 30-day unique-user reach is a union over 30 daily sketches.
  • Cross-channel deduplication is trivial. Union the per- channel sketches to get total unique audience.
  • Incrementality measurement is a difference. Users exposed to treatment minus users in control, over any time window.
  • The queries are composable. (A ∪ B) ∩ C = the audience that saw C and at least one of A or B.

Tradeoffs:

  • Theta sketches approximate cardinality. Error on intersection can be larger than on union when the intersection is small relative to the inputs — a known property of sample- based intersection estimation. Not a concern for most overlap questions; significant for "did these two niche audiences intersect at all?" types of queries.
  • Set algebra has composition limits. Chaining many operations accumulates error; the DataSketches library provides guidance on which chains are stable.
  • BLOB storage cost. Per-(campaign, day) sketch adds up; for campaigns in the hundreds and days in the thousands, the sketch column is non-trivial.

Why Theta, not HyperLogLog

A common misconception is that HLL handles this case. It does not — HLL's union is excellent, but:

  • HLL intersection via inclusion-exclusion (|A∩B| = |A| + |B| − |A∪B|) has error that is a function of all three cardinalities, which blows up when the overlap is small.
  • HLL difference has no stable definition.

Theta sketches are built on a retained random sample, which supports intersection and difference natively with well-understood error behaviour.

This distinction is the architectural reason Databricks shipped Theta sketch functions alongside its existing HLL-style approx_count_distinct.

Storage integration

This pattern is a specialisation of patterns/precomputed-sketch-column-in-delta-table — the sketches are mergeable Theta sketches stored as BLOB columns. The read-side operators (theta_union, theta_intersection, theta_difference) are what distinguish "set algebra for audience overlap" from "precompute percentiles and merge".

Seen in

Last updated · 438 distilled / 1,268 read