PATTERN Cited by 1 source

Set algebra on Theta sketches¶

Context¶

Marketing measurement, audience analytics, and incrementality testing routinely ask questions shaped like:

"How many users saw your Super Bowl ad but not your Instagram campaign?" — exclusive reach.
"How many users saw both campaigns?" — overlap.
"What is the total reach across all campaigns?" — union.
"How many users saw channel A this week that didn't see it last week?" — difference over time.

These are fundamentally set-algebra questions over large distinct-user sets. Exact computation requires collecting every user ID into memory and performing UNION + JOIN, potentially shuffling billions of identifiers across the cluster. At scale, this is "impractical or impossible" (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics).

HyperLogLog sketches handle union well but intersection via inclusion-exclusion is unreliable for small overlaps — and difference is even worse.

Pattern¶

Build a Theta sketch per logical set (campaign, channel, segment, day), store the sketches, and evaluate set-algebra queries — union, intersection, difference — locally on the compact binary sketches in microseconds.

Shape¶

-- Build: one Theta sketch per (campaign, day)
CREATE TABLE campaign_audience (
  campaign_id   STRING,
  day           DATE,
  theta_sketch  BINARY
) USING DELTA;

INSERT INTO campaign_audience
SELECT campaign_id, day, theta_sketch_agg(user_id)
FROM exposures
GROUP BY campaign_id, day;

Queries¶

-- Total reach across campaigns A, B, C
SELECT theta_sketch_estimate(
  theta_union(theta_sketch)
) AS total_reach
FROM campaign_audience
WHERE campaign_id IN ('A', 'B', 'C');

-- Overlap: users who saw both A and B
WITH a AS (
  SELECT theta_union(theta_sketch) AS sketch
  FROM campaign_audience WHERE campaign_id = 'A'
),
b AS (
  SELECT theta_union(theta_sketch) AS sketch
  FROM campaign_audience WHERE campaign_id = 'B'
)
SELECT theta_sketch_estimate(
  theta_intersection(a.sketch, b.sketch)
) FROM a, b;

-- Exclusive reach: saw A but not B
SELECT theta_sketch_estimate(
  theta_difference(a.sketch, b.sketch)
) FROM a, b;

The critical property (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics):

"The set operations happen locally in microseconds."

Each sketch is kilobytes, each operation is an in-process arithmetic over the retained sample — not a cluster-wide shuffle of raw user IDs.

Consequences¶

Good:

Audience overlap queries go from batch to live. What used to be "we'll have the overlap report tomorrow" becomes a live dashboard.
Daily reach curves become practical. Rolling 30-day unique-user reach is a union over 30 daily sketches.
Cross-channel deduplication is trivial. Union the per- channel sketches to get total unique audience.
Incrementality measurement is a difference. Users exposed to treatment minus users in control, over any time window.
The queries are composable. (A ∪ B) ∩ C = the audience that saw C and at least one of A or B.

Tradeoffs:

Theta sketches approximate cardinality. Error on intersection can be larger than on union when the intersection is small relative to the inputs — a known property of sample- based intersection estimation. Not a concern for most overlap questions; significant for "did these two niche audiences intersect at all?" types of queries.
Set algebra has composition limits. Chaining many operations accumulates error; the DataSketches library provides guidance on which chains are stable.
BLOB storage cost. Per-(campaign, day) sketch adds up; for campaigns in the hundreds and days in the thousands, the sketch column is non-trivial.

Why Theta, not HyperLogLog¶

A common misconception is that HLL handles this case. It does not — HLL's union is excellent, but:

HLL intersection via inclusion-exclusion (|A∩B| = |A| + |B| − |A∪B|) has error that is a function of all three cardinalities, which blows up when the overlap is small.
HLL difference has no stable definition.

Theta sketches are built on a retained random sample, which supports intersection and difference natively with well-understood error behaviour.

This distinction is the architectural reason Databricks shipped Theta sketch functions alongside its existing HLL-style approx_count_distinct.

Storage integration¶

This pattern is a specialisation of patterns/precomputed-sketch-column-in-delta-table — the sketches are mergeable Theta sketches stored as BLOB columns. The read-side operators (theta_union, theta_intersection, theta_difference) are what distinguish "set algebra for audience overlap" from "precompute percentiles and merge".

Seen in¶

sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Databricks' canonical framing. Quote: "Theta sketches summarize a set of distinct values in bounded memory and support full set algebra: unions, intersections, and differences. Build a sketch per campaign, then combine them mathematically." The post explicitly names this pattern as enabling "daily reach curves, incrementality measurement, and cross-channel deduplication."

concepts/theta-sketch
concepts/mergeable-sketch
concepts/decision-support-vs-audit-query
concepts/tuple-sketch — extension where per-distinct- value metric (revenue, session-count) rides along with the sketch; set-algebra operations preserve metric aggregation.
patterns/precomputed-sketch-column-in-delta-table — parent storage pattern.
patterns/local-global-aggregation-split — the general decomposition this pattern specialises.
systems/apache-datasketches — underlying library.
systems/databricks — first wiki-consumer.