PATTERN Cited by 1 source
Set algebra on Theta sketches¶
Context¶
Marketing measurement, audience analytics, and incrementality testing routinely ask questions shaped like:
- "How many users saw your Super Bowl ad but not your Instagram campaign?" — exclusive reach.
- "How many users saw both campaigns?" — overlap.
- "What is the total reach across all campaigns?" — union.
- "How many users saw channel A this week that didn't see it last week?" — difference over time.
These are fundamentally set-algebra questions over large
distinct-user sets. Exact computation requires collecting every
user ID into memory and performing UNION + JOIN, potentially
shuffling billions of identifiers across the cluster. At
scale, this is "impractical or impossible" (Source:
sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics).
HyperLogLog sketches handle union well but intersection via inclusion-exclusion is unreliable for small overlaps — and difference is even worse.
Pattern¶
Build a Theta sketch per logical set (campaign, channel, segment, day), store the sketches, and evaluate set-algebra queries — union, intersection, difference — locally on the compact binary sketches in microseconds.
Shape¶
-- Build: one Theta sketch per (campaign, day)
CREATE TABLE campaign_audience (
campaign_id STRING,
day DATE,
theta_sketch BINARY
) USING DELTA;
INSERT INTO campaign_audience
SELECT campaign_id, day, theta_sketch_agg(user_id)
FROM exposures
GROUP BY campaign_id, day;
Queries¶
-- Total reach across campaigns A, B, C
SELECT theta_sketch_estimate(
theta_union(theta_sketch)
) AS total_reach
FROM campaign_audience
WHERE campaign_id IN ('A', 'B', 'C');
-- Overlap: users who saw both A and B
WITH a AS (
SELECT theta_union(theta_sketch) AS sketch
FROM campaign_audience WHERE campaign_id = 'A'
),
b AS (
SELECT theta_union(theta_sketch) AS sketch
FROM campaign_audience WHERE campaign_id = 'B'
)
SELECT theta_sketch_estimate(
theta_intersection(a.sketch, b.sketch)
) FROM a, b;
-- Exclusive reach: saw A but not B
SELECT theta_sketch_estimate(
theta_difference(a.sketch, b.sketch)
) FROM a, b;
The critical property (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics):
"The set operations happen locally in microseconds."
Each sketch is kilobytes, each operation is an in-process arithmetic over the retained sample — not a cluster-wide shuffle of raw user IDs.
Consequences¶
Good:
- Audience overlap queries go from batch to live. What used to be "we'll have the overlap report tomorrow" becomes a live dashboard.
- Daily reach curves become practical. Rolling 30-day unique-user reach is a union over 30 daily sketches.
- Cross-channel deduplication is trivial. Union the per- channel sketches to get total unique audience.
- Incrementality measurement is a difference. Users exposed to treatment minus users in control, over any time window.
- The queries are composable.
(A ∪ B) ∩ C= the audience that saw C and at least one of A or B.
Tradeoffs:
- Theta sketches approximate cardinality. Error on intersection can be larger than on union when the intersection is small relative to the inputs — a known property of sample- based intersection estimation. Not a concern for most overlap questions; significant for "did these two niche audiences intersect at all?" types of queries.
- Set algebra has composition limits. Chaining many operations accumulates error; the DataSketches library provides guidance on which chains are stable.
- BLOB storage cost. Per-(campaign, day) sketch adds up; for campaigns in the hundreds and days in the thousands, the sketch column is non-trivial.
Why Theta, not HyperLogLog¶
A common misconception is that HLL handles this case. It does not — HLL's union is excellent, but:
- HLL intersection via inclusion-exclusion (|A∩B| = |A| + |B| − |A∪B|) has error that is a function of all three cardinalities, which blows up when the overlap is small.
- HLL difference has no stable definition.
Theta sketches are built on a retained random sample, which supports intersection and difference natively with well-understood error behaviour.
This distinction is the architectural reason Databricks shipped
Theta sketch functions alongside its existing HLL-style
approx_count_distinct.
Storage integration¶
This pattern is a specialisation of
patterns/precomputed-sketch-column-in-delta-table — the
sketches are mergeable Theta
sketches stored as BLOB columns. The read-side operators
(theta_union, theta_intersection, theta_difference) are
what distinguish "set algebra for audience overlap" from
"precompute percentiles and merge".
Seen in¶
- sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Databricks' canonical framing. Quote: "Theta sketches summarize a set of distinct values in bounded memory and support full set algebra: unions, intersections, and differences. Build a sketch per campaign, then combine them mathematically." The post explicitly names this pattern as enabling "daily reach curves, incrementality measurement, and cross-channel deduplication."
Related¶
- concepts/theta-sketch
- concepts/mergeable-sketch
- concepts/decision-support-vs-audit-query
- concepts/tuple-sketch — extension where per-distinct- value metric (revenue, session-count) rides along with the sketch; set-algebra operations preserve metric aggregation.
- patterns/precomputed-sketch-column-in-delta-table — parent storage pattern.
- patterns/local-global-aggregation-split — the general decomposition this pattern specialises.
- systems/apache-datasketches — underlying library.
- systems/databricks — first wiki-consumer.