CONCEPT Cited by 1 source
Tuple sketch¶
Definition¶
A Tuple sketch is a streaming probabilistic data structure that combines distinct-value cardinality estimation with per-key metric aggregation in a single mergeable summary. It is an extension of the Theta sketch family — the "tuple" is the per-distinct-key attached metric value (e.g. customer → aggregated spend) — and is part of the Apache DataSketches library.
The canonical problem tuple sketches solve is the composed aggregation:
"How many unique customers made a purchase this month, and what was their total revenue by region?"
Exact computation requires a large GROUP BY, deduplicating customer IDs while summing purchases across billions of transactions. Worse, you cannot simply add prior results together — customers appearing in both periods get double-counted and their revenue overstated.
(Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
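The double-counting problem in the quote can be reproduced with exact sets. The snippet below is a minimal illustration on made-up data (the customer names, amounts, and the `distinct_customers` helper are hypothetical, not from the source): adding per-period distinct counts overstates the true count whenever a customer appears in both periods.

```python
# Hypothetical transactions per period: (customer_id, amount).
january = [("alice", 30), ("bob", 20), ("alice", 10)]
february = [("bob", 5), ("carol", 50)]


def distinct_customers(rows):
    """Exact set of customer IDs seen in a list of transactions."""
    return {customer for customer, _ in rows}


jan_count = len(distinct_customers(january))   # alice, bob
feb_count = len(distinct_customers(february))  # bob, carol

# Adding the stored per-period counts counts bob twice.
naive_total = jan_count + feb_count

# The exact answer needs the raw IDs from both periods again.
exact_total = len(distinct_customers(january + february))

print(naive_total, exact_total)  # 4 3
```

A stored count carries no information about which customers it covers, so the only exact fix is to go back to the raw rows. A mergeable sketch avoids that rescan.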
Why Tuple over Theta + separate metric¶
You could solve cardinality and sum separately — a Theta sketch for unique-customer count, a precomputed sum for revenue. But:
- Merging loses correctness on the sum. Summing revenue across two periods double-counts customers who appear in both.
- The two sketches can't be reconciled — the Theta sketch knows the unique count; the separate sum has no idea which customers are duplicates.
A Tuple sketch maps each distinct customer to its aggregated spend in one structure. When you merge across days:
- Customer counts deduplicate automatically (Theta property).
- Revenue sums accumulate per customer, so merging two periods' Tuple sketches yields a total-revenue estimate over the union of customers, with no customer counted twice.
The key architectural property: merging is the reprocessing. No need to reprocess from raw data every time the date range changes.
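To make the merge behavior concrete, here is a toy, pure-Python sketch of the idea. It is an illustration under stated assumptions, not the Apache DataSketches or Databricks implementation: the class `ToyTupleSketch`, the SHA-256 based `unit_hash`, and the sum-only payload are all hypothetical choices made for this example. It keeps the `k` smallest key hashes (a Theta-style sample), stores a summed metric per retained key, and scales by `1 / theta` to estimate totals.

```python
import hashlib


def unit_hash(key: str) -> float:
    """Map a key to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


class ToyTupleSketch:
    """Theta-style sample of hashed keys, each carrying a summed metric."""

    def __init__(self, k: int = 1024) -> None:
        self.k = k          # maximum number of retained keys
        self.theta = 1.0    # only hashes strictly below theta are retained
        self.entries = {}   # hash -> metric summed over that key

    def update(self, key: str, value: float) -> None:
        h = unit_hash(key)
        if h < self.theta:
            self.entries[h] = self.entries.get(h, 0) + value
            self._trim()

    def _trim(self) -> None:
        # On overflow keep the k smallest hashes; theta drops to the (k+1)-th.
        if len(self.entries) > self.k:
            ordered = sorted(self.entries)
            self.theta = ordered[self.k]
            self.entries = {h: self.entries[h] for h in ordered[: self.k]}

    def merge(self, other: "ToyTupleSketch") -> "ToyTupleSketch":
        merged = ToyTupleSketch(min(self.k, other.k))
        merged.theta = min(self.theta, other.theta)
        for sketch in (self, other):
            for h, value in sketch.entries.items():
                if h < merged.theta:
                    # Same hash means same key: count it once, add the metric.
                    merged.entries[h] = merged.entries.get(h, 0) + value
        merged._trim()
        return merged

    def distinct_estimate(self) -> float:
        return len(self.entries) / self.theta

    def sum_estimate(self) -> float:
        return sum(self.entries.values()) / self.theta


day1 = ToyTupleSketch()
day1.update("alice", 30)
day1.update("bob", 20)

day2 = ToyTupleSketch()
day2.update("bob", 5)
day2.update("carol", 50)

both = day1.merge(day2)
# Fewer than k keys, so theta stays 1.0 and both numbers are exact.
print(both.distinct_estimate(), both.sum_estimate())  # 3.0 105.0
```

With more than `k` distinct keys the sketch switches to estimation: only a sample of keys is retained, and both the count and the sum are scaled by `1 / theta`. The merge only touches the two small summaries, never the raw events.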
Canonical use cases¶
- Customer cardinality + revenue aggregation across time windows.
- Session count + session duration sum per user cohort.
- Unique-devices + error-count per device per release.
- Any "distinct X, aggregated-metric Y-per-X" question over a large event stream where the date / cohort window is flexible.
The API shape in Databricks¶
(Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
- tuple_sketch_agg_integer(key_col, metric_col) — build a Tuple sketch mapping distinct keys to an aggregated integer metric.
- Variants for double / decimal metrics.
- Merge functions — tuple sketches are mergeable across partitions, days, clusters.
Relationship to Theta¶
A Tuple sketch's cardinality view is identical to that of a Theta sketch. The Tuple sketch adds a per-key payload (a metric value) that survives the merge. You can reason about a Tuple sketch as "Theta, plus a user-defined summary function per distinct key."
This means patterns that work for Theta — set algebra for audience overlap — extend naturally to Tuple sketches, with the caveat that the per-key metric composes via the payload's aggregation rule (sum, min, max, etc.).
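The caveat about the payload's aggregation rule can be shown with exact dictionaries standing in for sketches. This is a simplified model, not a sketch implementation: the `union` and `intersection` helpers and the sample data are hypothetical. The set operation decides which keys survive, and the combine rule decides what happens to the metric when a key is present on both sides.

```python
import operator
from typing import Callable, Dict

Payloads = Dict[str, float]
Combine = Callable[[float, float], float]


def union(a: Payloads, b: Payloads, combine: Combine) -> Payloads:
    """Keep every key; combine payloads for keys present on both sides."""
    out = dict(a)
    for key, value in b.items():
        out[key] = combine(out[key], value) if key in out else value
    return out


def intersection(a: Payloads, b: Payloads, combine: Combine) -> Payloads:
    """Keep only shared keys; combine their payloads."""
    return {key: combine(a[key], b[key]) for key in a.keys() & b.keys()}


# Hypothetical per-customer spend on two channels.
web = {"alice": 30, "bob": 20}
app = {"bob": 5, "carol": 50}

print(union(web, app, operator.add))  # {'alice': 30, 'bob': 25, 'carol': 50}
print(intersection(web, app, max))    # {'bob': 20}
```

The same key set can therefore yield different metrics depending on the rule: summing answers "total spend", while `max` or `min` answers "largest or smallest single-channel spend" for the overlapping audience.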
Seen in¶
- sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Databricks launches Tuple sketch functions in SQL / DataFrame / Structured Streaming; community contribution: Christopher Boumalhab implemented the Theta and Tuple sketch function families in upstream Apache Spark. Canonical example: unique-customer count and total revenue by region, merged across days without double-counting. The post's framing: "Tuple sketches solve this by combining distinct counting and metric aggregation in a single, mergeable structure."
Related¶
- concepts/theta-sketch — parent family.
- concepts/mergeable-sketch — underlying property.
- concepts/kll-quantile-sketch — sibling family (quantiles).
- concepts/approximate-top-k-sketch — sibling family (heavy hitters).
- concepts/decision-support-vs-audit-query — the framing that justifies approximate answers over exact.
- systems/apache-datasketches — underlying library.
- systems/databricks — first wiki-consumer.
- patterns/precomputed-sketch-column-in-delta-table — the Delta-Lake storage pattern that makes tuple sketches useful beyond a single query.