CONCEPT Cited by 1 source

Tuple sketch

Definition

A Tuple sketch is a streaming probabilistic data structure that combines distinct-value cardinality estimation with per-key metric aggregation in a single mergeable summary. It extends the Theta sketch family: the "tuple" pairs each retained distinct key with an attached metric value (e.g. customer → aggregated spend). It is part of the Apache DataSketches library.

The canonical problem tuple sketches solve is the composed aggregation:

"How many unique customers made a purchase this month, and what was their total revenue by region?"

Exact computation requires a large GROUP BY that deduplicates customer IDs while summing purchases across billions of transactions. Worse, you cannot simply add together results precomputed for separate periods: customers who appear in more than one period get double-counted and their revenue is overstated.

(Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)
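To make the double-counting concrete, here is a minimal illustration that uses exact Python sets rather than any sketch structure. The customer IDs and period names are invented for the example.

```python
# Hypothetical customer IDs seen in two reporting periods.
march_customers = {"alice", "bob", "carol"}
april_customers = {"bob", "carol", "dave"}

# Adding two precomputed distinct counts counts bob and carol twice.
naive_unique = len(march_customers) + len(april_customers)

# The correct answer needs the customer identities, not just the counts.
true_unique = len(march_customers | april_customers)

print(naive_unique)  # 6
print(true_unique)   # 4
```

The precomputed counts have discarded the identities needed to deduplicate, which is the information a mergeable sketch retains in compressed form.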

Why Tuple over Theta + separate metric

You could solve cardinality and sum separately — a Theta sketch for unique-customer count, a precomputed sum for revenue. But:

Merging loses correctness on the sum. Summing revenue across two periods double-counts customers who appear in both.
The two summaries can't be reconciled: the Theta sketch knows the unique count, but the separate sum keeps no record of which customers contributed to it, so duplicates cannot be identified.

A Tuple sketch maps each distinct customer to its aggregated spend in one structure. When you merge across days:

Customer counts deduplicate automatically (Theta property).
Revenue sums accumulate per-customer — so merging two periods' Tuple sketches gives you correct total revenue from the union of customers.

The key architectural property: merging is the reprocessing. No need to reprocess from raw data every time the date range changes.
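The merge behaviour described above can be sketched with a toy, pure-Python structure. This is not the DataSketches implementation: the class name `ToyTupleSketch`, the parameter `k`, the SHA-256 hashing, and the sum-only payload are all assumptions chosen to keep the example short. It keeps the `k` smallest key hashes, each with a running sum, and scales results by the sampling threshold `theta`.

```python
import hashlib


class ToyTupleSketch:
    """Theta-style sample of hashed keys, each carrying a running sum."""

    def __init__(self, k=1024):
        self.k = k          # nominal number of retained keys
        self.theta = 1.0    # sampling threshold in (0, 1]
        self.entries = {}   # key hash (as a fraction) -> summed metric

    @staticmethod
    def _hash(key):
        # Map the key to a float in [0, 1) using 53 bits of SHA-256.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

    def _trim(self):
        # Keep the k smallest hashes; theta becomes the (k+1)-th smallest.
        if len(self.entries) > self.k:
            ordered = sorted(self.entries)
            self.theta = ordered[self.k]
            self.entries = {h: self.entries[h] for h in ordered[:self.k]}

    def update(self, key, metric):
        h = self._hash(key)
        if h < self.theta:
            self.entries[h] = self.entries.get(h, 0) + metric
            self._trim()

    def merge(self, other):
        merged = ToyTupleSketch(min(self.k, other.k))
        merged.theta = min(self.theta, other.theta)
        for source in (self, other):
            for h, value in source.entries.items():
                if h < merged.theta:
                    # Same key in both inputs: one entry, payloads summed.
                    merged.entries[h] = merged.entries.get(h, 0) + value
        merged._trim()
        return merged

    def estimate_distinct(self):
        return len(self.entries) / self.theta

    def estimate_sum(self):
        return sum(self.entries.values()) / self.theta


day1 = ToyTupleSketch()
for i in range(300):           # customers c0..c299 spend 10 each
    day1.update(f"c{i}", 10)

day2 = ToyTupleSketch()
for i in range(200, 500):      # customers c200..c499 spend 5 each
    day2.update(f"c{i}", 5)

both = day1.merge(day2)
print(both.estimate_distinct())  # 500.0 (exact: fewer than k keys were seen)
print(both.estimate_sum())       # 4500.0
```

The 100 customers who appear on both days are counted once, and their spend from each day is added under the same key. Once more than `k` distinct keys have been seen, `theta` drops below 1 and both results become estimates rather than exact values.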

Canonical use cases

Customer cardinality + revenue aggregation across time windows.
Session count + session duration sum per user cohort.
Unique devices + error count per device, per release.
Any "distinct X, aggregated-metric Y-per-X" question over a large event stream where the date / cohort window is flexible.

The API shape in Databricks

(Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

tuple_sketch_agg_integer(key_col, metric_col) — builds a Tuple sketch mapping each distinct key to an aggregated integer metric.
Variants for double / decimal metrics.
Merge functions — tuple sketches are mergeable across partitions, days, clusters.
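As a rough illustration of where the aggregate sits in a query, the helper below assembles a SQL string in Python. Only `tuple_sketch_agg_integer(key_col, metric_col)` comes from the list above; the helper name, the table, and the column names are hypothetical, and the query is only built here, not executed.

```python
def build_sketch_query(table, group_col, key_col, metric_col):
    """Return SQL that builds one Tuple sketch per group.

    The identifiers are interpolated directly, so they are assumed to be
    trusted, caller-supplied names rather than user input.
    """
    return (
        f"SELECT {group_col}, "
        f"tuple_sketch_agg_integer({key_col}, {metric_col}) AS customer_sketch "
        f"FROM {table} "
        f"GROUP BY {group_col}"
    )


# Hypothetical table and columns: one sketch of customer spend per region.
query = build_sketch_query("purchases", "region", "customer_id", "amount_cents")
print(query)
```

On a runtime that provides the function, a string like this could presumably be passed to `spark.sql(query)`. The exact names of the merge and estimate functions should be taken from the platform documentation rather than from this example.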

Relationship to Theta

A Tuple sketch's cardinality view is identical to that of a Theta sketch. The Tuple sketch adds a per-key payload (a metric value) that survives the merge. You can reason about a Tuple sketch as "Theta, plus a user-defined summary function per distinct key."

This means patterns that work for Theta — set algebra for audience overlap — extend naturally to Tuple sketches, with the caveat that the per-key metric composes via the payload's aggregation rule (sum, min, max, etc.).
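The way the payload rule composes under set operations can be shown with exact dictionaries standing in for a sketch's retained entries. This is an illustrative simplification, not the DataSketches API: the function names, the `combine` parameter, and the sample data are assumptions.

```python
from operator import add


def union(a, b, combine=add):
    """Keys in either input; payloads of shared keys are combined."""
    out = dict(a)
    for key, value in b.items():
        out[key] = combine(out[key], value) if key in out else value
    return out


def intersection(a, b, combine=add):
    """Keys present in both inputs, with their payloads combined."""
    return {key: combine(a[key], b[key]) for key in a.keys() & b.keys()}


# Hypothetical per-customer spend by channel.
web = {"alice": 30, "bob": 20}
app = {"bob": 5, "carol": 15}

print(union(web, app))                      # {'alice': 30, 'bob': 25, 'carol': 15}
print(intersection(web, app))               # {'bob': 25}
print(intersection(web, app, combine=max))  # {'bob': 20}
```

The set of surviving keys is decided exactly as it would be for Theta. The `combine` rule only determines what happens to the payload of a key that appears on both sides.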

Seen in

sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Databricks launches Tuple sketch functions in SQL / DataFrame / Structured Streaming; community contribution: Christopher Boumalhab implemented the Theta and Tuple sketch function families in upstream Apache Spark. Canonical example: unique-customer count and total revenue by region, merged across days without double-counting. The post's framing: "Tuple sketches solve this by combining distinct counting and metric aggregation in a single, mergeable structure."

concepts/theta-sketch — parent family.
concepts/mergeable-sketch — underlying property.
concepts/kll-quantile-sketch — sibling family (quantiles).
concepts/approximate-top-k-sketch — sibling family (heavy hitters).
concepts/decision-support-vs-audit-query — the framing that justifies approximate answers over exact.
systems/apache-datasketches — underlying library.
systems/databricks — first wiki-consumer.
patterns/precomputed-sketch-column-in-delta-table — the Delta-Lake storage pattern that makes tuple sketches useful beyond a single query.