CONCEPT Cited by 1 source

Tuple sketch

Definition

A Tuple sketch is a streaming probabilistic data structure that combines distinct-value cardinality estimation with per-key metric aggregation in a single mergeable summary. It extends the Theta sketch family: the "tuple" pairs each retained distinct key with an attached metric value (e.g. customer → aggregated spend). It is part of the Apache DataSketches library.

The canonical problem tuple sketches solve is the composed aggregation:

"How many unique customers made a purchase this month, and what was their total revenue by region?"

Exact computation requires a large GROUP BY, deduplicating customer IDs while summing purchases across billions of transactions. Worse, you cannot simply add prior results together: customers appearing in both periods get double-counted and their revenue overstated.
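The double-counting problem can be seen with exact sets on a tiny, made-up dataset. The customer names and amounts below are hypothetical and only illustrate why per-period unique counts are not additive.

```python
# Hypothetical purchase events: (customer_id, amount).
week1 = [("alice", 10), ("bob", 5)]
week2 = [("alice", 7), ("carol", 3)]


def unique_customers(events):
    """Return the set of distinct customer IDs in a list of events."""
    return {customer for customer, _ in events}


# Adding the two per-week unique counts counts "alice" twice.
naive_count = len(unique_customers(week1)) + len(unique_customers(week2))

# The correct answer needs deduplication across both weeks.
true_count = len(unique_customers(week1 + week2))

print(naive_count, true_count)  # 4 3
```

Getting the correct figure here required going back to the raw events, which is exactly the step a mergeable sketch is meant to avoid.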

(Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

Why Tuple over Theta + separate metric

You could solve cardinality and sum separately — a Theta sketch for unique-customer count, a precomputed sum for revenue. But:

  • Merging loses correctness on the sum. Summing revenue across two periods double-counts customers who appear in both.
  • The two sketches can't be reconciled — the Theta sketch knows the unique count; the separate sum has no idea which customers are duplicates.

A Tuple sketch maps each distinct customer to its aggregated spend in one structure. When you merge across days:

  • Customer counts deduplicate automatically (Theta property).
  • Revenue sums accumulate per-customer — so merging two periods' Tuple sketches gives you correct total revenue from the union of customers.

The key architectural property: merging takes the place of reprocessing. When the date range changes, you merge the stored per-period sketches instead of rescanning the raw data.
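The merge behaviour described above can be sketched in a few lines of pure Python. This is a toy illustration of the general idea (keep the k smallest key hashes, each with a summed metric), not the Apache DataSketches implementation; the class name `ToyTupleSketch`, the helper `hash01`, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def hash01(key: str) -> float:
    """Map a key to a pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / float(2**64)


class ToyTupleSketch:
    """Toy sketch: keeps the k smallest key hashes, each with a summed metric."""

    def __init__(self, k: int = 1024):
        self.k = k
        self.theta = 1.0   # sampling threshold: retained hashes are < theta
        self.entries = {}  # key hash -> summed metric for that key

    def update(self, key: str, value: int) -> None:
        h = hash01(key)
        if h >= self.theta:
            return  # key falls outside the sample
        self.entries[h] = self.entries.get(h, 0) + value
        self._trim()

    def _trim(self) -> None:
        if len(self.entries) > self.k:
            # Lower theta to the (k+1)-th smallest hash and drop it and above.
            self.theta = sorted(self.entries)[self.k]
            self.entries = {h: v for h, v in self.entries.items() if h < self.theta}

    def merge(self, other: "ToyTupleSketch") -> "ToyTupleSketch":
        out = ToyTupleSketch(min(self.k, other.k))
        out.theta = min(self.theta, other.theta)
        for src in (self, other):
            for h, v in src.entries.items():
                if h < out.theta:
                    # Same key hash in both inputs: one entry, metrics summed.
                    out.entries[h] = out.entries.get(h, 0) + v
        out._trim()
        return out

    def estimate_distinct(self) -> float:
        return len(self.entries) / self.theta

    def estimate_sum(self) -> float:
        return sum(self.entries.values()) / self.theta


day1, day2 = ToyTupleSketch(), ToyTupleSketch()
for customer, amount in [("alice", 10), ("bob", 5)]:
    day1.update(customer, amount)
for customer, amount in [("alice", 7), ("carol", 3)]:
    day2.update(customer, amount)

both = day1.merge(day2)
print(both.estimate_distinct(), both.estimate_sum())  # 3.0 25.0
```

With fewer than k distinct keys the toy sketch is exact, so the merged result reports three customers (not four) and their combined spend. Once more than k keys arrive, the results become estimates scaled by 1/theta.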

Canonical use cases

  • Customer cardinality + revenue aggregation across time windows.
  • Session count + session duration sum per user cohort.
  • Unique devices + error count per device per release.
  • Any "distinct X, aggregated-metric Y-per-X" question over a large event stream where the date / cohort window is flexible.

The API shape in Databricks

(Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics)

  • tuple_sketch_agg_integer(key_col, metric_col): builds a Tuple sketch that maps each distinct key to an aggregated integer metric.
  • Variants for double / decimal metrics.
  • Merge functions: Tuple sketches are mergeable across partitions, days, and clusters.

Relationship to Theta

A Tuple sketch's cardinality view is identical to that of a Theta sketch. The Tuple sketch adds a per-key payload (a metric value) that survives the merge. You can reason about a Tuple sketch as "Theta, plus a user-defined summary function per distinct key."

This means patterns that work for Theta — set algebra for audience overlap — extend naturally to Tuple sketches, with the caveat that the per-key metric composes via the payload's aggregation rule (sum, min, max, etc.).
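To make the payload caveat concrete, here is a minimal sketch of an intersection in which the caller chooses how the two per-key metrics combine. The `(theta, entries)` representation, the `intersect` helper, and the hand-picked hash values are assumptions for illustration and do not mirror the DataSketches or Databricks APIs.

```python
import operator
from typing import Callable, Dict, Tuple

# A sketch is modelled as (theta, {key_hash: summary}); hashes lie in [0, 1).
Sketch = Tuple[float, Dict[float, int]]


def intersect(a: Sketch, b: Sketch, combine: Callable[[int, int], int]) -> Sketch:
    """Keep keys present in both sketches; merge their summaries with `combine`."""
    theta = min(a[0], b[0])
    entries = {
        h: combine(v, b[1][h])
        for h, v in a[1].items()
        if h in b[1] and h < theta
    }
    return theta, entries


def estimate_distinct(s: Sketch) -> float:
    """Scale the retained entry count by the sampling threshold."""
    return len(s[1]) / s[0]


# Hand-picked hash values standing in for hashed customer IDs.
buyers_web: Sketch = (1.0, {0.1: 10, 0.4: 5})
buyers_app: Sketch = (0.5, {0.1: 7, 0.3: 3})

overlap_sum = intersect(buyers_web, buyers_app, operator.add)
overlap_min = intersect(buyers_web, buyers_app, min)
overlap_max = intersect(buyers_web, buyers_app, max)

print(overlap_sum[1], overlap_min[1], overlap_max[1])
# {0.1: 17} {0.1: 7} {0.1: 10}
```

The set of surviving keys is the same in all three results; only the payload differs, which is why the aggregation rule has to be chosen to match the question being asked.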

Seen in

  • sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics: Databricks launches Tuple sketch functions in SQL / DataFrame / Structured Streaming. Community contribution: Christopher Boumalhab implemented the Theta and Tuple sketch function families in upstream Apache Spark. Canonical example: unique-customer count and total revenue by region, merged across days without double-counting. The post's framing: "Tuple sketches solve this by combining distinct counting and metric aggregation in a single, mergeable structure."