Skip to content

CONCEPT Cited by 1 source

Mergeable sketch

Definition

A mergeable sketch is a probabilistic data structure with a merge operator that is associative and commutative, and whose merge produces a new sketch of the same type with error bounds at most as bad as applying the sketch directly to the union of the two underlying input streams.

Formally, a sketch family S is mergeable if for any two streams X and Y:

merge(S(X), S(Y))  ≈  S(X ∪ Y)

The approximation's error is bounded by the family's per-sketch error envelope — merging does not accumulate unbounded error. Mergeability is the semigroup property on sketches: (S, ⊕) where is the merge operator.

Why mergeability is the architectural unlock

Most sketch consumers care less about the algorithm inside the sketch and more about what you can do with it. Mergeability is what enables:

  • Distributed aggregation. Build per-partition sketches in parallel on each worker; reduce them with merge to get a cluster-wide sketch. This is the local-global aggregation pattern generalised to probabilistic structures.
  • Time-window composition. Precompute per-hour sketches; merge 168 of them for a weekly view without rescanning raw events. The dashboard becomes "merge the sketches I need" instead of "scan the raw data for this window".
  • Streaming + batch unification. A streaming system can emit per-micro-batch sketches that are mergeable with a batch pipeline's per-day sketches — both produce sketches of the same type, both are mergeable.
  • Cross-system portability. A sketch written by a Spark ETL is mergeable with a sketch built by a Druid ingestor, as long as they share the wire format and algorithm family — e.g. all Apache DataSketches implementations interoperate.

Without mergeability, sketches are merely faster — a percentile over a partition in one pass. With mergeability, sketches become a storage primitive — precomputed columns that are composable on read (see patterns/precomputed-sketch-column-in-delta-table).

Which sketches are mergeable

The DataSketches family is explicitly built around mergeability. Every sketch in the library ships with a merge / combine function:

  • KLL — merges into a single rank-error-bounded sketch.
  • Theta — merges via union, intersection, or difference (full set algebra).
  • Approx top-K — merges via approx_top_k_combine.
  • Tuple — merges with per-key metric aggregation preserved.

HyperLogLog is also mergeable on union (its canonical use case) but not on intersection — which is part of why Theta exists for audience-overlap analytics.

Mergeability ≠ lossless

A mergeable sketch family still has a per-sketch error envelope (e.g. 1–2% for KLL). The property is that merging does not amplify that error — two merged sketches have an error envelope comparable to a single sketch over the merged input, not the sum of the individual errors. This is what makes it safe to "merge 168 hourly sketches" without the weekly view being 168× worse than the hourly view.

Seen in

  • sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics — Databricks' 2026-04-29 post is the canonical articulation of mergeability as the architectural property. The post repeatedly uses "mergeable" as an adjective applied to each sketch family. Its dashboard example — "merge the precomputed sketches in milliseconds instead of rescanning raw data" — is the direct payoff of the property. Similarly the 168-sketch weekly-trending merge and the per-micro-batch streaming merge both depend on mergeability.
Last updated · 438 distilled / 1,268 read