
CONCEPT

Approximate top-K sketch

Definition

An approximate top-K sketch is a streaming probabilistic data structure that tracks the K most-frequent items in a stream of events, using bounded memory independent of the number of distinct items, and supports merging across partitions and time windows. It belongs to the Apache DataSketches family and is sometimes called a heavy-hitter or frequent-items sketch.

Given a high-cardinality event stream (search logs, clickstream, URLs):

  • Exact top-K requires counting every distinct value, storing all those counts, shuffling them across the cluster, and sorting globally. For billion-event streams, "this is a batch job, not a live query".
  • Approximate top-K tracks the K most-frequent items in bounded memory with a well-defined tradeoff: rare items may be dropped, but the reported top-K is within bounded error of the true top-K.
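
To make the bounded-memory claim concrete, here is a minimal Python sketch in the spirit of the Misra-Gries heavy-hitters algorithm, one classical basis for this sketch family. The class, its names, and its exact eviction rule are illustrative assumptions, not the DataSketches or Databricks implementation.

    class TopKSketch:
        """Minimal Misra-Gries-style heavy-hitters counter (illustrative only)."""

        def __init__(self, capacity: int):
            # At most `capacity` counters are live at once, no matter how many
            # distinct items the stream contains.
            self.capacity = capacity
            self.counters: dict[str, int] = {}

        def update(self, item: str) -> None:
            if item in self.counters:
                self.counters[item] += 1
            elif len(self.counters) < self.capacity:
                self.counters[item] = 1
            else:
                # No free slot: decrement everything and drop zeros. This is
                # where long-tail items fall off the sketch's retention.
                for key in list(self.counters):
                    self.counters[key] -= 1
                    if self.counters[key] == 0:
                        del self.counters[key]

        def merge(self, other: "TopKSketch") -> "TopKSketch":
            # Merging is associative: sum counters, then shrink back to
            # capacity by offsetting with the (capacity+1)-th largest count.
            combined = dict(self.counters)
            for key, count in other.counters.items():
                combined[key] = combined.get(key, 0) + count
            if len(combined) > self.capacity:
                offset = sorted(combined.values(), reverse=True)[self.capacity]
                combined = {k: v - offset for k, v in combined.items() if v > offset}
            merged = TopKSketch(self.capacity)
            merged.counters = combined
            return merged

        def top_k(self, k: int) -> list[tuple[str, int]]:
            # Reported counts are lower bounds within bounded error of the truth.
            return sorted(self.counters.items(), key=lambda kv: -kv[1])[:k]

The merge method is the property everything below relies on: per-window sketches can be reduced pairwise, in any order, without revisiting raw events.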

The accepted tradeoff

Databricks' canonical framing (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics):

"Rare items might be dropped, which is fine, because that's not what you're looking for."

A top-K sketch is explicitly not a substitute for exact frequency counts — it's a substitute for the top-K slice of the frequency distribution. Long-tail items fall off the sketch's retention, which is exactly the intended behaviour when the question is "what's trending right now?".

Canonical use cases

  • Trending dashboards. "Trending this week" becomes a merge of 168 pre-computed hourly sketches rather than a scan of billions of raw events. (Source: Databricks 2026-04-29 post.)
  • Streaming leaderboards. In Structured Streaming, merge each micro-batch's sketch into a running total and display the results in real time: a live leaderboard built from incremental micro-batch computation (see the sketch after this list).
  • Traffic analysis. Top search queries, top URLs, top user agents — any per-key frequency ranking over a high-cardinality event stream.
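
A hedged sketch of that streaming-leaderboard pattern in PySpark. The table names, the query column, the checkpoint path, and the use of approx_top_k_combine as an aggregate over many rows (the post shows the two-sketch form; associativity extends it to N inputs) are all assumptions.

    from pyspark.sql import functions as F

    # `spark` comes from the notebook/session context; `search_events` and
    # its `query` column are assumed example names.
    events = spark.readStream.table("search_events")

    def merge_batch(batch_df, batch_id):
        # Sketch this micro-batch alone.
        (batch_df
            .agg(F.expr("approx_top_k_accumulate(query)").alias("sketch"))
            .createOrReplaceTempView("batch_sketch"))
        # Fold it into a one-row running-total Delta table (assumed to be
        # created beforehand); Delta's snapshot isolation makes the
        # self-referential overwrite safe.
        spark.sql("""
            INSERT OVERWRITE leaderboard_sketch
            SELECT approx_top_k_combine(sketch) AS sketch FROM (
                SELECT sketch FROM leaderboard_sketch
                UNION ALL
                SELECT sketch FROM batch_sketch
            )
        """)

    (events.writeStream
        .foreachBatch(merge_batch)
        .option("checkpointLocation", "/checkpoints/leaderboard")  # assumed path
        .start())

Each trigger touches only the micro-batch's own rows plus one stored sketch, which is what keeps the leaderboard live.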

Workflow: per-window sketch, merge on read

Same pattern as other DataSketches families (see patterns/precomputed-sketch-column-in-delta-table):

  1. Build a top-K sketch per hour (or per micro-batch for streaming) during ingest.
  2. Store the sketch as a BLOB column in a Delta table, or accumulate it in streaming state.
  3. At read time, call approx_top_k_combine to merge the relevant hourly sketches and extract the top-K items.

The merge operation is associative — cluster-wide top-K across any time window is a reduce over per-window sketches, not a scan over raw events.
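
A hedged end-to-end rendering of steps 1-3 in PySpark-driven SQL. Table and column names are assumed, approx_top_k_combine again appears in aggregate form, and the final extraction call is deliberately left as a comment because the post does not name it.

    # Steps 1-2: one sketch per hour at ingest, stored as a BLOB column
    # (the precomputed-sketch-column-in-delta-table pattern).
    spark.sql("""
        CREATE OR REPLACE TABLE hourly_query_sketches AS
        SELECT
            date_trunc('HOUR', event_ts)   AS hour,
            approx_top_k_accumulate(query) AS sketch
        FROM search_events
        GROUP BY date_trunc('HOUR', event_ts)
    """)

    # Step 3, merge half: "trending this week" reduces ~168 hourly sketches
    # instead of rescanning billions of raw events.
    week_sketch = spark.sql("""
        SELECT approx_top_k_combine(sketch) AS sketch
        FROM hourly_query_sketches
        WHERE hour >= current_timestamp() - INTERVAL 7 DAYS
    """)
    # Step 3, extract half: project the ranked items out of `sketch` with
    # the runtime's extraction function (unnamed in the post).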

The API shape in Databricks

  • approx_top_k_accumulate(col) — build a top-K sketch over events.
  • approx_top_k_combine(sketch_a, sketch_b) — merge two sketches.
  • (Extraction function to project the top-K items from a merged sketch — not named verbatim in the post, but implied by the merge + extract workflow.)
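
Composed, the implied read-side query is combine-then-extract. In this last fragment, hourly_query_sketches is the table from the workflow sketch above, and approx_top_k_estimate is an assumed name standing in for the unnamed extraction function, used only to make the fragment concrete.

    # `approx_top_k_estimate` is an ASSUMED name (the post never names the
    # extraction function); substitute whatever your runtime actually ships.
    trending = spark.sql("""
        SELECT approx_top_k_estimate(approx_top_k_combine(sketch)) AS top_queries
        FROM hourly_query_sketches
        WHERE hour >= current_timestamp() - INTERVAL 7 DAYS
    """)
    trending.show(truncate=False)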
