
CONCEPT

Approximate top-K sketch

Definition

An approximate top-K sketch is a streaming probabilistic data structure that tracks the K most-frequent items in a stream of events, using bounded memory independent of the number of distinct items, and supports merging across partitions and time windows. It belongs to the Apache DataSketches family and is sometimes called a heavy-hitter or frequent-items sketch.

Given a high-cardinality event stream (search logs, clickstream, URLs):

  • Exact top-K requires counting every distinct value, storing all those counts, shuffling them across the cluster, and sorting globally. For billion-event streams, "this is a batch job, not a live query".
  • Approximate top-K tracks the K most-frequent items in bounded memory with a well-defined tradeoff: rare items may be dropped, but the reported top-K is within bounded error of the true top-K.
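
To make the bounded-memory claim concrete, here is a minimal Python sketch in the spirit of the Misra-Gries heavy-hitters algorithm, one classical basis for this sketch family. The class, its names, and its exact eviction rule are illustrative assumptions, not the DataSketches or Databricks implementation.

    class TopKSketch:
        """Minimal Misra-Gries-style heavy-hitters counter (illustrative only)."""

        def __init__(self, capacity: int):
            # At most `capacity` counters are live at once, no matter how many
            # distinct items the stream contains.
            self.capacity = capacity
            self.counters: dict[str, int] = {}

        def update(self, item: str) -> None:
            if item in self.counters:
                self.counters[item] += 1
            elif len(self.counters) < self.capacity:
                self.counters[item] = 1
            else:
                # No free slot: decrement everything and drop zeros. This is
                # where long-tail items fall off the sketch's retention.
                for key in list(self.counters):
                    self.counters[key] -= 1
                    if self.counters[key] == 0:
                        del self.counters[key]

        def merge(self, other: "TopKSketch") -> "TopKSketch":
            # Merging is associative: sum counters, then shrink back to
            # capacity by offsetting with the (capacity+1)-th largest count.
            combined = dict(self.counters)
            for key, count in other.counters.items():
                combined[key] = combined.get(key, 0) + count
            if len(combined) > self.capacity:
                offset = sorted(combined.values(), reverse=True)[self.capacity]
                combined = {k: v - offset for k, v in combined.items() if v > offset}
            merged = TopKSketch(self.capacity)
            merged.counters = combined
            return merged

        def top_k(self, k: int) -> list[tuple[str, int]]:
            # Reported counts are lower bounds within bounded error of the truth.
            return sorted(self.counters.items(), key=lambda kv: -kv[1])[:k]

The merge method is the property everything below relies on: per-window sketches can be reduced pairwise, in any order, without revisiting raw events.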

The accepted tradeoff

Databricks' canonical framing (Source: sources/2026-04-29-databricks-approximate-answers-exact-decisions-new-sketch-functions-for-analytics):

"Rare items might be dropped, which is fine, because that's not what you're looking for."

A top-K sketch is explicitly not a substitute for exact frequency counts — it's a substitute for the top-K slice of the frequency distribution. Long-tail items fall off the sketch's retention, which is exactly the intended behaviour when the question is "what's trending right now?".

Canonical use cases

  • Trending dashboards. "Trending this week" becomes a merge of 168 pre-computed hourly sketches rather than a scan of billions of raw events. (Source: Databricks 2026-04-29 post.)
  • Streaming leaderboards. In Structured Streaming, merge each micro-batch's sketch into a running total and display the results in real time: a live leaderboard built from incremental micro-batch computation (see the sketch after this list).
  • Traffic analysis. Top search queries, top URLs, top user agents — any per-key frequency ranking over a high-cardinality event stream.
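
A hedged sketch of that streaming-leaderboard pattern in PySpark. The table names, the query column, the checkpoint path, and the use of approx_top_k_combine as an aggregate over many rows (the post shows the two-sketch form; associativity extends it to N inputs) are all assumptions.

    from pyspark.sql import functions as F

    # `spark` comes from the notebook/session context; `search_events` and
    # its `query` column are assumed example names.
    events = spark.readStream.table("search_events")

    def merge_batch(batch_df, batch_id):
        # Sketch this micro-batch alone.
        (batch_df
            .agg(F.expr("approx_top_k_accumulate(query)").alias("sketch"))
            .createOrReplaceTempView("batch_sketch"))
        # Fold it into a one-row running-total Delta table (assumed to be
        # created beforehand); Delta's snapshot isolation makes the
        # self-referential overwrite safe.
        spark.sql("""
            INSERT OVERWRITE leaderboard_sketch
            SELECT approx_top_k_combine(sketch) AS sketch FROM (
                SELECT sketch FROM leaderboard_sketch
                UNION ALL
                SELECT sketch FROM batch_sketch
            )
        """)

    (events.writeStream
        .foreachBatch(merge_batch)
        .option("checkpointLocation", "/checkpoints/leaderboard")  # assumed path
        .start())

Each trigger touches only the micro-batch's own rows plus one stored sketch, which is what keeps the leaderboard live.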

Workflow: per-window sketch, merge on read

Same pattern as other DataSketches families (see patterns/precomputed-sketch-column-in-delta-table):

  1. Build a top-K sketch per hour (or per micro-batch for streaming) during ingest.
  2. Store the sketch as a BLOB column in a Delta table, or accumulate it in streaming state.
  3. At read time, call approx_top_k_combine to merge the relevant hourly sketches and extract the top-K items.

The merge operation is associative — cluster-wide top-K across any time window is a reduce over per-window sketches, not a scan over raw events.
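
A hedged end-to-end rendering of steps 1-3 in PySpark-driven SQL. Table and column names are assumed, approx_top_k_combine again appears in aggregate form, and the final extraction call is deliberately left as a comment because the post does not name it.

    # Steps 1-2: one sketch per hour at ingest, stored as a BLOB column
    # (the precomputed-sketch-column-in-delta-table pattern).
    spark.sql("""
        CREATE OR REPLACE TABLE hourly_query_sketches AS
        SELECT
            date_trunc('HOUR', event_ts)   AS hour,
            approx_top_k_accumulate(query) AS sketch
        FROM search_events
        GROUP BY date_trunc('HOUR', event_ts)
    """)

    # Step 3, merge half: "trending this week" reduces ~168 hourly sketches
    # instead of rescanning billions of raw events.
    week_sketch = spark.sql("""
        SELECT approx_top_k_combine(sketch) AS sketch
        FROM hourly_query_sketches
        WHERE hour >= current_timestamp() - INTERVAL 7 DAYS
    """)
    # Step 3, extract half: project the ranked items out of `sketch` with
    # the runtime's extraction function (unnamed in the post).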

The API shape in Databricks

  • approx_top_k_accumulate(col) — build a top-K sketch over events.
  • approx_top_k_combine(sketch_a, sketch_b) — merge two sketches.
  • (Extraction function to project the top-K items from a merged sketch — not named verbatim in the post, but implied by the merge + extract workflow.)
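
Composed, the implied read-side query is combine-then-extract. In this last fragment, hourly_query_sketches is the table from the workflow sketch above, and approx_top_k_estimate is an assumed name standing in for the unnamed extraction function, used only to make the fragment concrete.

    # `approx_top_k_estimate` is an ASSUMED name (the post never names the
    # extraction function); substitute whatever your runtime actually ships.
    trending = spark.sql("""
        SELECT approx_top_k_estimate(approx_top_k_combine(sketch)) AS top_queries
        FROM hourly_query_sketches
        WHERE hour >= current_timestamp() - INTERVAL 7 DAYS
    """)
    trending.show(truncate=False)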
